Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a fully managed service that builds upon Apache Airflow, offering its benefits while eliminating the need for you to set up, operate, and maintain the underlying infrastructure, reducing operational overhead while increasing security and resilience.
Today, we're excited to announce an enhancement to the Amazon MWAA integration with the Airflow REST API. This improvement streamlines access to and management of your Airflow environments and their integration with external systems, and allows you to interact with your workflows programmatically. The Airflow REST API supports a wide range of use cases, from centralizing and automating administrative tasks to building event-driven, data-aware data pipelines.
In this post, we discuss the enhancement and present several use cases that it unlocks for your Amazon MWAA environment.
Airflow REST API
The Airflow REST API is a programmatic interface that allows you to interact with Airflow's core functionality. It's a collection of HTTP endpoints for performing operations such as invoking Directed Acyclic Graphs (DAGs), checking task statuses, retrieving metadata about workflows, managing connections and variables, and even initiating dataset-related events, without directly accessing the Airflow web interface or command line tools.
Before today, Amazon MWAA provided the foundation for interacting with the Airflow REST API. Though functional, the process of obtaining and managing access tokens and session cookies added complexity to the workflow. Amazon MWAA now supports a simplified mechanism for interacting with the Airflow REST API using AWS credentials, significantly reducing complexity and improving overall usability.
Enhancement overview
The new InvokeRestApi capability allows you to run Airflow REST API requests with a valid SigV4 signature using your existing AWS credentials. This feature is now available to all Amazon MWAA environments (version 2.4.3 and later) in supported Amazon MWAA AWS Regions. By acting as an intermediary, this REST API processes requests on behalf of users, requiring only the environment name and API request payload as inputs.
Integrating with the Airflow REST API through the enhanced Amazon MWAA API provides several key benefits:
- Simplified integration – The new InvokeRestApi capability in Amazon MWAA removes the complexity of managing access tokens and session cookies, making it straightforward to interact with the Airflow REST API.
- Improved usability – By acting as an intermediary, the enhanced API delivers Airflow REST API execution results directly to the client, reducing complexity and improving overall usability.
- Automated administration – The simplified REST API access enables you to automate various administrative and management tasks, such as managing Airflow variables, connections, slot pools, and more.
- Event-driven architectures – The enhanced API facilitates seamless integration with external events, enabling you to trigger Airflow DAGs based on those events. This supports the growing emphasis on event-driven data pipelines.
- Data-aware scheduling – Combined with the dataset-based scheduling feature in Airflow, the enhanced API enables the Amazon MWAA environment to manage the incoming workload and scale resources accordingly, improving the overall reliability and efficiency of event-driven pipelines.
In the following sections, we demonstrate how to use the enhanced API in various use cases.
How to use the enhanced Amazon MWAA API
The following code snippet shows the general request format for the enhanced REST API:
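A representative request payload is sketched below; the environment name, path, and parameter values are placeholders, and the QueryParameters and Body fields can be omitted when not needed:

```json
{
  "Name": "my-mwaa-environment",
  "Path": "/variables",
  "Method": "GET",
  "QueryParameters": {"limit": 25},
  "Body": {}
}
```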
The Name of the Amazon MWAA environment, the Path of the Airflow REST API endpoint to be called, and the HTTP Method to use are the required parameters, while QueryParameters and Body are optional and can be used as needed in the API calls.
The following code snippet shows the general response format:
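A representative response is sketched below; the status code and payload values are illustrative:

```json
{
  "RestApiStatusCode": 200,
  "RestApiResponse": {
    "key": "test-variable",
    "value": "123",
    "description": "a sample variable"
  }
}
```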
The RestApiStatusCode represents the HTTP status code returned by the Airflow REST API call, and the RestApiResponse contains the response payload from the Airflow REST API.
The following sample code snippet showcases how to update the description field of an Airflow variable using the enhanced integration. The call uses the AWS Python SDK to invoke the Airflow REST API for the task.
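A minimal sketch of such a call follows; the function and variable names are illustrative, and boto3 is imported lazily inside the function so the payload builder can be exercised without AWS credentials:

```python
def build_variable_patch(key: str, value: str, description: str) -> dict:
    """Build the PATCH body for the Airflow /variables/{key} endpoint."""
    return {"key": key, "value": value, "description": description}


def update_variable_description(env_name: str, key: str, value: str,
                                description: str) -> dict:
    """Update an Airflow variable's description through the enhanced MWAA API."""
    import boto3  # imported lazily so build_variable_patch stays testable offline

    mwaa = boto3.client("mwaa")
    response = mwaa.invoke_rest_api(
        Name=env_name,                       # the MWAA environment name
        Path=f"/variables/{key}",            # Airflow REST API endpoint
        Method="PATCH",
        Body=build_variable_patch(key, value, description),
    )
    return response["RestApiResponse"]


# Example (placeholder names):
#   update_variable_description("my-mwaa-environment", "test", "123", "updated")
```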
To make the invoke_rest_api SDK call, the calling client should have an AWS Identity and Access Management (IAM) principal with the airflow:InvokeRestAPI permission attached for the requisite environment. The permission can be scoped to specific Airflow roles (Admin, Op, User, Viewer, or Public) to control access levels.
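A sketch of such a policy statement follows; the Region, account ID, and environment name are placeholders, and the resource is scoped here to the Op role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "airflow:InvokeRestApi",
      "Resource": "arn:aws:airflow:us-east-1:111122223333:role/my-mwaa-environment/Op"
    }
  ]
}
```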
This simple yet powerful REST API supports various use cases for your Amazon MWAA environments. Let's review some important ones in the following sections.
Automate administration and management tasks
Prior to this launch, to automate the configuration and setup of resources such as variables, connections, slot pools, and more, you had to develop lengthy boilerplate code to make API requests to the Amazon MWAA web servers, handling cookie and session management in the process. You can simplify this automation with the new enhanced REST API support.
For this example, let's assume you want to automate maintaining your Amazon MWAA environment variables. You need to perform API operations such as create, read, update, and delete on Airflow variables to accomplish this task. The following is a simple Python client to do so (mwaa_variables_client.py):
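The sketch below is a minimal stand-in for that client; the class and method names are illustrative, and the boto3 client is injectable so the request-building logic can be exercised without AWS credentials:

```python
class MwaaVariablesClient:
    """Thin CRUD wrapper over Airflow's /variables endpoints via InvokeRestApi."""

    def __init__(self, env_name, mwaa_client=None):
        self.env_name = env_name
        if mwaa_client is None:
            # Assumes a recent boto3 with MWAA InvokeRestApi support.
            import boto3
            mwaa_client = boto3.client("mwaa")
        self.mwaa = mwaa_client

    def _invoke(self, method, path, body=None):
        # Build the InvokeRestApi request and unwrap the Airflow response.
        kwargs = {"Name": self.env_name, "Path": path, "Method": method}
        if body is not None:
            kwargs["Body"] = body
        return self.mwaa.invoke_rest_api(**kwargs)["RestApiResponse"]

    def create(self, key, value, description=None):
        return self._invoke("POST", "/variables",
                            {"key": key, "value": value, "description": description})

    def read(self, key):
        return self._invoke("GET", f"/variables/{key}")

    def update(self, key, value, description=None):
        return self._invoke("PATCH", f"/variables/{key}",
                            {"key": key, "value": value, "description": description})

    def delete(self, key):
        return self._invoke("DELETE", f"/variables/{key}")
```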
Assuming that you have configured your terminal with appropriate AWS credentials, you can run the preceding Python script to achieve the following results:
Let's further explore other useful use cases.
Build event-driven data pipelines
The Airflow community has been actively innovating to enhance the platform's data awareness, enabling you to build more dynamic and responsive workflows. When we announced support for version 2.9.2 in Amazon MWAA, we introduced capabilities that allow pipelines to react to changes in datasets, both within Airflow environments and in external systems. The new simplified integration with the Airflow REST API makes the implementation of data-driven pipelines more straightforward.
Consider a use case where you need to run a pipeline that uses input from an external event. The following sample DAG runs a bash command supplied as a parameter (any_bash_command.py):
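A minimal sketch of such a DAG follows; the default command is an assumption, and the Airflow import is guarded so the snippet can be inspected without Airflow installed:

```python
DEFAULT_COMMAND = "echo no command supplied"


def resolve_command(conf):
    """Plain-Python mirror of the Jinja expression used below, for reference."""
    return (conf or {}).get("command", DEFAULT_COMMAND)


try:
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="any_bash_command",
        schedule=None,  # triggered externally, never on a schedule
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="trigger_bash_command",
            # dag_run.conf carries the command supplied by the caller.
            bash_command="{{ dag_run.conf.get('command', 'echo no command supplied') }}",
        )
except ImportError:
    # Allows reading/testing this sketch outside an Airflow environment.
    pass
```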
With the help of the enhanced REST API, you can create a client that can invoke this DAG, supplying the bash command of your choice as follows (mwaa_dag_run_client.py):
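The sketch below triggers a DAG run with a conf payload; the environment and DAG names are placeholders, and boto3 is imported lazily so the request builder can be exercised offline:

```python
def build_dag_run_request(env_name, dag_id, command):
    """Build the InvokeRestApi request that creates a DAG run with a conf payload."""
    return {
        "Name": env_name,
        "Path": f"/dags/{dag_id}/dagRuns",   # Airflow's create-DAG-run endpoint
        "Method": "POST",
        "Body": {"conf": {"command": command}},
    }


def trigger_dag(env_name, dag_id, command):
    import boto3  # lazy import keeps build_dag_run_request testable offline

    mwaa = boto3.client("mwaa")
    response = mwaa.invoke_rest_api(**build_dag_run_request(env_name, dag_id, command))
    return response["RestApiResponse"]


# Example (placeholder names):
#   trigger_dag("my-mwaa-environment", "any_bash_command", "echo 'hello from MWAA'")
```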
The following snippet shows a sample run of the script:
On the Airflow UI, the trigger_bash_command task shows the following execution log:
You can further develop this example into more useful event-driven architectures. Let's expand the use case to run your data pipeline and perform extract, transform, and load (ETL) jobs when a new file lands in an Amazon Simple Storage Service (Amazon S3) bucket in your data lake. The following diagram illustrates one architectural approach.
In the context of invoking a DAG through an external input, the AWS Lambda function would have no knowledge of how busy the Amazon MWAA web server is, potentially leading to the function overwhelming the web server by processing a large number of files in a short span of time.
One way to regulate the file processing throughput would be to introduce an Amazon Simple Queue Service (Amazon SQS) queue between the S3 bucket and the Lambda function, which can help with rate limiting the API requests to the web server. You can achieve this by configuring maximum concurrency for Lambda for the SQS event source. However, the Lambda function would still be unaware of the processing capacity available in the Amazon MWAA environment to run the DAGs.
In addition to the SQS queue, to help the Amazon MWAA environment manage the load natively, you can use Airflow's data-aware scheduling feature with datasets. This approach involves using the enhanced Amazon MWAA REST API to create dataset events, which are then used by the Airflow scheduler to schedule the DAG natively. This way, the Amazon MWAA environment can effectively batch the dataset events and scale resources based on the load. Let's explore this approach in more detail.
Configure data-aware scheduling
Consider the following DAG that showcases a framework for an ETL pipeline (data_aware_pipeline.py). It uses a dataset for scheduling.
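A minimal sketch of such a pipeline follows (requires Airflow 2.9+); the task bodies are illustrative, and the Airflow imports are guarded so the helper can be exercised without Airflow installed:

```python
def collect_uris(triggering_dataset_events):
    """Flatten the dataset events' extra payloads into a list of S3 URIs."""
    return [
        event.extra.get("uri")
        for events in (triggering_dataset_events or {}).values()
        for event in events
    ]


try:
    from datetime import datetime

    from airflow.datasets import Dataset
    from airflow.decorators import dag, task

    # The DAG is scheduled on this dataset rather than on a time interval.
    datalake = Dataset("datalake")

    @dag(schedule=[datalake], start_date=datetime(2024, 1, 1), catchup=False)
    def data_aware_pipeline():
        @task
        def get_resources(triggering_dataset_events=None):
            # Airflow injects the triggering dataset events into the context.
            return collect_uris(triggering_dataset_events)

        @task
        def extract(resources):
            print(f"Extracting from: {resources}")  # placeholder ETL step

        extract(get_resources())

    data_aware_pipeline()
except ImportError:
    # Allows reading/testing this sketch outside an Airflow environment.
    pass
```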
In the preceding code snippet, a Dataset object called datalake is used to schedule the DAG. The get_resources function extracts the extra information that contains the locations of the newly added files in the S3 data lake. Upon receiving dataset events, the Amazon MWAA environment batches the dataset events based on the load and schedules the DAG to handle them appropriately. The modified architecture to support data-aware scheduling is presented below.
The following is a simplified client that can create a dataset event through the enhanced REST API (mwaa_dataset_client.py):
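A minimal sketch follows, posting to Airflow's dataset-events endpoint; the environment name and dataset URI are placeholders, and boto3 is imported lazily so the request builder can be exercised offline:

```python
def build_dataset_event_request(env_name, dataset_uri, extra):
    """Build the InvokeRestApi request for Airflow's POST /datasets/events."""
    return {
        "Name": env_name,
        "Path": "/datasets/events",
        "Method": "POST",
        "Body": {"dataset_uri": dataset_uri, "extra": extra},
    }


def create_dataset_event(env_name, dataset_uri, extra):
    import boto3  # lazy import keeps build_dataset_event_request testable offline

    mwaa = boto3.client("mwaa")
    response = mwaa.invoke_rest_api(
        **build_dataset_event_request(env_name, dataset_uri, extra)
    )
    return response["RestApiResponse"]


# Example (placeholder names):
#   create_dataset_event("my-mwaa-environment", "datalake",
#                        {"uri": "s3://my-bucket/data/file.csv"})
```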
The following is a code snippet for the Lambda function in the preceding architecture to generate the dataset event, assuming the function is configured to handle one S3 PUT event at a time (dataset_event_lambda.py):
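The sketch below turns an S3 PUT event into a dataset event; the environment-variable name and fallback environment name are assumptions, and boto3 is imported lazily so the event parser can be exercised offline:

```python
import os


def s3_uri_from_event(event):
    """Extract the s3:// URI of the newly added object from an S3 PUT event."""
    record = event["Records"][0]  # assumes one record per invocation
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return f"s3://{bucket}/{key}"


def lambda_handler(event, context):
    import boto3  # lazy import keeps s3_uri_from_event testable offline

    uri = s3_uri_from_event(event)
    mwaa = boto3.client("mwaa")
    response = mwaa.invoke_rest_api(
        Name=os.environ.get("MWAA_ENV_NAME", "my-mwaa-environment"),
        Path="/datasets/events",
        Method="POST",
        # The pipeline schedules on the "datalake" dataset; the file
        # location travels in "extra" for the DAG to pick up.
        Body={"dataset_uri": "datalake", "extra": {"uri": uri}},
    )
    return response["RestApiStatusCode"]
```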
As new files get dropped into the S3 bucket, the Lambda function will generate one dataset event per file, passing in the Amazon S3 location of the newly added file. The Amazon MWAA environment will schedule the ETL pipeline upon receiving the dataset events. The following diagram illustrates a sample run of the ETL pipeline on the Airflow UI.
The following snippet shows the execution log of the extract task from the pipeline. The log shows how the Airflow scheduler batched three dataset events together to handle the load.
In this way, you can use the enhanced REST API to create data-aware, event-driven pipelines.
Considerations
When implementing solutions using the enhanced Amazon MWAA REST API, it's important to consider the following:
- IAM permissions – Make sure that the IAM principal making the invoke_rest_api SDK call has the airflow:InvokeRestAPI permission on the Amazon MWAA resource. To control access levels, the permission can be scoped to specific Airflow roles (Admin, Op, User, Viewer, or Public).
- Error handling – Implement robust error handling mechanisms to handle the various HTTP status codes and error responses from the Airflow REST API.
- Monitoring and logging – Set up appropriate monitoring and logging to track the performance and reliability of your API-based integrations and data pipelines.
- Versioning and compatibility – Monitor the versioning of the Airflow REST API and the Amazon MWAA service to make sure your integrations remain compatible with any future changes.
- Security and compliance – Adhere to your organization's security and compliance requirements when integrating external systems with Amazon MWAA and handling sensitive data.
You can start using the simplified integration with the Airflow REST API in your Amazon MWAA environments with Airflow version 2.4.3 or greater, in all currently supported Regions.
Conclusion
The enhanced integration between Amazon MWAA and the Airflow REST API represents a significant improvement in the ease of interacting with Airflow's core functionality. This new capability opens up a wide range of use cases, from centralizing and automating administrative tasks and improving overall usability, to building event-driven, data-aware data pipelines.
As you explore this new feature, consider the various use cases and best practices outlined in this post. By using the new InvokeRestApi capability, you can streamline your data management processes, enhance operational efficiency, and drive greater value from your data-driven systems.
About the Authors
Chandan Rupakheti is a Senior Solutions Architect at AWS. His main focus at AWS lies in the intersection of analytics, serverless, and AdTech services. He is a passionate technical leader, researcher, and mentor with a knack for building innovative solutions in the cloud. Outside of his professional life, he loves spending time with his family and friends, and listening to and playing music.
Hernan Garcia is a Senior Solutions Architect at AWS based out of Amsterdam. He has worked in the financial services industry since 2018, specializing in application modernization and supporting customers in their adoption of the cloud with a focus on serverless technologies.