
Build end-to-end Apache Spark pipelines with Amazon MWAA, Batch Processing Gateway, and Amazon EMR on EKS clusters


Apache Spark workloads running on Amazon EMR on EKS form the foundation of many modern data platforms. EMR on EKS offers benefits by providing managed Spark that integrates seamlessly with other AWS services and your organization's existing Kubernetes-based deployment patterns.

Data platforms processing large-scale data volumes often require multiple EMR on EKS clusters. In the post Use Batch Processing Gateway to automate job management in multi-cluster Amazon EMR on EKS environments, we introduced Batch Processing Gateway (BPG) as a solution for managing Spark workloads across these clusters. Although BPG provides foundational functionality to distribute workloads and support routing for Spark jobs in multi-cluster environments, enterprise data platforms require additional features for a comprehensive data processing pipeline.

This post shows how to enhance the multi-cluster solution by integrating Amazon Managed Workflows for Apache Airflow (Amazon MWAA) with BPG. By using Amazon MWAA, we add job scheduling and orchestration capabilities, enabling you to build a comprehensive end-to-end Spark-based data processing pipeline.

Overview of solution

Consider HealthTech Analytics, a healthcare analytics company managing two distinct data processing workloads. Their Medical Insights Data Science team processes sensitive patient outcome data requiring HIPAA compliance and dedicated resources, and their Digital Analytics team handles website interaction data with more flexible requirements. As their operation grows, they face increasing challenges in managing these diverse workloads efficiently.

The company needs to maintain strict separation between protected health information (PHI) and non-PHI data processing, while also addressing different cost center requirements. The Medical Insights Data Science team runs critical end-of-day batch processes that need guaranteed resources, while the Digital Analytics team can use cost-optimized Spot Instances for their variable workloads. Additionally, data scientists from both teams require environments for experimentation and prototyping as needed.

This scenario presents an ideal use case for implementing a data pipeline using Amazon MWAA, BPG, and multiple EMR on EKS clusters. The solution needs to route different Spark workloads to appropriate clusters based on security requirements and cost profiles, while maintaining the required isolation and compliance controls. To manage such an environment effectively, we need a solution that maintains clear separation between application and infrastructure management concerns and stitches together multiple components into a robust pipeline.

Our solution consists of integrating Amazon MWAA with BPG through an Airflow custom operator for BPG called BPGOperator. This operator encapsulates the infrastructure management logic needed to interact with BPG. BPGOperator provides a clean interface for job submission through Amazon MWAA. When executed, the operator communicates with BPG, which then routes the Spark workloads to available EMR on EKS clusters based on predefined routing rules.

The following architecture diagram illustrates the components and their interactions.

Image showing the architecture of the end-to-end pipeline

The solution works through the following steps:

  • Amazon MWAA executes scheduled DAGs using BPGOperator. Data engineers create DAGs using this operator, requiring only the Spark application configuration file and basic scheduling parameters.
  • BPGOperator authenticates and submits jobs to the BPG submit endpoint POST:/skatev2/spark. It handles all HTTP communication details, manages authentication tokens, and provides secure transmission of job configurations.
  • BPG routes submitted jobs to EMR on EKS clusters based on predefined routing rules. These routing rules are managed centrally through the BPG configuration, allowing rules-based distribution of workloads across multiple clusters.
  • BPGOperator monitors job status, captures logs, and handles execution retries. It polls the BPG job status endpoint GET:/skatev2/spark/{subID}/status and streams logs to Airflow by polling the GET:/skatev2/log endpoint every second. The BPG log endpoint retrieves the most current log information directly from the Spark driver pod.
  • The DAG execution progresses to subsequent tasks based on job completion status and defined dependencies. BPGOperator communicates the job status through Airflow's built-in task communication system, enabling complex workflow orchestration.

Refer to the BPG REST API interface documentation for additional details.
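
To make the flow concrete, here is a minimal Python sketch that exercises the submit, status, and log endpoints described above using the requests library. The host, credentials, payload fields, and response field names (subId, applicationState) are illustrative assumptions, not the authoritative contract; consult the BPG REST API documentation for the exact schema.

import time
import requests

# Illustrative values; replace with your BPG endpoint and credentials.
BPG_URL = "http://<BPG_LOAD_BALANCER_DNS>:8080"
AUTH = ("admin", "admin")  # mirrors the default credentials used in the health check later in this post

# 1. Submit a Spark job (payload fields are assumptions for illustration).
job = {
    "applicationName": "spark-pi",
    "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples.jar",
}
resp = requests.post(f"{BPG_URL}/skatev2/spark", json=job, auth=AUTH, timeout=30)
resp.raise_for_status()
sub_id = resp.json()["subId"]  # response field name assumed

# 2. Poll the job status until it reaches a terminal state.
while True:
    status = requests.get(f"{BPG_URL}/skatev2/spark/{sub_id}/status", auth=AUTH, timeout=30).json()
    if status.get("applicationState") in ("COMPLETED", "FAILED"):
        break
    time.sleep(1)

# 3. Fetch the latest driver logs.
logs = requests.get(f"{BPG_URL}/skatev2/log", params={"subId": sub_id}, auth=AUTH, timeout=30)
print(logs.text)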

This architecture provides several key benefits:

  • Separation of responsibilities – Data engineering and platform engineering teams in enterprise organizations typically maintain distinct responsibilities. The modular design in this solution enables platform engineers to configure BPGOperator and manage EMR on EKS clusters, while data engineers maintain DAGs.
  • Centralized code management – BPGOperator encapsulates all the core functionality required for Amazon MWAA DAGs to submit Spark jobs through BPG into a single, reusable Python module. This centralization minimizes code duplication across DAGs and improves maintainability by providing a standardized interface for job submissions.

Airflow custom operator for BPG

An Airflow operator is a template for a predefined task that you can define declaratively inside your DAGs. Airflow provides several built-in operators such as BashOperator, which executes bash commands, PythonOperator, which executes Python functions, and EmrContainerOperator, which submits new jobs to an EMR on EKS cluster. However, no built-in operator implements all the steps required for the Amazon MWAA integration with BPG.

Airflow allows you to create new operators to suit your specific requirements. This operator type is known as a custom operator. A custom operator encapsulates the custom infrastructure-related logic in a single, maintainable component. Custom operators are created by extending the airflow.models.baseoperator.BaseOperator class. We have developed and open sourced an Airflow custom operator for BPG called BPGOperator, which implements the steps required to provide a seamless integration of Amazon MWAA with BPG.
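
As a minimal sketch of the pattern (not the actual BPGOperator source), a custom operator that extends BaseOperator looks like the following; the class and field names here are illustrative.

from airflow.models.baseoperator import BaseOperator

class MyGatewayOperator(BaseOperator):
    """Illustrative custom operator skeleton; not the actual BPGOperator source."""

    # Fields that Airflow renders with Jinja templating before execution.
    template_fields = ("application_file",)

    def __init__(self, application_file, connection_id, **kwargs):
        super().__init__(**kwargs)
        self.application_file = application_file
        self.connection_id = connection_id

    def execute(self, context):
        # A real implementation would build the payload, submit the job over
        # HTTP, poll for completion, and surface logs; here we only log the intent.
        self.log.info("Submitting %s via connection %s",
                      self.application_file, self.connection_id)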

The following class diagram provides a detailed view of the BPGOperator implementation.

Image showing class diagram for BPGOperator implementation

When a DAG includes a BPGOperator task, the Amazon MWAA environment triggers the operator to send a job request to BPG. The operator typically performs the following steps:

  • Initialize job – BPGOperator prepares the job payload, including input parameters, configurations, connection details, and other metadata required by BPG.
  • Submit job – BPGOperator handles HTTP POST requests to submit jobs to BPG endpoints with the provided configurations.
  • Monitor job execution – BPGOperator checks the job status, polling BPG until the job completes successfully or fails. The monitoring process includes handling various job states, managing timeout scenarios, and responding to errors that occur during job execution (a simplified polling loop is sketched after this list).
  • Handle job completion – Upon completion, BPGOperator captures the job results, logs relevant details, and can trigger downstream tasks based on the execution outcome.
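
The following minimal Python sketch shows one way such a monitor loop can be structured. It is illustrative only, not the actual BPGOperator implementation; the get_status callable and the state names are assumptions.

import time

# Hypothetical terminal states; the actual BPG state names may differ.
SUCCESS_STATES = {"COMPLETED"}
FAILURE_STATES = {"FAILED", "SUBMISSION_FAILED"}

def wait_for_completion(get_status, sub_id, poll_interval=1.0, timeout=3600.0):
    """Poll get_status(sub_id) until a terminal state or the timeout is reached."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_status(sub_id)
        if state in SUCCESS_STATES:
            return state
        if state in FAILURE_STATES:
            raise RuntimeError(f"Job {sub_id} failed with state {state}")
        time.sleep(poll_interval)
    raise TimeoutError(f"Job {sub_id} did not complete within {timeout} seconds")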

The following sequence diagram illustrates the interaction flow between the Airflow DAG, BPGOperator, and BPG.

Image showing sequence diagram for the interaction between the Airflow DAG, BPGOperator, and BPG.

Deploying the solution

In the remainder of this post, you will implement the end-to-end pipeline to run Spark jobs on multiple EMR on EKS clusters. You will begin by deploying the common components that serve as the foundation for building the pipelines. Next, you will deploy and configure BPG on an EKS cluster, followed by deploying and configuring BPGOperator on Amazon MWAA. Finally, you will execute Spark jobs on multiple EMR on EKS clusters from Amazon MWAA.

To streamline the setup process, we have automated the deployment of all infrastructure components required for this post, so you can focus on the essential aspects of job submission to build an end-to-end pipeline. We provide detailed information to help you understand each step, simplifying the setup while preserving the learning experience.

To showcase the solution, you will create three clusters and an Amazon MWAA environment:

  • Two EMR on EKS clusters: analytics-cluster and datascience-cluster
  • An EKS cluster: gateway-cluster
  • An Amazon MWAA environment: airflow-environment

analytics-cluster and datascience-cluster serve as data processing clusters that run Spark workloads, gateway-cluster hosts BPG, and airflow-environment hosts Airflow for job orchestration and scheduling.

You can find the code base in the GitHub repo.

Prerequisites

Before you deploy this solution, make sure that the following prerequisites are in place:

Set up common infrastructure

This step handles the setup of networking infrastructure, including a virtual private cloud (VPC) and subnets, along with the configuration of AWS Identity and Access Management (IAM) roles, Amazon Simple Storage Service (Amazon S3) storage, an Amazon Elastic Container Registry (Amazon ECR) repository for BPG images, an Amazon Aurora PostgreSQL-Compatible Edition database, the Amazon MWAA environment, and both the EKS and EMR on EKS clusters with a preconfigured Spark operator. With this infrastructure automatically provisioned, you can focus on the subsequent steps without getting caught up in basic setup tasks.

  1. Clone the repository to your local machine and set the two environment variables. Replace <AWS_REGION> with the AWS Region where you want to deploy these resources.
    git clone https://github.com/aws-samples/sample-mwaa-bpg-emr-on-eks-spark-pipeline.git
    cd sample-mwaa-bpg-emr-on-eks-spark-pipeline

    export REPO_DIR=$(pwd)
    export AWS_REGION=<AWS_REGION>

  2. Execute the following script to create the common infrastructure:
    cd ${REPO_DIR}/infra
    ./setup.sh

  3. To verify successful infrastructure deployment, navigate to the AWS CloudFormation console, open your stack, and check the Events, Resources, and Outputs tabs for completion status, details, and the list of resources created.

You have completed the setup of the common components that serve as the foundation for the rest of the implementation.

Set up Batch Processing Gateway

This section builds the Docker image for BPG, deploys the Helm chart on the gateway-cluster EKS cluster, and exposes the BPG endpoint using a Kubernetes service of type LoadBalancer. Complete the following steps:

  1. Deploy BPG on the gateway-cluster EKS cluster:
    cd ${REPO_DIR}/infra/bpg
    ./configure_bpg.sh

  2. Verify the deployment by listing the pods and viewing the pod logs:
    kubectl get pods --namespace bpg
    kubectl logs <BPG-PODNAME> --namespace bpg

    Review the logs and confirm there are no errors or exceptions.

  3. Exec into the BPG pod and verify the health check:
    kubectl exec -it <BPG-PODNAME> -n bpg -- bash
    curl -u admin:admin localhost:8080/skatev2/healthcheck/status

    The healthcheck API should return a successful response of {"status":"OK"}, confirming successful deployment of BPG on the gateway-cluster EKS cluster.

We have successfully configured BPG on gateway-cluster and set up EMR on EKS for both datascience-cluster and analytics-cluster. This is where we left off in the previous blog post. In the next steps, we will configure Amazon MWAA with BPGOperator, and then write and submit DAGs to demonstrate an end-to-end Spark-based data pipeline.

Configure the Airflow operator for BPG on Amazon MWAA

This section configures the BPGOperator plugin on the Amazon MWAA environment airflow-environment. Complete the following steps:

  1. Configure BPGOperator on Amazon MWAA:
    cd ${REPO_DIR}/bpg_operator
    ./configure_bpg_operator.sh

  2. On the Amazon MWAA console, navigate to the airflow-environment environment.
  3. Choose Open Airflow UI, and in the Airflow UI, choose the Admin dropdown menu and choose Plugins.
    You will see the BPGOperator plugin listed in the Airflow UI, registered through a plugin definition along the lines of the sketch that follows.
    Image showing BPGOperator plugin listed in the Airflow UI
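
For context, the entry on the Plugins page comes from an AirflowPlugin subclass shipped in the plugins package. A minimal, hypothetical definition is shown below; note that in Airflow 2.x the operator class itself is imported directly by DAGs from the plugins directory rather than registered through the plugin, and the actual packaging in the repository may differ.

from airflow.plugins_manager import AirflowPlugin

class BPGOperatorPlugin(AirflowPlugin):
    """Minimal plugin definition; 'name' is what appears under Admin > Plugins."""
    name = "bpg_operator_plugin"  # hypothetical name; the repository may use a different one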

Configure Airflow connections for BPG integration

This section guides you through setting up the Airflow connection that enables secure communication between your Amazon MWAA environment and BPG. BPGOperator uses the configured connection to authenticate and interact with BPG endpoints.

Execute the following script to configure the Airflow connection bpg_connection:

cd $REPO_DIR/airflow
./configure_connections.sh
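
For reference, a connection equivalent to the one the script creates could be defined programmatically as in this sketch. The host, port, and credentials are placeholders (admin:admin mirrors the health check earlier in this post); the actual script may configure different values.

from airflow import settings
from airflow.models import Connection

# Placeholder values; configure_connections.sh supplies the real endpoint.
conn = Connection(
    conn_id="bpg_connection",
    conn_type="http",
    host="<BPG_LOAD_BALANCER_DNS>",  # hypothetical placeholder
    port=8080,
    login="admin",     # assumption, mirroring the health check credentials
    password="admin",
)

session = settings.Session()
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()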

In the Airflow UI, choose the Admin dropdown menu and choose Connections. You will see the bpg_connection listed in the Airflow UI.

Image showing Airflow Connections page with bpg_connection configured.

Configure the Airflow DAG to execute Spark jobs

This step configures an Airflow DAG to run a sample application. In this case, we will submit a DAG containing multiple sample Spark jobs from Amazon MWAA to the EMR on EKS clusters through BPG. Allow a few minutes for the DAG to appear in the Airflow UI.

cd $REPO_DIR/jobs
./configure_job.sh
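
For illustration, a single-task DAG built around BPGOperator could look like the following sketch. The import path and the spark_pi.yaml application file are assumptions; the actual sample DAG deployed by configure_job.sh may differ.

from datetime import datetime

from airflow import DAG
# Hypothetical import path; match it to where the BPGOperator plugin is installed.
from bpg_operator import BPGOperator

with DAG(
    dag_id="MWAASparkPipelineDemoJob",
    start_date=datetime(2025, 1, 1),
    schedule=None,  # triggered manually in this walkthrough
    catchup=False,
) as dag:
    calculate_pi = BPGOperator(
        task_id="calculate_pi",
        application_file="spark_pi.yaml",  # hypothetical Spark application spec
        application_file_type="yaml",
        connection_id="bpg_connection",
    )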

Trigger the Amazon MWAA DAG

In this step, we trigger the Airflow DAG and observe the job execution behavior, including reviewing the Spark logs in the Airflow UI:

  1. In the Airflow UI, review the MWAASparkPipelineDemoJob DAG and choose the play icon to trigger the DAG.
    Image showing sample Airflow Job, highlighting the play button to trigger the job
  2. Wait for the DAG to complete successfully.
    Upon successful completion of the DAG, you should see Success:1 under the Runs column.
  3. In the Airflow UI, locate and choose the MWAASparkPipelineDemoJob DAG.
  4. On the Graph tab, choose any task (in this example, we select the calculate_pi task) and then choose the Logs tab.
    Image showing the MWAASparkPipelineDemoJob's graph view
  5. View the Spark logs in the Airflow UI.
    Image showing the MWAASparkPipelineDemoJob calculate_pi task logs

Migrate existing Airflow DAGs to use BPG

In enterprise data platforms, a typical data pipeline consists of Amazon MWAA submitting Spark jobs to multiple EMR on EKS clusters using the SparkKubernetesOperator and an Airflow connection of type Kubernetes. An Airflow connection is a set of parameters and credentials used to establish communication between Amazon MWAA and external systems or services. A DAG refers to the connection name and connects to the external system.

The following diagram shows the typical architecture.
Image showing the existing job execution workflows not using BPG

In this setup, Airflow DAGs typically use SparkKubernetesOperator and SparkKubernetesSensor to submit Spark jobs to a remote EMR on EKS cluster using kubernetes_conn_id=<connection_name>.

The following code snippet shows the relevant details:

from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator

# Submit Spark-Pi job using a Kubernetes connection
submit_spark_pi = SparkKubernetesOperator(
	task_id='submit_spark_pi',
	namespace="default",
	application_file=spark_pi_yaml,
	kubernetes_conn_id='emr_on_eks_connection_[1|2]',  # Connection ID defined in Airflow
	dag=dag
)

To migrate the infrastructure to a BPG-based infrastructure without impacting the continuity of the environment, we can deploy a parallel infrastructure using BPG, create a new Airflow connection for BPG, and incrementally migrate the DAGs to use the new connection. By doing so, we won't disrupt the existing infrastructure until the BPG-based infrastructure is fully operational, including the migration of all existing DAGs.

The following diagram showcases the interim state where both the Kubernetes connection and the BPG connection are operational. Blue arrows indicate the existing workflow paths, and red arrows represent the new BPG-based migration paths.

Image showing the existing workflow paths and the new BPG-based migration path

The modified code snippet for the DAG is as follows:

# Submit Spark-Pi job using the BPG connection
submit_spark_pi = BPGOperator(
	task_id='submit_spark_pi',
	application_file=spark_pi_yaml,
	application_file_type="yaml",
	connection_id='bpg_connection',  # Connection ID defined in Airflow
	dag=dag
)

Finally, when all the DAGs have been modified to use BPGOperator instead of SparkKubernetesOperator, you can decommission any remnants of the old workflow. The final state of the infrastructure will look like the following diagram.

Image showing the final state of the infrastructure after all the job migrations are complete.

Using this approach, we can seamlessly introduce BPG into an environment that currently uses only Amazon MWAA and EMR on EKS clusters.

Clean up

To avoid incurring future charges from the resources created in this tutorial, clean up your environment after you have completed the steps. You can do this by running the cleanup.sh script, which will safely remove all the resources provisioned during the setup:

cd ${REPO_DIR}/setup
./cleanup.sh

Conclusion

In the post Use Batch Processing Gateway to automate job management in multi-cluster Amazon EMR on EKS environments, we introduced Batch Processing Gateway as a solution for routing Spark workloads across multiple EMR on EKS clusters. In this post, we demonstrated how to enhance this foundation by integrating BPG with Amazon MWAA. Through our custom BPGOperator, we showed how to build robust end-to-end Spark-based data processing pipelines while maintaining clear separation of responsibilities and centralized code management. Finally, we demonstrated how to seamlessly incorporate the solution into your existing Amazon MWAA and EMR on EKS data platform without impacting operational continuity.

We encourage you to experiment with this architecture in your own environment, adapting it to fit your unique workloads and operational requirements. By implementing this solution, you can build efficient and scalable data processing pipelines that use the full potential of EMR on EKS and Amazon MWAA. Explore further by deploying the solution in your AWS account while adhering to your organization's security best practices, and share your experiences with the AWS Big Data community.


About the Authors

Suvojit Dasgupta is a Principal Data Architect at AWS. He leads a team of skilled engineers in designing and building scalable data solutions for AWS customers. He specializes in developing and implementing innovative data architectures to address complex business challenges.

Avinash Desireddy is a Cloud Infrastructure Architect at AWS, passionate about building secure applications and data platforms. He has extensive experience in Kubernetes, DevOps, and enterprise architecture, helping customers containerize applications, streamline deployments, and optimize cloud-native environments.
