Friday, June 6, 2025

Build a centralized observability platform for Apache Spark on Amazon EMR on EKS using external Spark History Server


Monitoring and troubleshooting Apache Spark applications becomes increasingly complex as companies scale their data analytics workloads. As data processing requirements grow, enterprises deploy these applications across multiple Amazon EMR on EKS clusters to handle diverse workloads efficiently. However, this approach creates a challenge in maintaining comprehensive visibility into Spark applications running across these separate clusters. Data engineers and platform teams need a unified view to effectively monitor and optimize their Spark applications.

Although Spark provides powerful built-in monitoring capabilities through Spark History Server (SHS), implementing a scalable and secure observability solution across multiple clusters requires careful architectural considerations. Organizations need a solution that not only consolidates Spark application metrics but extends its features by adding other performance monitoring and troubleshooting packages, while providing secure access to these insights and maintaining operational efficiency.

This post demonstrates how to build a centralized observability platform using SHS for Spark applications running on EMR on EKS. We showcase how to enhance SHS with performance monitoring tools, with a pattern applicable to many monitoring solutions such as SparkMeasure and DataFlint. In this post, we use DataFlint as an example to demonstrate how you can integrate additional monitoring solutions. We explain how to collect Spark events from multiple EMR on EKS clusters into a central Amazon Simple Storage Service (Amazon S3) bucket; deploy SHS on a dedicated Amazon Elastic Kubernetes Service (Amazon EKS) cluster; and configure secure access using AWS Load Balancer Controller, AWS Private Certificate Authority, Amazon Route 53, and AWS Client VPN. This solution provides teams with a single, secure interface to monitor, analyze, and troubleshoot Spark applications across multiple clusters.

Overview of solution

Consider DataCorp Analytics, a data-driven enterprise operating multiple business units with diverse Spark workloads. Their Financial Analytics team processes time-sensitive trading data requiring strict processing times and dedicated resources, and their Marketing Analytics team handles customer behavior data with flexible requirements, requiring multiple EMR on EKS clusters to accommodate these distinct workload patterns. As their Spark applications grow in volume and complexity across these clusters, data and platform engineers struggle to maintain comprehensive visibility while preserving secure access to monitoring tools.

This scenario presents an ideal use case for implementing a centralized observability platform using SHS and DataFlint. The solution deploys SHS on a dedicated EKS cluster, configured to read events from multiple EMR on EKS clusters through a centralized S3 bucket. Access is secured through Load Balancer Controller, AWS Private CA, Route 53, and Client VPN, and DataFlint enhances the monitoring capabilities with additional insights and visualizations. The following architecture diagram illustrates the components and their interactions.

Architecture diagram

The solution workflow is as follows:

  1. Spark applications on EMR on EKS use a custom EMR Docker image that includes DataFlint JARs for enhanced metrics collection. These applications generate detailed event logs containing execution metrics, performance data, and DataFlint-specific insights. The logs are written to a centralized Amazon S3 location through the following configuration (note in particular the configurationOverrides section). For more information, explore the StartJobRun guide to learn how to run Spark jobs and review the StartJobRun API reference.
{
  "name": "${SPARK_JOB_NAME}",
  "virtualClusterId": "${VIRTUAL_CLUSTER_ID}",  
  "executionRoleArn": "${IAM_ROLE_ARN_FOR_JOB_EXECUTION}",
  "releaseLabel": "emr-7.2.0-latest", 
  "jobDriver": {
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://${S3_BUCKET_NAME}/app/${SPARK_APP_FILE}",
      "entryPointArguments": [
        "--input-path",
        "s3://${S3_BUCKET_NAME}/data/input",
        "--output-path",
        "s3://${S3_BUCKET_NAME}/data/output"
      ],
      "sparkSubmitParameters": "--conf spark.driver.cores=1 --conf spark.driver.memory=4G --conf spark.kubernetes.driver.limit.cores=1200m --conf spark.executor.cores=2 --conf spark.executor.instances=3 --conf spark.executor.memory=4G"
    }
  }, 
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"2G",
          "spark.kubernetes.container.image": "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${EMR_REPO_NAME}:${EMR_IMAGE_TAG}",
          "spark.app.name": "${SPARK_JOB_NAME}",
          "spark.eventLog.enabled": "true",
          "spark.eventLog.dir": "s3://${S3_BUCKET_NAME}/spark-events/"
         }
      }
    ], 
    "monitoringConfiguration": {
      "persistentAppUI": "ENABLED",
      "s3MonitoringConfiguration": {
        "logUri": "s3://${S3_BUCKET_NAME}/spark-events/"
      }
    }
  }
}
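If you script job submission instead of editing the JSON by hand, the ${...} placeholders in the request above must be rendered with real values before the file is passed to `aws emr-containers start-job-run --cli-input-json`. The following is a minimal Python sketch of that substitution step using `string.Template`, which accepts the same `${VAR}` placeholder style; the variable values shown are hypothetical, not values from this walkthrough:

```python
import json
from string import Template

# Hypothetical values standing in for the environment variables referenced
# in the StartJobRun request template (replace with your own).
env = {
    "SPARK_JOB_NAME": "spark-history-demo",
    "VIRTUAL_CLUSTER_ID": "abcd1234efgh5678",
    "IAM_ROLE_ARN_FOR_JOB_EXECUTION": "arn:aws:iam::111122223333:role/emr-job-role",
}

# A trimmed template using the same ${VAR} placeholders as the full request.
template = Template(
    '{"name": "${SPARK_JOB_NAME}", '
    '"virtualClusterId": "${VIRTUAL_CLUSTER_ID}", '
    '"executionRoleArn": "${IAM_ROLE_ARN_FOR_JOB_EXECUTION}"}'
)

# Substitute placeholders, then parse to confirm the result is valid JSON.
request = json.loads(template.substitute(env))
print(request["name"])  # spark-history-demo
```

In the actual walkthrough the substitution is done by shell environment variables; the sketch only illustrates the shape of the rendered request.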

  2. A dedicated SHS deployed on Amazon EKS reads these centralized logs. The Amazon S3 location is configured in the SHS through the following code:
env:
  - name: SPARK_HISTORY_OPTS
    value: "-Dspark.history.fs.logDirectory=s3a://${S3_BUCKET}/spark-events/"

  3. We configure Load Balancer Controller, AWS Private CA, a Route 53 hosted zone, and Client VPN to securely access the SHS UI using a web browser.
  4. Finally, users can access the SHS web interface at https://spark-history-server.example.internal/.
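Note that the jobs write event logs with the `s3://` scheme while SHS reads them back through the `s3a://` Hadoop filesystem connector; both refer to the same bucket and `spark-events/` prefix. A small illustrative helper (not part of the sample repository) makes the pairing explicit:

```python
def event_log_dir(bucket: str) -> str:
    """s3:// URI that Spark jobs write event logs to (spark.eventLog.dir)."""
    return f"s3://{bucket}/spark-events/"

def shs_opts(bucket: str) -> str:
    """SPARK_HISTORY_OPTS value pointing SHS at the same prefix via s3a://."""
    return f"-Dspark.history.fs.logDirectory=s3a://{bucket}/spark-events/"

# Example with a hypothetical bucket name following this post's convention.
print(shs_opts("emr-spark-logs-111122223333-us-east-1"))
```

Keeping both URIs derived from one bucket variable avoids the common misconfiguration where jobs and SHS point at different prefixes.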

You can find the code base in the AWS Samples GitHub repository.

Prerequisites

Before you deploy this solution, make sure the following prerequisites are in place:

Set up the common infrastructure

Complete the following steps to set up the infrastructure:

  1. Clone the repository to your local machine and set the two environment variables. Replace <AWS_REGION> with the AWS Region where you want to deploy these resources.
git clone git@github.com:aws-samples/sample-centralized-spark-history-server-emr-on-eks.git
cd sample-centralized-spark-history-server-emr-on-eks
export REPO_DIR=$(pwd)
export AWS_REGION=<AWS_REGION>

  2. Execute the following script to create the common infrastructure. The script creates a secure virtual private cloud (VPC) networking environment with public and private subnets and an encrypted S3 bucket to store Spark application logs.
cd ${REPO_DIR}/infra
./deploy_infra.sh

  3. To verify successful infrastructure deployment, open the AWS CloudFormation console, choose your stack, and check the Events, Resources, and Outputs tabs for completion status, details, and the list of resources created.

Set up EMR on EKS clusters

This section covers building a custom EMR on EKS Docker image with DataFlint integration, launching two EMR on EKS clusters (datascience-cluster-v and analytics-cluster-v), and configuring the clusters for job submission. Additionally, we set up the necessary IAM roles for service accounts (IRSA) to enable Spark jobs to write events to the centralized S3 bucket. Complete the following steps:

  1. Deploy two EMR on EKS clusters:
cd ${REPO_DIR}/emr-on-eks
./deploy_emr_on_eks.sh

  2. To verify successful creation of the EMR on EKS clusters using the AWS CLI, execute the following command:
aws emr-containers list-virtual-clusters \
    --query "virtualClusters[?state=='RUNNING']"

  3. Execute the following command for the datascience-cluster-v and analytics-cluster-v clusters to verify their respective states, container provider information, and associated EKS cluster details. Replace <VIRTUAL-CLUSTER-ID> with the ID of each cluster obtained from the list-virtual-clusters output.
aws emr-containers describe-virtual-cluster \
    --id <VIRTUAL-CLUSTER-ID>
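The `--query` expression above is a JMESPath filter applied client-side by the AWS CLI. The equivalent selection in plain Python over the JSON that list-virtual-clusters returns looks like the following; the cluster IDs and the retired cluster are fabricated for illustration:

```python
# Sample shape of `aws emr-containers list-virtual-clusters` output
# (IDs and the terminated cluster are made up for illustration).
response = {
    "virtualClusters": [
        {"id": "vc-111", "name": "datascience-cluster-v", "state": "RUNNING"},
        {"id": "vc-222", "name": "analytics-cluster-v", "state": "RUNNING"},
        {"id": "vc-333", "name": "retired-cluster", "state": "TERMINATED"},
    ]
}

# Equivalent of --query "virtualClusters[?state=='RUNNING']"
running = [vc for vc in response["virtualClusters"] if vc["state"] == "RUNNING"]
print([vc["name"] for vc in running])
```

This is handy when a deployment script needs the virtual cluster IDs programmatically rather than for eyeballing.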

Configure and execute Spark jobs on EMR on EKS clusters

Complete the following steps to configure and execute Spark jobs on the EMR on EKS clusters:

  1. Generate the custom EMR on EKS image and StartJobRun request JSON files to run Spark jobs:
cd ${REPO_DIR}/jobs
./configure_jobs.sh

The script performs the following tasks:

  • Prepares the environment by uploading the sample Spark application spark_history_demo.py to a designated S3 bucket for job execution.
  • Creates a custom Amazon EMR container image by extending the base EMR 7.2.0 image with the DataFlint JAR for additional insights and publishing it to an Amazon Elastic Container Registry (Amazon ECR) repository.
  • Generates cluster-specific StartJobRun request JSON files for datascience-cluster-v and analytics-cluster-v.

Review start-job-run-request-datascience-cluster-v.json and start-job-run-request-analytics-cluster-v.json for additional details.

  2. Execute the following commands to submit Spark jobs on the EMR on EKS virtual clusters:
aws emr-containers start-job-run \
    --cli-input-json file://${REPO_DIR}/jobs/start-job-run/start-job-run-request-datascience-cluster-v.json
aws emr-containers start-job-run \
    --cli-input-json file://${REPO_DIR}/jobs/start-job-run/start-job-run-request-analytics-cluster-v.json

  3. Verify the successful generation of the logs in the S3 bucket:

aws s3 ls s3://emr-spark-logs-<AWS_ACCOUNT_ID>-<AWS_REGION>/spark-events/
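If you want to automate this check, the listing can be parsed rather than read manually. The sketch below works over sample `aws s3 ls` output lines; the object names are fabricated to illustrate the listing format only (real EMR on EKS event-log object names will differ):

```python
# Illustrative: extract object keys from sample `aws s3 ls` output
# (dates, sizes, and object names below are fabricated).
sample_listing = """\
2025-06-06 10:12:01    1048576 spark-application-one
2025-06-06 10:15:42     524288 spark-application-two
"""

# `aws s3 ls` prints: date, time, size, key — the key is the last field.
logs = [line.split()[-1] for line in sample_listing.splitlines() if line.strip()]
print(len(logs), "event log objects:", logs)
```

An empty result after both jobs complete usually points at a misconfigured spark.eventLog.dir or missing S3 write permissions on the job execution role.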

You have successfully set up an EMR on EKS environment, executed Spark jobs, and collected the logs in the centralized S3 bucket. Next, we'll deploy SHS, configure its secure access, and visualize the logs using it.

Set up AWS Private CA and create a Route 53 private hosted zone

Use the following code to deploy AWS Private CA and create a Route 53 private hosted zone. This will provide a user-friendly URL to connect to SHS over HTTPS.

cd ${REPO_DIR}/ssl
./deploy_ssl.sh

Set up SHS on Amazon EKS

Complete the following steps to build a Docker image containing SHS with DataFlint, deploy it on an EKS cluster using a Helm chart, and expose it through a Kubernetes service of type LoadBalancer. We use a Spark 3.5.0 base image, which includes SHS by default. Although this simplifies deployment, it results in a larger image size. For environments where image size is critical, consider building a custom image with just the standalone SHS component instead of using the entire Spark distribution.

  1. Deploy SHS on the spark-history-server EKS cluster:
cd ${REPO_DIR}/shs
./deploy_shs.sh

  2. Verify the deployment by listing the pods and viewing the pod logs:
kubectl get pods --namespace spark-history
kubectl logs <SHS-PODNAME> --namespace spark-history

  3. Review the logs and confirm there are no errors or exceptions.
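For a scripted health check, the same verification can run against `kubectl get pods -n spark-history -o json` instead of the human-readable table. A minimal sketch follows; the pod name and JSON are illustrative samples, not output from this deployment:

```python
import json

# Sample shape of `kubectl get pods -n spark-history -o json`
# (pod name and phase are illustrative).
pods_json = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "spark-history-server-7d9f8-abcde"},
      "status": {"phase": "Running"}
    }
  ]
}
""")

# Collect any pods not in the Running phase.
not_running = [p["metadata"]["name"] for p in pods_json["items"]
               if p["status"]["phase"] != "Running"]
print("all pods Running" if not not_running else f"check pods: {not_running}")
```

In practice you would feed the real kubectl output into the same check, for example via `subprocess.run` or a readiness probe in the Helm chart.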

You have successfully deployed SHS on the spark-history-server EKS cluster and configured it to read logs from the emr-spark-logs-<AWS_ACCOUNT_ID>-<AWS_REGION> S3 bucket.

Deploy Client VPN and add an entry to Route 53 for secure access

Complete the following steps to deploy Client VPN to securely connect your client machine (such as your laptop) to SHS and configure Route 53 to generate a user-friendly URL:

  1. Deploy the Client VPN:
cd ${REPO_DIR}/vpn
./deploy_vpn.sh

  2. Add an entry to Route 53:
cd ${REPO_DIR}/dns
./deploy_dns.sh

Add certificates to local trusted stores

Complete the following steps to add the SSL certificate to your operating system's trusted certificate stores for secure connections:

  1. For macOS users, using Keychain Access (GUI):
    1. Open Keychain Access from Applications, Utilities, choose the System keychain in the navigation pane, and choose File, Import Items.
    2. Browse to and choose ${REPO_DIR}/ssl/certificates/ca-certificate.pem, then choose the imported certificate.
    3. Expand the Trust section and set When using this certificate to Always Trust.
    4. Close, enter your password when prompted, and save.
    5. Alternatively, you can execute the following command to include the certificate in Keychain and trust it:
sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain "${REPO_DIR}/ssl/certificates/ca-certificate.pem"

  2. For Windows users:
    1. Rename ca-certificate.pem to ca-certificate.crt.
    2. Choose (right-click) ca-certificate.crt and choose Install Certificate.
    3. Choose Local Machine (admin rights required).
    4. Select Place all certificates in the following store.
    5. Choose Browse and select Trusted Root Certification Authorities.
    6. Complete the installation by choosing Next and Finish.

Set up Client VPN on your client machine for secure access

Complete the following steps to install and configure Client VPN on your client machine (such as your laptop) and create a VPN connection to the AWS Cloud:

  1. Download, install, and launch the Client VPN application from the official download page for your operating system.
  2. Create your VPN profile:
    1. Choose File in the menu bar, choose Manage Profiles, and choose Add Profile.
    2. Enter a name for your profile, for example, SparkHistoryServerUI.
    3. Browse to ${REPO_DIR}/vpn/client_vpn_certs/client-config.ovpn, choose the certificate file, and choose Add Profile to save your configuration.
  3. Select your newly created profile, choose Connect, and wait for the connection confirmation to establish the VPN connection.

When you're connected, you'll have secure access to the AWS resources in your environment.

VPN connection details

Securely access the SHS URL

Complete the following steps to securely access SHS using a web browser:

  1. Get the SHS URL:

https://spark-history-server.example.internal/

  2. Copy this URL and enter it into your web browser to access the SHS UI.

The following screenshot shows an example of the UI.

Spark History Server job summary page

  3. Choose an App ID to view its detailed execution information and metrics.

Spark History Server job detail page

  4. Choose the DataFlint tab to view detailed application insights and analytics.

DataFlint insights page

DataFlint displays various useful metrics, including alerts, as shown in the following screenshot.

DataFlint alerts page

Clean up

To avoid incurring future costs from the resources created in this tutorial, clean up your environment after completing the steps. To remove all provisioned resources:

  1. Disconnect from the Client VPN.
  2. Run the cleanup.sh script:
cd ${REPO_DIR}/
./cleanup.sh

Conclusion

In this post, we demonstrated how to build a centralized observability platform for Spark applications using SHS and how to enhance SHS with performance monitoring tools like DataFlint. The solution aggregates Spark events from multiple EMR on EKS clusters into a unified monitoring interface, providing comprehensive visibility into your Spark applications' performance and resource utilization. By using a custom EMR image with performance monitoring tool integration, we enhanced the standard Spark metrics to gain deeper insights into application behavior. If your environment uses a combination of EMR on EKS, Amazon EMR on EC2, or Amazon EMR Serverless, you can seamlessly extend this architecture to aggregate the logs from EMR on EC2 and EMR Serverless in a similar manner and visualize them using SHS.

Although this solution provides a strong foundation for Spark monitoring, production deployments should consider implementing authentication and authorization. SHS supports custom authentication through javax servlet filters and fine-grained authorization through access control lists (ACLs). We encourage you to explore implementing authentication filters for secure access control, configuring user- and group-based ACLs for view and modify permissions, and setting up group mapping providers for role-based access. For detailed guidance, refer to Spark's web UI security documentation and SHS security features.
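As a sketch of what that hardening could look like, the following spark-defaults fragment enables the documented SHS ACL properties; the filter class name and the user and group names are placeholders you would replace with your own authentication setup:

```properties
# Enable per-application ACL enforcement in Spark History Server
spark.history.ui.acls.enable        true
# Users and groups allowed to view all applications (placeholder names)
spark.history.ui.admin.acls         admin-user
spark.history.ui.admin.acls.groups  platform-admins
# Custom servlet filter performing authentication (placeholder class name)
spark.ui.filters                    com.example.auth.MyAuthFilter
```

The filter referenced by spark.ui.filters must implement javax.servlet.Filter and be present on the SHS classpath, for example baked into the custom SHS image built earlier.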

While AWS endeavors to apply security best practices within this example, each organization has its own policies. Make sure to apply your organization's specific policies when deploying this solution as a starting point for implementing centralized Spark monitoring in your data processing environment.


About the authors

Sri Potluri is a Cloud Infrastructure Architect at AWS. He is passionate about solving complex problems and delivering well-structured solutions for diverse customers. His expertise spans a wide range of cloud technologies, providing scalable and reliable infrastructures tailored to each project's unique challenges.

Suvojit Dasgupta is a Principal Data Architect at AWS. He leads a team of skilled engineers in designing and building scalable data solutions for AWS customers. He specializes in creating and implementing innovative data architectures to address complex business challenges.
