Amazon EMR Serverless observability, Part 1: Monitor Amazon EMR Serverless workers in near real time using Amazon CloudWatch

Amazon EMR Serverless allows you to run open source big data frameworks such as Apache Spark and Apache Hive without managing clusters and servers. With EMR Serverless, you can run analytics workloads at any scale with automatic scaling that resizes resources in seconds to meet changing data volumes and processing requirements.

We have launched job worker metrics in Amazon CloudWatch for EMR Serverless. This feature allows you to monitor vCPUs, memory, ephemeral storage, and disk I/O allocation and usage metrics at an aggregate worker level for your Spark and Hive jobs.

This post is part of a series about EMR Serverless observability. In this post, we discuss how to use these CloudWatch metrics to monitor EMR Serverless workers in near real time.

CloudWatch metrics for EMR Serverless

At the per-Spark job level, EMR Serverless emits the following new metrics to CloudWatch for both the driver and the executors. These metrics provide granular insights into job performance, bottlenecks, and resource utilization.

WorkerCpuAllocated	The total number of vCPU cores allocated for workers in a job run
WorkerCpuUsed	The total number of vCPU cores utilized by workers in a job run
WorkerMemoryAllocated	The total memory in GB allocated for workers in a job run
WorkerMemoryUsed	The total memory in GB utilized by workers in a job run
WorkerEphemeralStorageAllocated	The number of bytes of ephemeral storage allocated for workers in a job run
WorkerEphemeralStorageUsed	The number of bytes of ephemeral storage used by workers in a job run
WorkerStorageReadBytes	The number of bytes read from storage by workers in a job run
WorkerStorageWriteBytes	The number of bytes written to storage from workers in a job run
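As a rough sketch of how these metrics could be pulled programmatically, the following Python snippet builds the query list that the CloudWatch GetMetricData API expects, pairing each allocated metric with its used counterpart. The metric names come from the table above. The namespace string, the dimension names, and the helper name build_queries are assumptions for illustration and should be checked against the Job-level monitoring documentation before use.

```python
# Metric names are taken from the table above. NAMESPACE and the dimension
# names passed in by the caller are assumptions, not confirmed by this post.
NAMESPACE = "AWS/EMRServerless"  # assumed CloudWatch namespace

METRIC_PAIRS = {
    "cpu": ("WorkerCpuAllocated", "WorkerCpuUsed"),
    "memory": ("WorkerMemoryAllocated", "WorkerMemoryUsed"),
    "storage": ("WorkerEphemeralStorageAllocated", "WorkerEphemeralStorageUsed"),
}


def build_queries(dimensions, period_seconds=60, stat="Maximum"):
    """Build a MetricDataQueries list (hypothetical helper) for GetMetricData."""
    dims = [{"Name": name, "Value": value} for name, value in dimensions.items()]
    queries = []
    for resource, metric_names in METRIC_PAIRS.items():
        for kind, metric_name in zip(("allocated", "used"), metric_names):
            queries.append({
                "Id": f"{resource}_{kind}",  # Ids must start with a lowercase letter
                "MetricStat": {
                    "Metric": {
                        "Namespace": NAMESPACE,
                        "MetricName": metric_name,
                        "Dimensions": dims,
                    },
                    "Period": period_seconds,
                    "Stat": stat,
                },
                "ReturnData": True,
            })
    return queries


# Placeholder dimension values; the dimension names themselves are assumed.
queries = build_queries({"ApplicationId": "<APPLICATION_ID>", "JobId": "<JOB_RUN_ID>"})
print(len(queries), [q["Id"] for q in queries])
```

The resulting list can be passed as the MetricDataQueries argument of the boto3 CloudWatch client's get_metric_data call, together with StartTime and EndTime. CloudWatch only returns data when the dimension set matches exactly what the service publishes, so additional dimensions may be needed.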

The following are the benefits of monitoring your EMR Serverless jobs with CloudWatch:

Optimize resource utilization – You can gain insights into resource utilization patterns and optimize your EMR Serverless configurations for better efficiency and cost savings. For example, underutilization of vCPUs or memory can reveal resource wastage, allowing you to optimize worker sizes to achieve potential cost savings.
Diagnose common errors – You can identify root causes and mitigation for common errors without log diving. For example, you can monitor the usage of ephemeral storage and mitigate disk bottlenecks by preemptively allocating more storage per worker.
Gain near real-time insights – CloudWatch offers near real-time monitoring capabilities, allowing you to track the performance of your EMR Serverless jobs while they are running, for quick detection of any anomalies or performance issues.
Configure alerts and notifications – CloudWatch enables you to set up alarms using Amazon Simple Notification Service (Amazon SNS) based on predefined thresholds, allowing you to receive notifications through email or text message when specific metrics reach critical levels.
Conduct historical analysis – CloudWatch stores historical data, allowing you to analyze trends over time, identify patterns, and make informed decisions for capacity planning and workload optimization.

Solution overview

To further enhance this observability experience, we have created a solution that gathers all these metrics on a single CloudWatch dashboard for an EMR Serverless application. You need to launch one AWS CloudFormation template per EMR Serverless application. You can monitor all the jobs submitted to a single EMR Serverless application using the same CloudWatch dashboard. To learn more about this dashboard and deploy this solution into your own account, refer to the EMR Serverless CloudWatch Dashboard GitHub repository.

In the following sections, we walk you through how you can use this dashboard to perform the following actions:

Optimize your resource utilization to save costs without impacting job performance
Diagnose failures due to common errors without the need for log diving and resolve those errors optimally

Prerequisites

To run the sample jobs provided in this post, you need to create an EMR Serverless application with default settings using the AWS Management Console or AWS Command Line Interface (AWS CLI), and then launch the CloudFormation template from the GitHub repo with the EMR Serverless application ID provided as the input to the template.

You need to submit all the jobs in this post to the same EMR Serverless application. If you want to monitor a different application, you can deploy this template for your own EMR Serverless application ID.

Optimize resource utilization

When running Spark jobs, you often start with the default configurations. It can be challenging to optimize your workload without any visibility into actual resource usage. Some of the most common configurations that we’ve seen customers adjust are spark.driver.cores, spark.driver.memory, spark.executor.cores, and spark.executor.memory.

To illustrate how the newly added CloudWatch dashboard worker-level metrics can help you fine-tune your job configurations for better price-performance and enhanced resource utilization, let’s run the following Spark job, which uses the NOAA Integrated Surface Database (ISD) dataset to run some transformations and aggregations.

Use the following command to run this job on EMR Serverless. Provide your Amazon Simple Storage Service (Amazon S3) bucket and the EMR Serverless application ID for which you launched the CloudFormation template. Make sure to use the same application ID to submit all the sample jobs in this post. Additionally, provide an AWS Identity and Access Management (IAM) runtime role.

aws emr-serverless start-job-run \
  --name emrs-cw-dashboard-test-1 \
  --application-id <APPLICATION_ID> \
  --execution-role-arn <JOB_ROLE_ARN> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<BUCKET_NAME>/scripts/windycity.py",
      "entryPointArguments": ["s3://noaa-global-hourly-pds/2024/", "s3://<BUCKET_NAME>/emrs-cw-dashboard-test-1/"]
    }
  }'

Now let’s check the executor vCPUs and memory from the CloudWatch dashboard.

This job was submitted with default EMR Serverless Spark configurations. From the Executor CPU Allocated metric in the preceding screenshot, the job was allocated 396 vCPUs in total (99 executors * 4 vCPUs per executor). However, the job only used a maximum of 110 vCPUs based on Executor CPU Used. This indicates oversubscription of vCPU resources. Similarly, the job was allocated 1,584 GB memory in total based on Executor Memory Allocated. However, from the Executor Memory Used metric, we see that the job only used 176 GB of memory during the run, indicating memory oversubscription.
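The oversubscription described above can be expressed as a utilization ratio of peak usage to allocation. The following is a minimal sketch using only the figures quoted in this section; the helper name utilization_pct is hypothetical.

```python
def utilization_pct(used, allocated):
    """Return peak usage as a percentage of the allocation."""
    if allocated <= 0:
        raise ValueError("allocated must be positive")
    return 100.0 * used / allocated


# Figures quoted in this section for the default-configuration run.
executors, vcpus_per_executor = 99, 4
cpu_allocated = executors * vcpus_per_executor  # 396 vCPUs
cpu_pct = utilization_pct(110, cpu_allocated)   # peak Executor CPU Used
mem_pct = utilization_pct(176, 1584)            # peak Executor Memory Used, in GB

print(f"vCPU utilization:   {cpu_pct:.1f}%")
print(f"Memory utilization: {mem_pct:.1f}%")
```

At peak, the job used a little over a quarter of its vCPUs and about a ninth of its memory, which is why smaller and fewer executors are a reasonable next step.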

Now let’s rerun this job with the following adjusted configurations.

	Original Job (Default Configuration)	Rerun Job (Adjusted Configuration)
spark.executor.memory	14 GB	3 GB
spark.executor.cores	4	2
spark.dynamicAllocation.maxExecutors	99	30
Total Resource Utilization	6.521 vCPU-hours, 26.084 memoryGB-hours, 32.606 storageGB-hours	1.739 vCPU-hours, 3.688 memoryGB-hours, 17.394 storageGB-hours
Billable Resource Utilization	7.046 vCPU-hours, 28.182 memoryGB-hours, 0 storageGB-hours	1.739 vCPU-hours, 3.688 memoryGB-hours, 0 storageGB-hours

We use the following code:

aws emr-serverless start-job-run \
  --name emrs-cw-dashboard-test-2 \
  --application-id <APPLICATION_ID> \
  --execution-role-arn <JOB_ROLE_ARN> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<BUCKET_NAME>/scripts/windycity.py",
      "entryPointArguments": ["s3://noaa-global-hourly-pds/2024/", "s3://<BUCKET_NAME>/emrs-cw-dashboard-test-2/"],
      "sparkSubmitParameters": "--conf spark.driver.cores=2 --conf spark.driver.memory=3g --conf spark.executor.memory=3g --conf spark.executor.cores=2 --conf spark.dynamicAllocation.maxExecutors=30"
    }
  }'
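Before checking the dashboard, it can help to estimate the maximum executor allocation that a configuration implies. The sketch below assumes that each worker's memory is spark.executor.memory plus a 10% overhead, rounded up to a whole GB. That rule is an assumption, not something stated in this post, although it reproduces the allocation figures reported for both runs. The function names are hypothetical.

```python
import math


def worker_memory_gb(spark_memory_gb, overhead_fraction=0.10):
    """Assumed sizing rule: memory plus 10% overhead, rounded up to a whole GB."""
    # round() guards against float noise such as 10 * 1.1 == 11.000000000000002
    return math.ceil(round(spark_memory_gb * (1 + overhead_fraction), 6))


def max_executor_allocation(max_executors, cores, memory_gb):
    """Return (total vCPUs, total memory in GB) when all executors are running."""
    return max_executors * cores, max_executors * worker_memory_gb(memory_gb)


default_alloc = max_executor_allocation(max_executors=99, cores=4, memory_gb=14)
adjusted_alloc = max_executor_allocation(max_executors=30, cores=2, memory_gb=3)

print("default :", default_alloc)
print("adjusted:", adjusted_alloc)
```

Under these assumptions, the default configuration tops out at 396 vCPUs and 1,584 GB, and the adjusted configuration at 60 vCPUs and 120 GB.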

Let’s check the executor metrics from the CloudWatch dashboard again for this job run.

In the second job, we see lower allocation of both vCPUs (396 vs. 60) and memory (1,584 GB vs. 120 GB) as expected, resulting in better utilization of resources. The original job ran for 4 minutes, 41 seconds. The second job took 4 minutes, 54 seconds. This reconfiguration has resulted in 79% lower cost without affecting the job performance.
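The 79% figure can be reproduced from the billable resource utilization in the table. The sketch below uses per-unit prices that are assumptions (they are not given in this post and vary by Region and architecture), so treat the dollar amounts as illustrative. Billable storage is 0 in both runs, so it is left out.

```python
# Assumed on-demand prices in USD; not stated in this post, check current pricing.
VCPU_HOUR_USD = 0.052624
MEMORY_GB_HOUR_USD = 0.0057785


def job_cost(vcpu_hours, memory_gb_hours):
    """Estimate job cost from billable vCPU-hours and memoryGB-hours."""
    return vcpu_hours * VCPU_HOUR_USD + memory_gb_hours * MEMORY_GB_HOUR_USD


# Billable resource utilization from the table above.
original_cost = job_cost(7.046, 28.182)
rerun_cost = job_cost(1.739, 3.688)
savings_pct = 100.0 * (1 - rerun_cost / original_cost)

print(f"original: ${original_cost:.4f}")
print(f"rerun:    ${rerun_cost:.4f}")
print(f"savings:  {savings_pct:.0f}%")
```

The memory reduction is proportionally larger than the vCPU reduction, but vCPU-hours dominate the bill at these assumed prices, so the blended saving lands near 79%.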

You can use these metrics to further optimize your job by increasing or decreasing the number of workers or the allocated resources.

Diagnose and resolve job failures

Using the CloudWatch dashboard, you can diagnose job failures due to issues related to CPU, memory, and storage, such as out of memory or no space left on the device. This enables you to identify and resolve common errors quickly without having to check the logs or navigate through Spark History Server. Additionally, because you can check the resource utilization from the dashboard, you can fine-tune the configurations by increasing the required resources only as much as needed instead of oversubscribing to the resources, which further saves costs.

Driver errors

To illustrate this use case, let’s run the following Spark job, which creates a large Spark data frame with a few million rows. Typically, this operation is done by the Spark driver. While submitting the job, we also configure spark.rpc.message.maxSize, because it’s required for task serialization of data frames with a large number of columns.

aws emr-serverless start-job-run \
  --name emrs-cw-dashboard-test-3 \
  --application-id <APPLICATION_ID> \
  --execution-role-arn <JOB_ROLE_ARN> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<BUCKET_NAME>/scripts/create-large-disk.py",
      "sparkSubmitParameters": "--conf spark.rpc.message.maxSize=2000"
    }
  }'

After a few minutes, the job failed with the error message “Encountered errors when releasing containers,” as seen in the Job details section.

When you encounter non-descriptive error messages, it becomes crucial to investigate by examining the driver and executor logs. But before log diving, let’s first check the CloudWatch dashboard, specifically the driver metrics, because releasing containers is generally performed by the driver.

We can see that Driver CPU Used and Driver Storage Used are well within their respective allocated values. However, upon checking Driver Memory Allocated and Driver Memory Used, we can see that the driver was using all of the 16 GB memory allocated to it. By default, EMR Serverless drivers are assigned 16 GB memory.

Let’s rerun the job with more driver memory allocated. Let’s set driver memory to 27 GB as the starting point, because spark.driver.memory + spark.driver.memoryOverhead must be less than 30 GB for the default worker type. spark.rpc.message.maxSize will be unchanged.
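The choice of 27 GB follows from the 30 GB ceiling once overhead is included. The sketch below assumes spark.driver.memoryOverhead is left at 10% of spark.driver.memory; that fraction is an assumption, and an explicitly configured overhead would change the result. The function names are hypothetical.

```python
DEFAULT_WORKER_LIMIT_GB = 30.0  # limit quoted above for the default worker type


def driver_footprint_gb(driver_memory_gb, overhead_fraction=0.10):
    """spark.driver.memory plus an assumed 10% spark.driver.memoryOverhead."""
    return driver_memory_gb * (1 + overhead_fraction)


def fits_default_worker(driver_memory_gb, limit_gb=DEFAULT_WORKER_LIMIT_GB):
    """True when memory plus overhead stays below the worker limit."""
    return driver_footprint_gb(driver_memory_gb) < limit_gb


# Largest whole-GB spark.driver.memory value that still fits.
largest_fit = max(gb for gb in range(1, 31) if fits_default_worker(gb))

print("27 GB fits:", fits_default_worker(27))
print("28 GB fits:", fits_default_worker(28))
print("largest whole-GB setting:", largest_fit)
```

With a 10% overhead, 27 GB needs roughly 29.7 GB in total, while 28 GB would need about 30.8 GB and no longer fit.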

aws emr-serverless start-job-run \
  --name emrs-cw-dashboard-test-4 \
  --application-id <APPLICATION_ID> \
  --execution-role-arn <JOB_ROLE_ARN> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<BUCKET_NAME>/scripts/create-large-disk.py",
      "sparkSubmitParameters": "--conf spark.driver.memory=27G --conf spark.rpc.message.maxSize=2000"
    }
  }'

The job succeeded this time around. Let’s check the CloudWatch dashboard to observe driver memory utilization.

As we can see, the allocated memory is now 30 GB, but the actual driver memory usage didn’t exceed 21 GB during the job run. Therefore, we can further optimize costs here by reducing the value of spark.driver.memory. We reran the same job with spark.driver.memory set to 22 GB, and the job still succeeded with better driver memory utilization.

Executor errors

Using CloudWatch for observability is ideal for diagnosing driver-related issues because there is only one driver per job, so the driver resources used reflect the actual resource usage of that single driver. Executor metrics, on the other hand, are aggregated across all the workers. However, you can still use this dashboard to provide only an adequate amount of resources to make your job succeed, thereby avoiding oversubscription of resources.

To illustrate, let’s run the following Spark job, which simulates uniform disk over-utilization across all workers by processing very large NOAA datasets from several years. This job also transiently caches a very large data frame on disk.

aws emr-serverless start-job-run \
  --name emrs-cw-dashboard-test-5 \
  --application-id <APPLICATION_ID> \
  --execution-role-arn <JOB_ROLE_ARN> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<BUCKET_NAME>/scripts/noaa-disk.py"
    }
  }'

After a few minutes, we can see that the job failed with a “No space left on device” error in the Job details section, which indicates that some of the workers have run out of disk space.

Checking the Running Executors metric from the dashboard, we can identify that there were 99 executor workers running. Each worker comes with 20 GB storage by default.

Because this is a Spark task failure, let’s check the Executor Storage Allocated and Executor Storage Used metrics from the dashboard (because the driver won’t run any tasks).

As we can see, the 99 executors have used up a total of 1,940 GB from the total allocated executor storage of 2,126 GB. This includes both the data shuffled by the executors and the storage used for caching the data frame. We don’t see the full 2,126 GB being utilized in this graph because there might be a few executors out of the 99 that weren’t holding much data when the job failed (before these executors could start processing tasks and store the data frame chunks).

Let’s rerun the same job but with increased executor disk size using the parameter spark.emr-serverless.executor.disk. Let’s try with 40 GB disk per executor as a starting point.

aws emr-serverless start-job-run \
  --name emrs-cw-dashboard-test-6 \
  --application-id <APPLICATION_ID> \
  --execution-role-arn <JOB_ROLE_ARN> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<BUCKET_NAME>/scripts/noaa-disk.py",
      "sparkSubmitParameters": "--conf spark.emr-serverless.executor.disk=40G"
    }
  }'

This time, the job ran successfully. Let’s check the Executor Storage Allocated and Executor Storage Used metrics.

Executor Storage Allocated is now 4,251 GB because we’ve doubled the value of spark.emr-serverless.executor.disk. Although there is now twice as much aggregated executor storage, the job still used only a maximum of 1,940 GB out of 4,251 GB. This indicates that our executors were likely running out of disk space by only a few GBs. Therefore, we can try setting spark.emr-serverless.executor.disk to an even lower value like 25 GB or 30 GB instead of 40 GB to save storage costs, as we did in the previous scenario. In addition, you can monitor Executor Storage Read Bytes and Executor Storage Write Bytes to see if your job is I/O intensive. In this case, you can use the Shuffle-optimized disks feature of EMR Serverless to further enhance your job’s I/O performance.
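The reasoning about disk headroom can be checked with a few lines of arithmetic. This sketch uses only the aggregate figures quoted in this section; the per-executor average is an estimate that assumes data is spread evenly, which holds for this example but not in general.

```python
def pct(used, allocated):
    """Return usage as a percentage of the allocation."""
    return 100.0 * used / allocated


EXECUTORS = 99
PEAK_USED_GB = 1940  # peak Executor Storage Used, the same in both runs

failed_run_pct = pct(PEAK_USED_GB, 2126)     # 20 GB disk per executor
succeeded_run_pct = pct(PEAK_USED_GB, 4251)  # 40 GB disk per executor
avg_used_per_executor_gb = PEAK_USED_GB / EXECUTORS

print(f"failed run:    {failed_run_pct:.1f}% of allocated storage used")
print(f"succeeded run: {succeeded_run_pct:.1f}% of allocated storage used")
print(f"average used per executor: {avg_used_per_executor_gb:.1f} GB")
```

The failed run was using over 90% of its aggregate storage, while the 40 GB run used less than half, which supports trying an intermediate disk size such as 25 GB or 30 GB.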

The dashboard is also useful for capturing information about transient storage used while caching or persisting data frames, including spill-to-disk scenarios. The Storage tab of Spark History Server records any caching activities, as seen in the following screenshot. However, this data is lost from Spark History Server after the cache is evicted or when the job finishes. Therefore, Executor Storage Used can be used to analyze a failed job run caused by transient storage issues.

In this particular example, the data was evenly distributed among the executors. However, if you have data skew (for example, only 1–2 executors out of 99 process most of the data, and as a result, your job runs out of disk space), the CloudWatch dashboard won’t accurately capture this scenario because the storage data is aggregated across all the executors for a job. For diagnosing issues at the individual executor level, we need to track per-executor-level metrics. We explore more advanced examples of how per-worker-level metrics can help you identify, mitigate, and resolve hard-to-find issues through EMR Serverless integration with Amazon Managed Service for Prometheus.
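To see why an aggregate metric hides skew, consider the following sketch. The per-executor numbers are hypothetical storage demands invented for illustration, not measurements from the jobs in this post. Both distributions sum to the same total, so an aggregated storage metric would look identical, yet only one of them has an executor that needs more than its 20 GB disk.

```python
DISK_PER_EXECUTOR_GB = 20  # default disk size per worker, as noted earlier

# Hypothetical per-executor storage demand in GB for 99 executors.
even_demand = [10] * 99            # evenly distributed data
skewed_demand = [108] + [9] * 98   # one executor holds most of the data


def summarize(per_executor_gb, disk_gb=DISK_PER_EXECUTOR_GB):
    """Compare the aggregate view with the per-executor view."""
    return {
        "aggregate_gb": sum(per_executor_gb),
        "max_gb": max(per_executor_gb),
        "executors_over_disk": sum(1 for gb in per_executor_gb if gb > disk_gb),
    }


even_summary = summarize(even_demand)
skewed_summary = summarize(skewed_demand)

print("even  :", even_summary)
print("skewed:", skewed_summary)
```

The aggregate is 990 GB in both cases, well under the 1,980 GB that 99 disks of 20 GB provide, but the skewed case would still fail on the one overloaded executor. Catching that requires per-executor metrics.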

Conclusion

In this post, you learned how to effectively manage and optimize your EMR Serverless application using a single CloudWatch dashboard with enhanced EMR Serverless metrics. These metrics are available in all AWS Regions where EMR Serverless is available. For more details about this feature, refer to Job-level monitoring.

About the Authors

Kashif Khan is a Sr. Analytics Specialist Solutions Architect at AWS, specializing in big data services like Amazon EMR, AWS Lake Formation, AWS Glue, Amazon Athena, and Amazon DataZone. With over a decade of experience in the big data domain, he possesses extensive expertise in architecting scalable and robust solutions. His role involves providing architectural guidance and collaborating closely with customers to design tailored solutions using AWS analytics services to unlock the full potential of their data.

Veena Vasudevan is a Principal Partner Solutions Architect and Data & AI specialist at AWS. She helps customers and partners build highly optimized, scalable, and secure solutions; modernize their architectures; and migrate their big data, analytics, and AI/ML workloads to AWS.
