Attribute Amazon EMR on EC2 costs to your end-users

Amazon EMR on EC2 is a managed service that makes it straightforward to run big data processing and analytics workloads on AWS. It simplifies the setup and management of popular open source frameworks like Apache Hadoop and Apache Spark, allowing you to focus on extracting insights from large datasets rather than the underlying infrastructure. With Amazon EMR, you can take advantage of the power of these big data tools to process, analyze, and gain valuable business intelligence from vast amounts of data.

Cost optimization is one of the pillars of the Well-Architected Framework. It focuses on avoiding unnecessary costs, selecting the most appropriate resource types, analyzing spend over time, and scaling in and out to meet business needs without overspending. An optimized workload maximizes the use of all available resources, delivers the desired outcome at the most cost-effective price point, and meets your functional needs.

The current Amazon EMR pricing page shows the estimated cost of the cluster. You can also use AWS Cost Explorer to get more detailed information about your costs. These views give you an overall picture of your Amazon EMR costs. However, you may need to attribute costs at the individual Spark job level. For example, you might want to know the usage cost in Amazon EMR for the finance business unit. Or, for chargeback purposes, you might need to aggregate the cost of Spark applications by functional area. After you have allocated costs to individual Spark jobs, this data can help you make informed decisions to optimize your costs. For instance, you could choose to restructure your applications to use fewer resources. Alternatively, you might opt to explore different pricing models like Amazon EMR on EKS or Amazon EMR Serverless.

In this post, we share a chargeback model that you can use to track and allocate the costs of Spark workloads running on Amazon EMR on EC2 clusters. We describe an approach that assigns Amazon EMR costs to different jobs, teams, or lines of business. You can use this feature to distribute costs across various business units. This can assist you in monitoring the return on investment for your Spark-based workloads.

Solution overview

The solution is designed to help you track the cost of your Spark applications running on EMR on EC2. It can help you identify cost optimizations and improve the cost-efficiency of your EMR clusters.

The proposed solution uses a scheduled AWS Lambda function that runs daily. The function captures usage and cost metrics, which are then stored in Amazon Relational Database Service (Amazon RDS) tables. The data stored in the RDS tables is queried to derive chargeback figures and generate reporting trends using Amazon QuickSight. Using these AWS services incurs additional costs for implementing this solution. Alternatively, if you want to avoid the use of additional AWS services and the associated costs, you can consider an approach that uses a cron-based agent script installed on your existing EMR cluster. This script stores the relevant metrics in an Amazon Simple Storage Service (Amazon S3) bucket, and uses Python Jupyter notebooks to generate chargeback numbers based on the data files stored in Amazon S3, using AWS Glue tables.

The following diagram shows the current solution architecture.

The workflow consists of the following steps:

A Lambda function gets the following parameters from Parameter Store, a capability of AWS Systems Manager:

{
  "yarn_url": "http://dummy.compute-1.amazonaws.com:8088/ws/v1/cluster/apps",
  "tbl_applicationlogs_lz": "public.emr_applications_execution_log_lz",
  "tbl_applicationlogs": "public.emr_applications_execution_log",
  "tbl_emrcost": "public.emr_cluster_usage_cost",
  "tbl_emrinstance_usage": "public.emr_cluster_instances_usage",
  "emrcluster_id": "j-xxxxxxxxxx",
  "emrcluster_name": "EMR_Cost_Measure",
  "emrcluster_role": "dt-dna-shared",
  "emrcluster_linkedaccount": "xxxxxxxxxxx",
  "postgres_rds": {
    "host": "xxxxxxxxx.amazonaws.com",
    "dbname": "postgres",
    "user": "postgresadmin",
    "secretid": "postgressecretid"
  }
}

The Lambda function extracts Spark application run logs from the EMR cluster using the Resource Manager API. The following metrics are extracted as part of the process: vcore-seconds, memory MB-seconds, and storage GB-seconds.
The Lambda function captures the daily cost of EMR clusters from Cost Explorer.
The Lambda function also extracts EMR On-Demand and Spot Instance usage data using the Amazon Elastic Compute Cloud (Amazon EC2) Boto3 APIs.
The Lambda function loads these datasets into an RDS database.
The cost of running a Spark application is determined by the amount of CPU resources it uses, compared to the total CPU usage of all Spark applications. This information is used to distribute the overall cost among different teams, business lines, or EMR queues.
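To make the first extraction step more concrete, the following is a minimal Python sketch of how the YARN ResourceManager apps endpoint could be queried and mapped to the columns of emr_applications_execution_log. It is not the code from the GitHub repo. The helper names (build_apps_url, to_log_row) and the column mapping are assumptions made for illustration; the query parameters and response fields are those of the YARN Cluster Applications REST API.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def build_apps_url(yarn_url: str, day: datetime) -> str:
    """Build a query for Spark apps that finished during `day` (UTC)."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    params = {
        "applicationTypes": "SPARK",
        "finishedTimeBegin": int(start.timestamp() * 1000),  # epoch milliseconds
        "finishedTimeEnd": int(end.timestamp() * 1000),
    }
    return f"{yarn_url}?{urlencode(params)}"


def _from_millis(millis: int) -> datetime:
    """Convert the epoch-millisecond timestamps YARN returns to UTC datetimes."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


def to_log_row(app: dict, cluster_id: str) -> dict:
    """Map one YARN app record to the log table columns (assumed mapping)."""
    return {
        "app_id": app["id"],
        "app_name": app["name"],
        "queue": app["queue"],
        "job_state": app["state"],
        "job_status": app["finalStatus"],
        "starttime": _from_millis(app["startedTime"]),
        "endtime": _from_millis(app["finishedTime"]),
        "runtime_seconds": app["elapsedTime"] // 1000,  # YARN reports milliseconds
        "vcore_seconds": app["vcoreSeconds"],
        "memory_seconds": app["memorySeconds"],
        "running_containers": app["runningContainers"],
        "rm_clusterid": cluster_id,
    }
```

The response from the endpoint is expected to hold the records under apps.app, so a caller would fetch the URL (for example with the requests library) and pass each element to to_log_row.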

The extraction process runs daily, extracting the previous day’s data and storing it in an Amazon RDS for PostgreSQL table. The historical data in the table needs to be purged based on your use case.
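As an illustration of the previous-day extraction, the following Python sketch builds the request parameters that could be passed to the Cost Explorer GetCostAndUsage API for one day of cost, filtered by the cluster’s cost allocation tag. The function names and the choice to group by service are assumptions for this example, not the repo’s implementation.

```python
from datetime import date, timedelta


def previous_day_window(today: date) -> tuple[str, str]:
    """Return (start, end) ISO dates covering yesterday; the end date is exclusive."""
    start = today - timedelta(days=1)
    return start.isoformat(), today.isoformat()


def cost_explorer_request(today: date, tag_key: str, tag_value: str) -> dict:
    """Build keyword arguments for Cost Explorer's get_cost_and_usage call."""
    start, end = previous_day_window(today)
    return {
        "TimePeriod": {"Start": start, "End": end},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost", "NetUnblendedCost"],
        # Only costs carrying the cluster's activated cost allocation tag.
        "Filter": {"Tags": {"Key": tag_key, "Values": [tag_value]}},
        # One result group per AWS service (for example, EMR and EC2).
        "GroupBy": [{"Type": "DIMENSION", "Key": "SERVICE"}],
    }
```

With Boto3 available, the request could be sent with boto3.client("ce").get_cost_and_usage(**cost_explorer_request(date.today(), "cost-center", "<your tag value>")).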

The solution is open source and available on GitHub.

You can use the AWS Cloud Development Kit (AWS CDK) to deploy the Lambda function, RDS for PostgreSQL data model tables, and a QuickSight dashboard to track EMR cluster cost at the job, team, or business unit level.

The following schemas show the tables used in the solution, which are queried by QuickSight to populate the dashboard.

emr_applications_execution_log_lz or public.emr_applications_execution_log – Storage for daily run metrics for all jobs run on the EMR cluster:
- appdatecollect – Log collection date
- app_id – Spark job run ID
- app_name – Run name
- queue – EMR queue in which the job was run
- job_state – Job running state
- job_status – Job run final status (Succeeded or Failed)
- starttime – Job start time
- endtime – Job end time
- runtime_seconds – Runtime in seconds
- vcore_seconds – Consumed vCore CPU in seconds
- memory_seconds – Memory consumed
- running_containers – Containers used
- rm_clusterid – EMR cluster ID
emr_cluster_usage_cost – Captures Amazon EMR and Amazon EC2 daily cost consumption from Cost Explorer and loads the data into the RDS table:
- costdatecollect – Cost collection date
- startdate – Cost start date
- enddate – Cost end date
- emr_unique_tag – EMR cluster associated tag
- net_unblendedcost – Total unblended daily dollar cost
- unblendedcost – Total unblended daily dollar cost
- cost_type – Daily cost
- service_name – AWS service for which the cost was incurred (Amazon EMR and Amazon EC2)
- emr_clusterid – EMR cluster ID
- emr_clustername – EMR cluster name
- loadtime – Table load date/time
emr_cluster_instances_usage – Captures the aggregated resource usage (vCores) and allocated resources for each EMR cluster node, and helps identify the idle time of the cluster:
- instancedatecollect – Instance usage collection date
- emr_instance_day_run_seconds – EMR instance active seconds in the day
- emr_region – EMR cluster AWS Region
- emr_clusterid – EMR cluster ID
- emr_clustername – EMR cluster name
- emr_cluster_fleet_type – EMR cluster fleet type
- emr_node_type – Instance node type
- emr_market – Market type (on-demand or provisioned)
- emr_instance_type – Instance size
- emr_ec2_instance_id – Corresponding EC2 instance ID
- emr_ec2_status – Running status
- emr_ec2_default_vcpus – Allocated vCPU
- emr_ec2_memory – EC2 instance memory
- emr_ec2_creation_datetime – EC2 instance creation date/time
- emr_ec2_end_datetime – EC2 instance end date/time
- emr_ec2_ready_datetime – EC2 instance ready date/time
- loadtime – Table load date/time
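A column such as emr_instance_day_run_seconds can be derived by clipping each instance’s lifetime to the day being reported. The following Python sketch shows one way to do that; it is an assumption about how the value might be computed, and the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def active_seconds_in_day(
    created: datetime, ended: Optional[datetime], day_start: datetime
) -> int:
    """Seconds an instance was alive within the 24 hours starting at day_start.

    `ended` is None for an instance that is still running.
    """
    day_end = day_start + timedelta(days=1)
    start = max(created, day_start)        # ignore time before the day began
    end = min(ended or day_end, day_end)   # ignore time after the day ended
    return max(0, int((end - start).total_seconds()))
```

Multiplying this value by the instance’s vCPU count gives the vCore-seconds that were available on the node, which can be compared with the vcore_seconds consumed by applications to estimate idle capacity.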

Prerequisites

You must have the following prerequisites before implementing the solution:

An EMR on EC2 cluster.
The EMR cluster must have a unique tag value defined. You can assign the tag directly on the Amazon EMR console or using Tag Editor. The recommended tag key is cost-center along with a unique value for your EMR cluster. After you create and apply user-defined tags, it can take up to 24 hours for the tag keys to appear on your cost allocation tags page for activation.
Activate the tag in AWS Billing. It takes about 24 hours to activate the tag if not done before. To activate the tag, follow these steps:
- On the AWS Billing and Cost Management console, choose Cost allocation tags from the navigation pane.
- Select the tag key that you want to activate.
- Choose Activate.
The Spark application’s name should follow the standardized naming convention. It consists of seven components separated by underscores: <business_unit>_<program>_<application>_<source>_<job_name>_<frequency>_<job_type>. These components are used to summarize the resource consumption and cost in the final report. For example: HR_PAYROLL_PS_PSPROD_TAXDUDUCTION_DLY_LD, FIN_CASHRECEIPT_GL_GLDB_MAIN_DLY_LD, or MKT_CAMPAIGN_CRM_CRMDB_TOPRATEDCAMPAIGN_DLY_LD. The application name must be supplied with the spark-submit command using the --name parameter with the standardized naming convention. If any of these components don’t have a value, hardcode the values with the following suggested names:
- frequency
- job_type
- Business_unit
The Lambda function should be able to connect to Cost Explorer, connect to the EMR cluster through the Resource Manager APIs, and load data into the RDS for PostgreSQL database. To do this, you need to configure the Lambda function as follows:
- VPC configuration – The Lambda function should be able to access the EMR cluster, Cost Explorer, AWS Secrets Manager, and Parameter Store. If access is not in place already, you can provide it by creating a virtual private cloud (VPC) that includes the EMR cluster, creating VPC endpoints for Parameter Store and Secrets Manager, and attaching them to the VPC. Because there is no VPC endpoint available for Cost Explorer, a private subnet and a route table are required to send VPC traffic to a public NAT gateway so that Lambda can connect to Cost Explorer. If your EMR cluster is in a public subnet, you must create a private subnet along with a custom route table and a public NAT gateway, which will allow the Cost Explorer connection to flow from the VPC private subnet. Refer to How do I set up a NAT gateway for a private subnet in Amazon VPC? for setup instructions and attach the newly created private subnet to the Lambda function explicitly.
- IAM role – The Lambda function needs to have an AWS Identity and Access Management (IAM) role with the following permissions: AmazonEC2ReadOnlyAccess, AWSCostExplorerFullAccess, and AmazonRDSDataFullAccess. This role will be created automatically during AWS CDK stack deployment; you don’t need to set it up separately.
The AWS CDK should be installed on AWS Cloud9 (preferred) or another development environment such as VSCode or Pycharm. For more information, refer to Prerequisites.
The RDS for PostgreSQL database (v10 or higher) credentials should be stored in Secrets Manager. For more information, refer to Storing database credentials in AWS Secrets Manager.
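Because the final report depends on the seven-part Spark application name described in the prerequisites, it can be useful to validate names before submitting jobs. The following Python sketch splits a name into its components; the AppName type, its field names, and parse_app_name are hypothetical helpers for illustration, and the sketch assumes no component contains an underscore itself.

```python
from typing import NamedTuple


class AppName(NamedTuple):
    """The seven underscore-separated components of a Spark application name."""
    business_unit: str
    program: str
    application: str
    source: str
    job_name: str
    frequency: str
    job_type: str


def parse_app_name(name: str) -> AppName:
    """Split a standardized application name, rejecting malformed ones."""
    parts = name.split("_")
    if len(parts) != 7 or not all(parts):
        raise ValueError(f"expected 7 non-empty components, got {len(parts)}: {name!r}")
    return AppName(*parts)
```

For example, parse_app_name("FIN_CASHRECEIPT_GL_GLDB_MAIN_DLY_LD").business_unit returns "FIN", which is the value a chargeback report would group by for the finance business unit.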

Create RDS tables

Create the data model tables defined in emr-cost-rds-tables-ddl.sql by logging in to the RDS for PostgreSQL instance manually and running the script in the public schema.

Use DBeaver or any compatible SQL client to connect to the RDS instance and validate that the tables have been created.

Deploy AWS CDK stacks

Complete the steps in this section to deploy the following resources using the AWS CDK:

Parameter Store to store required parameter values
IAM role for the Lambda function to help connect to Amazon EMR and underlying EC2 instances, Cost Explorer, CloudWatch, and Parameter Store
Lambda function

Clone the GitHub repo:

git clone git@github.com:aws-samples/attribute-amazon-emr-costs-to-your-end-users.git

Update the following environment parameters in cdk.context.json (this file can be found in the main directory):
1. yarn_url – YARN ResourceManager URL to read job run logs and metrics. This URL should be accessible within the VPC where Lambda will be deployed.
2. tbl_applicationlogs_lz – RDS temp table to store EMR application run logs.
3. tbl_applicationlogs – RDS table to store EMR application run logs.
4. tbl_emrcost – RDS table to capture daily EMR cluster usage cost.
5. tbl_emrinstance_usage – RDS table to store EMR cluster instance usage data.
6. emrcluster_id – EMR cluster instance ID.
7. emrcluster_name – EMR cluster name.
8. emrcluster_tag – Tag key assigned to EMR cluster.
9. emrcluster_tag_value – Unique value for EMR cluster tag.
10. emrcluster_role – Service role for Amazon EMR (EMR role).
11. emrcluster_linkedaccount – Account ID under which the EMR cluster is running.
12. postgres_rds – RDS for PostgreSQL connection details.
13. vpc_id – VPC ID in which the EMR cluster is configured and the cost metering Lambda function will be deployed.
14. vpc_subnets – Comma-separated private subnet IDs associated with the VPC.
15. sg_id – EMR security group ID.

The following is a sample cdk.context.json file after being populated with the parameters:

{
  "yarn_url": "http://dummy.compute-1.amazonaws.com:8088/ws/v1/cluster/apps",
  "tbl_applicationlogs_lz": "public.emr_applications_execution_log_lz",
  "tbl_applicationlogs": "public.emr_applications_execution_log",
  "tbl_emrcost": "public.emr_cluster_usage_cost",
  "tbl_emrinstance_usage": "public.emr_cluster_instances_usage",
  "emrcluster_id": "j-xxxxxxxxxx",
  "emrcluster_name": "EMRClusterName",
  "emrcluster_tag": "EMRClusterTag",
  "emrcluster_tag_value": "EMRClusterUniqueTagValue",
  "emrcluster_role": "EMRClusterServiceRole",
  "emrcluster_linkedaccount": "xxxxxxxxxxx",
  "postgres_rds": {
    "host": "xxxxxxxxx.amazonaws.com",
    "dbname": "dbname",
    "user": "username",
    "secretid": "DatabaseUserSecretID"
  },
  "vpc_id": "xxxxxxxxx",
  "vpc_subnets": "subnet-xxxxxxxxxxx",
  "sg_id": "xxxxxxxxxx"
}
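Before deploying, it may help to confirm that cdk.context.json contains every parameter listed above. The following Python sketch is a hypothetical pre-deployment check, not part of the repo; it only verifies that the 15 top-level keys are present and does not validate their values.

```python
import json

# The 15 top-level parameters described for cdk.context.json.
REQUIRED_KEYS = {
    "yarn_url", "tbl_applicationlogs_lz", "tbl_applicationlogs", "tbl_emrcost",
    "tbl_emrinstance_usage", "emrcluster_id", "emrcluster_name", "emrcluster_tag",
    "emrcluster_tag_value", "emrcluster_role", "emrcluster_linkedaccount",
    "postgres_rds", "vpc_id", "vpc_subnets", "sg_id",
}


def missing_keys(context: dict) -> list[str]:
    """Return the required parameter names absent from the context, sorted."""
    return sorted(REQUIRED_KEYS - set(context))


def check_context_file(path: str = "cdk.context.json") -> list[str]:
    """Load the context file from disk and report any missing parameters."""
    with open(path, encoding="utf-8") as handle:
        return missing_keys(json.load(handle))
```

An empty list from check_context_file() means all expected parameters are present.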

You can choose to deploy the AWS CDK stack using AWS Cloud9 or any other development environment according to your needs. For instructions to set up AWS Cloud9, refer to Getting started: basic tutorials for AWS Cloud9.

Go to AWS Cloud9 and choose File and Upload local files to upload the project folder.

Deploy the AWS CDK stack with the following code:

cd attribute-amazon-emr-costs-to-your-end-users/
pip install -r requirements.txt
cdk deploy --all

The deployed Lambda function requires two external libraries: psycopg2 and requests. The corresponding layer needs to be created and assigned to the Lambda function. For instructions to create a Lambda layer for the requests module, refer to Step-by-Step Guide to Creating an AWS Lambda Function Layer.

Creation of the psycopg2 package and layer is tied to the Python runtime version of the Lambda function. Provided that the Lambda function uses the Python 3.9 runtime, complete the following steps to create the corresponding layer package for psycopg2:

Download psycopg2_binary-2.9.9-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl from https://pypi.org/project/psycopg2-binary/#files.
Unzip and move the contents to a directory named python, then zip that directory (the archive name python.zip here is only an example):
```
unzip psycopg2_binary-2.9.9-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl -d python
zip -r python.zip python
```
Create a Lambda layer for psycopg2 using the zip file.
Assign the layer to the Lambda function by choosing Add a layer in the deployed function properties.
Validate the AWS CDK deployment.

Your Lambda function details should look similar to the following screenshot.

On the Systems Manager console, validate the Parameter Store content for actual values.

The IAM role details should look similar to the following code, which allows the Lambda function access to Amazon EMR and underlying EC2 instances, Cost Explorer, CloudWatch, Secrets Manager, and Parameter Store:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "ce:GetCostAndUsage",
        "ce:ListCostAllocationTags",
        "ec2:AttachNetworkInterface",
        "ec2:CreateNetworkInterface",
        "ec2:DeleteNetworkInterface",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeInstances",
        "ec2:DescribeNetworkInterfaces",
        "elasticmapreduce:Describe*",
        "elasticmapreduce:List*",
        "ssm:Describe*",
        "ssm:Get*",
        "ssm:List*"
      ],
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:DescribeLogStreams",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*",
      "Effect": "Allow"
    },
    {
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:*:*:*",
      "Effect": "Allow"
    }
  ]
}

Test the solution

To test the solution, you can run a Spark job that combines multiple files in the EMR cluster, and you can do this by creating separate steps within the cluster. Refer to Optimize Amazon EMR costs for legacy and Spark workloads for more details on how to add the jobs as steps to the EMR cluster.

Use the following sample command to submit the Spark job (emr_union_job.py).
It takes in three arguments:
1. <input_full_path> – The Amazon S3 location of the data file that is read in by the Spark job. The path should not be changed. The input_full_path is s3://aws-blogs-artifacts-public/artifacts/BDB-2997/sample-data/input/part-00000-a0885743-e0cb-48b1-bc2b-05eb748ab898-c000.snappy.parquet
2. <output_path> – The S3 folder where the results are written to.
3. <number of copies to be unioned> – By changing the input to the Spark job, you can make sure the job runs for different amounts of time and also change the number of Spot nodes used.

spark-submit --deploy-mode cluster --name HR_PAYROLL_PS_PSPROD_TAXDUDUCTION_DLY_LD s3://aws-blogs-artifacts-public/artifacts/BDB-2997/scripts/emr_union_job.py s3://aws-blogs-artifacts-public/artifacts/BDB-2997/sample-data/input/part-00000-a0885743-e0cb-48b1-bc2b-05eb748ab898-c000.snappy.parquet s3://<output_bucket>/<output_path>/ 6

spark-submit --deploy-mode cluster --name FIN_CASHRECEIPT_GL_GLDB_MAIN_DLY_LD s3://aws-blogs-artifacts-public/artifacts/BDB-2997/scripts/emr_union_job.py s3://aws-blogs-artifacts-public/artifacts/BDB-2997/sample-data/input/part-00000-a0885743-e0cb-48b1-bc2b-05eb748ab898-c000.snappy.parquet s3://<output_bucket>/<output_path>/ 12

The following screenshot shows the log of the steps run on the Amazon EMR console.

Run the deployed Lambda function from the Lambda console. This loads the daily application log, EMR dollar usage, and EMR instance usage details into their respective RDS tables.

The following screenshot of the Amazon RDS query editor shows the results for public.emr_applications_execution_log.

The following screenshot shows the results for public.emr_cluster_usage_cost.

The following screenshot shows the results for public.emr_cluster_instances_usage.

Cost can be calculated using the preceding three tables based on your requirements. In the following SQL query, you calculate the cost based on the relative usage of all applications in a day. You first identify the total vcore-seconds CPU consumed in a day and then find out the percentage share of an application. This drives the cost based on the overall cluster cost in a day.

Consider the following example scenario, where 10 applications ran on the cluster for a given day. You would use the following sequence of steps to calculate the chargeback cost:

Calculate the relative percentage usage of each application (vcore-seconds CPU consumed by the app / total vcore-seconds CPU consumed).
Now that you have the relative resource consumption of each application, distribute the cluster cost to each application. Let’s assume that the total EMR cluster cost for that date is $400.

app_id	app_name	runtime_seconds	vcore_seconds	% Relative Usage	Amazon EMR Cost ($)
application_00001	app1	10	120	5%	19.83
application_00002	app2	5	60	2%	9.91
application_00003	app3	4	45	2%	7.43
application_00004	app4	70	840	35%	138.79
application_00005	app5	21	300	12%	49.57
application_00006	app6	4	48	2%	7.93
application_00007	app7	12	150	6%	24.78
application_00008	app8	52	620	26%	102.44
application_00009	app9	12	130	5%	21.48
application_00010	app10	9	108	4%	17.84
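The table above can be reproduced with a few lines of Python. This is a minimal sketch of the proportional allocation described in the two steps, using the example’s vcore_seconds values and the assumed $400 daily cluster cost; the function name allocate_cost is chosen for illustration and this is not the SQL query from the repo.

```python
def allocate_cost(vcore_seconds: dict[str, int], cluster_cost: float) -> dict[str, float]:
    """Split the daily cluster cost across apps by their share of vcore-seconds."""
    total = sum(vcore_seconds.values())
    if total == 0:
        # No application consumed CPU, so there is nothing to attribute.
        return {app: 0.0 for app in vcore_seconds}
    return {
        app: round(cluster_cost * used / total, 2)
        for app, used in vcore_seconds.items()
    }


# vcore_seconds per application, taken from the example table.
usage = {
    "app1": 120, "app2": 60, "app3": 45, "app4": 840, "app5": 300,
    "app6": 48, "app7": 150, "app8": 620, "app9": 130, "app10": 108,
}
costs = allocate_cost(usage, 400.0)
```

Here app4, which consumed 840 of the 2,421 total vcore-seconds, is attributed $138.79, matching the table.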

A sample chargeback cost calculation SQL query is available on the GitHub repo.

You can use the SQL query to create a report dashboard to plot multiple charts for the insights. The following are two examples created using QuickSight.

The following is a daily bar chart.

The following shows total dollars consumed.

Solution cost

Let’s assume we’re calculating for an environment that runs 1,000 jobs daily, and we run this solution daily:

Lambda costs – One run per day requires 30 Lambda function invocations per month.
Amazon RDS cost – The total number of records in the public.emr_applications_execution_log table for a 30-day month would be 30,000 records, which translates to 5.72 MB of storage. If we consider the other two smaller tables and storage overhead, the overall monthly storage requirement would be approximately 12 MB.

In summary, the solution cost according to the AWS Pricing Calculator is $34.20/year, which is negligible.
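The storage figure can be checked with simple arithmetic. The sketch below assumes an average of 200 bytes per record; that value is not stated in the post and is inferred here only because it reproduces the 5.72 MB estimate.

```python
def estimate_storage_mb(jobs_per_day: int, days: int, bytes_per_record: int) -> float:
    """Estimate table size in MB (1 MB = 1024 * 1024 bytes) for one record per job run."""
    records = jobs_per_day * days
    return records * bytes_per_record / (1024 * 1024)


# 1,000 jobs a day for 30 days at an assumed 200 bytes per record.
monthly_mb = estimate_storage_mb(jobs_per_day=1_000, days=30, bytes_per_record=200)
```

Rounded to two decimals, monthly_mb comes to 5.72, so doubling it for the two smaller tables and storage overhead lands close to the 12 MB quoted above.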

Clean up

To avoid ongoing charges for the resources that you created, complete the following steps:

Delete the AWS CDK stacks (for example, by running cdk destroy --all from the project directory).
Delete the QuickSight report and dashboard, if created.

Run the following SQL to drop the tables:

drop table public.emr_applications_execution_log_lz;
drop table public.emr_applications_execution_log;
drop table public.emr_cluster_usage_cost;
drop table public.emr_cluster_instances_usage;

Conclusion

With this solution, you can deploy a chargeback model to attribute costs to users and groups using the EMR cluster. You can also identify options for optimization, scaling, and separation of workloads to different clusters based on usage and growth needs.

You can collect the metrics for a longer duration to observe trends in the usage of Amazon EMR resources and use that for forecasting purposes.

If you have any thoughts or questions, leave them in the comments section.

About the Authors

Raj Patel is AWS Lead Consultant for Data Analytics solutions based out of India. He specializes in building and modernising analytical solutions. His background is in data warehouse and data lake architecture, development, and administration. He has been in the data and analytics field for over 14 years.

Ramesh Raghupathy is a Senior Data Architect with WWCO ProServe at AWS. He works with AWS customers to architect, deploy, and migrate to data warehouses and data lakes on the AWS Cloud. While not at work, Ramesh enjoys traveling, spending time with family, and yoga.

Gaurav Jain is a Sr Data Architect with AWS Professional Services, specialized in big data, and helps customers modernize their data platforms on the cloud. He is passionate about building the right analytics solutions to gain timely insights and make critical business decisions. Outside of work, he loves to spend time with his family and likes watching movies and sports.

Dipal Mahajan is a Lead Consultant with Amazon Web Services based out of India, where he guides global customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings extensive experience in Software Development, Architecture and Analytics from industries like finance, telecom, retail and healthcare.
