
Migrate Delta tables from Azure Data Lake Storage to Amazon S3 using AWS Glue


Organizations are increasingly adopting a multi-cloud strategy to run their production workloads. We often see requests from customers who have started their data journey by building data lakes on Microsoft Azure and now want to extend access to that data to AWS services. Customers want to use a variety of AWS analytics, data, AI, and machine learning (ML) services like AWS Glue, Amazon Redshift, and Amazon SageMaker to build more cost-efficient, performant data solutions that harness the strengths of individual cloud service providers for their business use cases.

In such scenarios, data engineers face challenges in connecting to and extracting data from storage containers on Microsoft Azure. Customers typically use Azure Data Lake Storage Gen2 (ADLS Gen2) as their data lake storage medium, store the data in open table formats like Delta tables, and want to use AWS analytics services like AWS Glue to read those Delta tables. AWS Glue, with its ability to process data using Apache Spark and connect to various data sources, is a suitable solution for addressing the challenges of accessing data across multiple cloud environments.

AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, ML, and application development. AWS Glue custom connectors help you discover and integrate additional data sources, such as software as a service (SaaS) applications and your own custom data sources. With just a few clicks, you can search for and subscribe to connectors from AWS Marketplace and begin your data preparation workflow in minutes.

In this post, we explain how you can extract data from ADLS Gen2 using the Azure Data Lake Storage Connector for AWS Glue. We specifically demonstrate how to import data stored in Delta tables in ADLS Gen2, and we provide step-by-step guidance on how to configure the connector, author an AWS Glue ETL (extract, transform, and load) script, and load the extracted data into Amazon Simple Storage Service (Amazon S3).

Azure Data Lake Storage Connector for AWS Glue

The Azure Data Lake Storage Connector for AWS Glue simplifies the process of connecting AWS Glue jobs to extract data from ADLS Gen2. It uses Hadoop's FileSystem interface and the ADLS Gen2 connector for Hadoop. The Azure Data Lake Storage Connector for AWS Glue also includes the hadoop-azure module, which lets you run Apache Hadoop or Apache Spark jobs directly against data in ADLS. When the connector is added to the AWS Glue environment, AWS Glue loads the library from the Amazon Elastic Container Registry (Amazon ECR) repository during initialization (as a connector). When AWS Glue has internet access, the Spark job in AWS Glue can read from and write to ADLS.
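The full ETL script appears later in this post; the following minimal PySpark sketch (with placeholder account, key, container, and path values) illustrates the two pieces the hadoop-azure module relies on when the connector is attached to the job: the fs.azure.account.key.&lt;account&gt;.dfs.core.windows.net property for Shared Key authentication and an abfss:// URI for the table location:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Shared Key authentication: hadoop-azure reads the storage account key from this property
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",  # placeholder account name
    "<storage-account-key>"  # placeholder account key
)

# abfss:// URIs are resolved through Hadoop's FileSystem interface by the ADLS Gen2 connector
df = spark.read.format("delta").load(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/path-to-delta-table-files/"
)
df.show(10)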

With the availability of the Azure Data Lake Storage Connector for AWS Glue in AWS Marketplace, an AWS Glue connection makes sure you have the required packages available in your AWS Glue job.

For this post, we use the Shared Key authentication method.

Solution overview

In this post, our objective is to migrate a product table named sample_delta_table, which currently resides in ADLS Gen2, to Amazon S3. To accomplish this, we use AWS Glue, the Azure Data Lake Storage Connector for AWS Glue, and AWS Secrets Manager to securely store the Azure shared key. We use an AWS Glue serverless ETL job, configured with the connector, to establish a connection to ADLS using shared key authentication over the public internet. After the table is migrated to Amazon S3, we use Amazon Athena to query the Delta Lake table.

The following architecture diagram illustrates how AWS Glue facilitates data ingestion from ADLS.

Prerequisites

You need the following prerequisites:

Configure your ADLS Gen2 account in Secrets Manager

Complete the following steps to create a secret in Secrets Manager to store the ADLS credentials:

  1. On the Secrets Manager console, choose Store a new secret.
  2. For Secret type, select Other type of secret.
  3. Enter the key accountName for the ADLS Gen2 storage account name.
  4. Enter the key accountKey for the ADLS Gen2 storage account key.
  5. Enter the key container for the ADLS Gen2 container.
  6. Leave the rest of the options as default and choose Next.
  7. Enter a name for the secret (for example, adlstorage_credentials).
  8. Choose Next.
  9. Complete the rest of the steps to store the secret.
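If you prefer to create the secret programmatically, the following boto3 sketch stores the same three keys (the account name, account key, and container values are placeholders you must replace):
import json
import boto3

secretsmanager = boto3.client("secretsmanager")

# Store the ADLS Gen2 credentials as a JSON key-value secret
secretsmanager.create_secret(
    Name="adlstorage_credentials",
    SecretString=json.dumps({
        "accountName": "<your-storage-account-name>",  # placeholder
        "accountKey": "<your-storage-account-key>",    # placeholder
        "container": "<your-container-name>"           # placeholder
    })
)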

Subscribe to the Azure Data Lake Storage Connector for AWS Glue

The Azure Data Lake Storage Connector for AWS Glue simplifies the process of connecting AWS Glue jobs to extract data from ADLS Gen2. The connector is available as an AWS Marketplace offering.

Complete the following steps to subscribe to the connector:

  1. Log in to your AWS account with the necessary permissions.
  2. Navigate to the AWS Marketplace page for the Azure Data Lake Storage Connector for AWS Glue.
  3. Choose Continue to Subscribe.
  4. Choose Continue to Configuration after reading the EULA.
  5. For Fulfillment option, choose Glue 4.0.
  6. For Software version, choose the latest software version.
  7. Choose Continue to Launch.

Create a custom connection in AWS Glue

After you're subscribed to the connector, complete the following steps to create an AWS Glue connection based on it. This connection will be added to the AWS Glue job to make sure the connector is available and the data store connection information is accessible to establish a network pathway.

To create the AWS Glue connection, you must activate the Azure Data Lake Storage Connector for AWS Glue on the AWS Glue Studio console. After you choose Continue to Launch in the earlier steps, you're redirected to the connector landing page.

  1. In the Configuration details section, choose Usage instructions.
  2. Choose Activate the Glue connector from AWS Glue Studio.

The AWS Glue Studio console offers the option to either activate the connector, or activate it and create the connection in one step. For this post, we choose the second option.

  3. For Connector, confirm Azure ADLS Connector for AWS Glue 4.0 is selected.
  4. For Name, enter a name for the connection (for example, AzureADLSStorageGen2Connection).
  5. Enter an optional description.
  6. Choose Create connection and activate connector.

The connection is now ready for use. The connector and connection information is visible on the Data connections page of the AWS Glue console.
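If you want to confirm the connection programmatically, a minimal boto3 check might look like the following (the connection name matches the example above):
import boto3

glue = boto3.client("glue")

# Look up the connection created from the Marketplace connector
response = glue.get_connection(Name="AzureADLSStorageGen2Connection")
connection = response["Connection"]
print(connection["Name"], connection["ConnectionType"])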


Read Delta tables from ADLS Gen2 using the connector in an AWS Glue ETL job

Complete the following steps to create an AWS Glue job and configure the AWS Glue connection and job parameter options:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Choose Author code with a script editor and choose Script editor.
  3. Choose Create script and go to the Job details section.
  4. Update the settings for Name and IAM role.
  5. Under Advanced properties, add the AWS Glue connection AzureADLSStorageGen2Connection created in the earlier steps.
  6. For Job parameters, add the key --datalake-formats with the value delta.
  7. Use the following script to read the Delta table from ADLS. Provide the path to where you have the Delta table files in your Azure storage account container and the S3 bucket for writing the Delta files to the output S3 location.
from pyspark.sql import SparkSession
from delta.tables import *
import boto3
import json

spark = SparkSession.builder.getOrCreate()

# Retrieve the ADLS Gen2 credentials from Secrets Manager
sm = boto3.client('secretsmanager')
response = sm.get_secret_value(SecretId="adlstorage_credentials")
value = json.loads(response['SecretString'])
account_name_sparkconfig = f"fs.azure.account.key.{value['accountName']}.dfs.core.windows.net"
account_name = value['accountName']
account_key = value['accountKey']
container_name = value['container']
path = f"abfss://{container_name}@{account_name}.dfs.core.windows.net/path-to-delta-table-files/"
s3DeltaTablePath = "s3://yourdatalakebucketname/deltatablepath/"

# Method: Shared Key
spark.conf.set(account_name_sparkconfig, account_key)

# Read the Delta table from ADLS Gen2 storage
df = spark.read.format("delta").load(path)

# Write the Delta table to the S3 path
if DeltaTable.isDeltaTable(spark, s3DeltaTablePath):
    s3deltaTable = DeltaTable.forPath(spark, s3DeltaTablePath)
    print("Merge into existing S3 Delta table")
    (s3deltaTable.alias("target")
        .merge(df.alias("source"), "target.product_id = source.product_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
else:
    print("Create Delta table in S3.")
    df.write.format("delta").save(s3DeltaTablePath)

  8. Choose Run to start the job.
  9. On the Runs tab, confirm the job ran successfully.
  10. On the Amazon S3 console, verify the Delta files in the S3 bucket (Delta table path).
  11. Create a database and table in Athena to query the migrated Delta table in Amazon S3.

You can accomplish this step using an AWS Glue crawler. The crawler can automatically crawl your Delta table stored in Amazon S3 and create the necessary metadata in the AWS Glue Data Catalog. Athena can then use this metadata to query and analyze the Delta table seamlessly. For more information, see Crawl Delta Lake tables using AWS Glue crawlers.
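If you want to create such a crawler programmatically, the following boto3 sketch shows one approach; the crawler name and IAM role are hypothetical, and you should confirm the DeltaTargets options against the current Glue CreateCrawler API:
import boto3

glue = boto3.client("glue")

# Create a crawler that registers the migrated Delta table in the Data Catalog
glue.create_crawler(
    Name="delta-table-crawler",  # hypothetical crawler name
    Role="GlueCrawlerRole",      # hypothetical IAM role with S3 and Glue permissions
    DatabaseName="deltadb",
    Targets={
        "DeltaTargets": [
            {
                "DeltaTables": ["s3://yourdatalakebucketname/deltatablepath/"],
                "WriteManifest": False
            }
        ]
    }
)
glue.start_crawler(Name="delta-table-crawler")

Alternatively, you can register the table manually by running the following DDL in the Athena query editor: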

CREATE EXTERNAL TABLE deltadb.sample_delta_table
LOCATION 's3://yourdatalakebucketname/deltatablepath/'
TBLPROPERTIES ('table_type'='DELTA');

  12. Query the Delta table:

SELECT * FROM "deltadb"."sample_delta_table" LIMIT 10;

By following the steps outlined in this post, you have successfully migrated a Delta table from ADLS Gen2 to Amazon S3 using an AWS Glue ETL job.

Read the Delta table in an AWS Glue notebook

The following are optional steps if you want to read the Delta table from ADLS Gen2 in an AWS Glue notebook:

  1. Create a notebook and run the following code in the first notebook cell to configure the AWS Glue connection and --datalake-formats in an interactive session:
%idle_timeout 30
%glue_version 4.0
%worker_type G.1X
%number_of_workers 5
%connections AzureADLSStorageGen2Connection
%%configure
{
   "--datalake-formats": "delta"
}

  2. Run the following code in a new cell to read the Delta table stored in ADLS Gen2. Provide the path to where you have the Delta files in your Azure storage account container and the S3 bucket for writing the Delta files to Amazon S3.
from pyspark.sql import SparkSession
from delta.tables import *
import boto3
import json

spark = SparkSession.builder.getOrCreate()

# Retrieve the ADLS Gen2 credentials from Secrets Manager
sm = boto3.client('secretsmanager')
response = sm.get_secret_value(SecretId="adlstorage_credentials")
value = json.loads(response['SecretString'])
account_name_sparkconfig = f"fs.azure.account.key.{value['accountName']}.dfs.core.windows.net"
account_name = value['accountName']
account_key = value['accountKey']
container_name = value['container']
path = f"abfss://{container_name}@{account_name}.dfs.core.windows.net/path-to-delta-table-files/"
s3DeltaTablePath = "s3://yourdatalakebucketname/deltatablepath/"

# Method: Shared Key
spark.conf.set(account_name_sparkconfig, account_key)

# Read the Delta table from ADLS Gen2 storage
df = spark.read.format("delta").load(path)

# Write the Delta table to the S3 path
if DeltaTable.isDeltaTable(spark, s3DeltaTablePath):
    s3deltaTable = DeltaTable.forPath(spark, s3DeltaTablePath)
    print("Merge into existing S3 Delta table")
    (s3deltaTable.alias("target")
        .merge(df.alias("source"), "target.product_id = source.product_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
else:
    print("Create Delta table in S3.")
    df.write.format("delta").save(s3DeltaTablePath)

Clean up

To clean up your resources, complete the following steps:

  1. Remove the AWS Glue job, database, table, and connection:
    1. On the AWS Glue console, choose Tables in the navigation pane, select sample_delta_table, and choose Delete.
    2. Choose Databases in the navigation pane, select deltadb, and choose Delete.
    3. Choose Connections in the navigation pane, select AzureADLSStorageGen2Connection, and on the Actions menu, choose Delete.
  2. On the Secrets Manager console, choose Secrets in the navigation pane, select adlstorage_credentials, and on the Actions menu, choose Delete secret.
  3. If you are not going to use this connector anymore, you can cancel the subscription to the connector:
    1. On the AWS Marketplace console, choose Manage subscriptions.
    2. Select the subscription for the product that you want to cancel, and on the Actions menu, choose Cancel subscription.
    3. Read the information provided and select the acknowledgement check box.
    4. Choose Yes, cancel subscription.
  4. On the Amazon S3 console, delete the data in the S3 bucket that you used in the earlier steps.

You can also use the AWS Command Line Interface (AWS CLI) to remove the AWS Glue and Secrets Manager resources. Remove the AWS Glue job, database, table, connection, and Secrets Manager secret with the following commands:

aws glue delete-job --job-name <your_job_name>
aws glue delete-connection --connection-name <your_connection_name>
aws secretsmanager delete-secret --secret-id <your_secretsmanager_id>
aws glue delete-table --database-name deltadb --name sample_delta_table
aws glue delete-database --name deltadb

Conclusion

In this post, we demonstrated a real-world example of migrating a Delta table from Azure Data Lake Storage Gen2 to Amazon S3 using AWS Glue. We used an AWS Glue serverless ETL job, configured with an AWS Marketplace connector, to establish a connection to ADLS using shared key authentication over the public internet. Additionally, we used Secrets Manager to securely store the shared key and seamlessly integrate it within the AWS Glue ETL job, providing a secure and efficient migration process. Finally, we provided guidance on querying the Delta Lake table from Athena.

Try out the solution for your own use case, and let us know your feedback and questions in the comments.


About the Authors

Nitin Kumar is a Cloud Engineer (ETL) at Amazon Web Services, specializing in AWS Glue. With a decade of experience, he excels in helping customers with their big data workloads, focusing on data processing and analytics. He is committed to helping customers overcome ETL challenges and develop scalable data processing and analytics pipelines on AWS. In his free time, he likes to watch movies and spend time with his family.

Shubham Purwar is a Cloud Engineer (ETL) at AWS Bengaluru, specializing in AWS Glue and Amazon Athena. He is passionate about helping customers solve issues related to their ETL workloads and implement scalable data processing and analytics pipelines on AWS. In his free time, Shubham likes to spend time with his family and travel around the world.

Pramod Kumar P is a Solutions Architect at Amazon Web Services. With 19 years of technology experience and close to a decade of designing and architecting connectivity solutions (IoT) on AWS, he guides customers to build solutions with the right architectural tenets to meet their business outcomes.

Madhavi Watve is a Senior Solutions Architect at Amazon Web Services, providing help and guidance to a broad range of customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. She brings over 20 years of technology experience in software development and architecture and is a data analytics specialist.

Swathi S is a Technical Account Manager with the Enterprise Support team at Amazon Web Services. She has over 6 years of experience with AWS on big data technologies and specializes in analytics frameworks. She is passionate about helping AWS customers navigate the cloud space and enjoys assisting with the design and optimization of analytics workloads on AWS.
