
Unlock scalable analytics with a secure connectivity pattern in AWS Glue to read from or write to Snowflake


In today’s data-driven world, the ability to seamlessly integrate and utilize diverse data sources is critical for gaining actionable insights and driving innovation. As organizations increasingly rely on data stored across various platforms, such as Snowflake, Amazon Simple Storage Service (Amazon S3), and various software as a service (SaaS) applications, the challenge of bringing these disparate data sources together has never been more pressing.

AWS Glue is a robust data integration service that facilitates the consolidation of data from different origins, empowering businesses to use the full potential of their data assets. By using AWS Glue to integrate data from Snowflake, Amazon S3, and SaaS applications, organizations can unlock new opportunities in generative artificial intelligence (AI), machine learning (ML), business intelligence (BI), and self-service analytics, or feed data to underlying applications.

In this post, we explore how AWS Glue can serve as the data integration service to bring the data from Snowflake into your data integration strategy, enabling you to harness the power of your data ecosystem and drive meaningful outcomes across various use cases.

Use case

Consider a large ecommerce company that relies heavily on data-driven insights to optimize its operations, marketing strategies, and customer experiences. The company stores vast amounts of transactional data, customer information, and product catalogs in Snowflake. However, they also generate and collect data from various other sources, such as web logs stored in Amazon S3, social media platforms, and third-party data providers. To gain a comprehensive understanding of their business and make informed decisions, the company needs to integrate and analyze data from all these sources seamlessly.

One crucial business requirement for the ecommerce company is to generate a Pricing Summary Report that provides a detailed analysis of pricing and discounting strategies. This report is essential for understanding revenue streams, identifying opportunities for optimization, and making data-driven decisions regarding pricing and promotions. After the Pricing Summary Report is generated and stored in Amazon S3, the company can use AWS analytics services to generate interactive BI dashboards and run one-time queries on the report. This allows business analysts and decision-makers to gain valuable insights, visualize key metrics, and explore the data in depth, enabling informed decision-making and strategic planning for pricing and promotional strategies.

Solution overview

The following architecture diagram illustrates a secure and efficient solution for integrating Snowflake data with Amazon S3, using the native Snowflake connector in AWS Glue. This setup uses AWS PrivateLink to provide secure connectivity between AWS services across different virtual private clouds (VPCs), eliminating the need to expose data to the public internet, which is a critical need for organizations.


The following are the key components and steps in the integration process:

  1. Establish a secure, private connection between your AWS account and your Snowflake account using PrivateLink. This involves creating VPC endpoints in both the AWS and Snowflake VPCs, making sure data transfer stays within the AWS network.
  2. Use Amazon Route 53 to create a private hosted zone that resolves the Snowflake endpoint within your VPC. This allows AWS Glue jobs to connect to Snowflake using a private DNS name, maintaining the security and integrity of the data transfer (you can verify the resolution with the sketch after this list).
  3. Create an AWS Glue job to handle the extract, transform, and load (ETL) process on data from Snowflake to Amazon S3. The AWS Glue job uses the secure connection established by the VPC endpoints to access Snowflake data. Snowflake credentials are securely stored in AWS Secrets Manager. The AWS Glue job retrieves these credentials at runtime to authenticate and connect to Snowflake, providing secure access management. A VPC endpoint lets you securely communicate with this service without traversing the public internet, enhancing security and performance.
  4. Store the extracted and transformed data in Amazon S3. Organize the data into appropriate structures, such as partitioned folders, to optimize query performance and data management. We use a VPC endpoint to securely communicate with this service without traversing the public internet, enhancing security and performance. We also use Amazon S3 to store AWS Glue scripts, logs, and temporary data generated during the ETL process.
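
To confirm that the private hosted zone resolves the Snowflake endpoint to private IP addresses inside your VPC, you can run a quick check from a host in the VPC (for example, a bastion host or a notebook instance). The following is a minimal sketch using only the Python standard library; the privatelink-account-url value and the VPC CIDR are placeholders you substitute with your own.

import ipaddress
import socket

# Placeholders: substitute your privatelink-account-url value (from the
# prerequisites) and your VPC CIDR (the VpcCIDR stack parameter).
SNOWFLAKE_PRIVATELINK_URL = "<privatelink-account-url>"
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

# Resolve the Snowflake PrivateLink hostname; run this from inside the VPC.
addresses = {
    info[4][0]
    for info in socket.getaddrinfo(SNOWFLAKE_PRIVATELINK_URL, 443, family=socket.AF_INET)
}

for address in sorted(addresses):
    inside = ipaddress.ip_address(address) in VPC_CIDR
    print(f"{address}: {'inside VPC CIDR' if inside else 'NOT inside VPC CIDR'}")

If the hostname resolves to addresses outside your VPC CIDR, revisit the private hosted zone and VPC endpoint configuration before continuing.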

This approach offers the following benefits:

  • Enhanced security – By using PrivateLink and VPC endpoints, data transfer between Snowflake and Amazon S3 is secured within the AWS network, reducing exposure to potential security threats.
  • Efficient data integration – AWS Glue simplifies the ETL process, providing a scalable and flexible solution for data integration between Snowflake and Amazon S3.
  • Cost-effectiveness – Using Amazon S3 for data storage, combined with the AWS Glue pay-as-you-go pricing model, helps optimize costs associated with data management and integration.
  • Scalability and flexibility – The architecture supports scalable data transfers and can be extended to integrate additional data sources and destinations as needed.

By following this architecture and taking advantage of the capabilities of AWS Glue, PrivateLink, and associated AWS services, organizations can achieve a robust, secure, and efficient data integration solution, enabling them to harness the full potential of their Snowflake and Amazon S3 data for advanced analytics and BI.

Prerequisites

Complete the following prerequisites before setting up the solution:

  1. Verify that you have access to an AWS account with the required permissions to provision resources in services such as Route 53, Amazon S3, AWS Glue, Secrets Manager, and Amazon Virtual Private Cloud (Amazon VPC) using AWS CloudFormation, which lets you model, provision, and manage AWS and third-party resources by treating infrastructure as code.
  2. Confirm that you have access to Snowflake hosted in AWS with the required permissions to run the steps to configure PrivateLink. Refer to Enabling AWS PrivateLink in the Snowflake documentation to verify the steps, required access level, and service level to set the configurations. After you enable PrivateLink, save the values of the following parameters provided by Snowflake to use in the next step in this post (see the retrieval sketch after this list):
    1. privatelink-vpce-id
    2. privatelink-account-url
    3. privatelink_ocsp-url
    4. regionless-snowsight-privatelink-url
  3. Make sure you have a Snowflake user snowflakeUser and password snowflakePassword with the required permissions to read from and write to Snowflake. The user and password are used in the AWS Glue connection to authenticate within Snowflake.
  4. If your Snowflake user doesn’t have a default warehouse set, you will need a warehouse name. We use snowflakeWarehouse as a placeholder for the warehouse name; replace it with your actual warehouse name.
  5. If you’re new to Snowflake, consider completing the Snowflake in 20 Minutes tutorial. By the end of the tutorial, you should know how to create the required Snowflake objects, including warehouses, databases, and tables for storing and querying data.
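
If you prefer to retrieve the PrivateLink parameters programmatically, Snowflake exposes them through the SYSTEM$GET_PRIVATELINK_CONFIG() function. The following sketch assumes the snowflake-connector-python package and a user with sufficient privileges; the account identifier and credentials are placeholders.

import json

import snowflake.connector  # pip install snowflake-connector-python

# Connect once (over the public endpoint) to read the PrivateLink parameters.
# Account, user, and password are placeholders.
conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="snowflakeUser",
    password="snowflakePassword",
)

try:
    cursor = conn.cursor()
    cursor.execute("SELECT SYSTEM$GET_PRIVATELINK_CONFIG()")
    config = json.loads(cursor.fetchone()[0])
    # Save these values for the CloudFormation parameters in the next step.
    for key in (
        "privatelink-vpce-id",
        "privatelink-account-url",
        "privatelink_ocsp-url",
        "regionless-snowsight-privatelink-url",
    ):
        print(f"{key}: {config.get(key)}")
finally:
    conn.close()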

Create resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup of the base resources. You can review and customize it to suit your needs if needed. The CloudFormation template generates the following resources, which are referenced throughout the rest of this post:

  • A VPC with one public subnet and three private subnets, along with the VPC endpoints used by the solution
  • A Route 53 private hosted zone that resolves the Snowflake PrivateLink endpoints
  • A security group for AWS Glue
  • An IAM service role for AWS Glue (blog-glue-snowflake-GlueServiceRole-*)
  • A Secrets Manager secret (blog-glue-snowflake-credentials)
  • An S3 bucket (blog-glue-snowflake-*)
  • A Data Catalog database (db_blog_glue_snowflake)
  • An Athena workgroup (blog-workgroup)

To create your resources, complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack to launch the CloudFormation stack.
  3. Provide the CloudFormation stack parameters:
    1. For PrivateLinkAccountURL, enter the value of the parameter privatelink-account-url obtained in the prerequisites.
    2. For PrivateLinkOcspURL, enter the value of the parameter privatelink_ocsp-url obtained in the prerequisites.
    3. For PrivateLinkVpceId, enter the value of the parameter privatelink-vpce-id obtained in the prerequisites.
    4. For PrivateSubnet1CIDR, enter the IP addresses for your private subnet 1.
    5. For PrivateSubnet2CIDR, enter the IP addresses for your private subnet 2.
    6. For PrivateSubnet3CIDR, enter the IP addresses for your private subnet 3.
    7. For PublicSubnet1CIDR, enter the IP addresses for your public subnet 1.
    8. For RegionlessSnowsightPrivateLinkURL, enter the value of the parameter regionless-snowsight-privatelink-url obtained in the prerequisites.
    9. For VpcCIDR, enter the IP addresses for your VPC.
  4. Choose Next.
  5. Select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Submit and wait for the stack creation step to complete.

After the CloudFormation stack is successfully created, you can see all the resources created on the Resources tab.

Navigate to the Outputs tab to see the outputs provided by the CloudFormation stack. Save the values of the outputs GlueSecurityGroupId, VpcId, and PrivateSubnet1Id to use in the next step in this post.
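
If you want to capture these outputs programmatically instead of copying them from the console, a short boto3 call against the stack works. This sketch assumes the stack name blog-glue-snowflake, the name referenced later in the cleanup section.

import boto3

cloudformation = boto3.client("cloudformation")

# The stack name is assumed to match the one used in this post.
stack = cloudformation.describe_stacks(StackName="blog-glue-snowflake")["Stacks"][0]
outputs = {o["OutputKey"]: o["OutputValue"] for o in stack["Outputs"]}

for key in ("GlueSecurityGroupId", "VpcId", "PrivateSubnet1Id"):
    print(f"{key}: {outputs[key]}")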


Update the Secrets Manager secret with Snowflake credentials for the AWS Glue connection

To update the Secrets Manager secret with the user snowflakeUser, password snowflakePassword, and warehouse snowflakeWarehouse that you will use in the AWS Glue connection to establish a connection to Snowflake, complete the following steps:

  1. On the Secrets Manager console, choose Secrets in the navigation pane.
  2. Open the secret blog-glue-snowflake-credentials.
  3. Under Secret value, choose Retrieve secret value.


  4. Choose Edit.
  5. Enter the user snowflakeUser, password snowflakePassword, and warehouse snowflakeWarehouse for the keys sfUser, sfPassword, and sfWarehouse, respectively.
  6. Choose Save.
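
Alternatively, you can update the secret from an SDK. The following boto3 sketch writes the same three keys the console steps set; the credential values are placeholders for your own.

import json

import boto3

secretsmanager = boto3.client("secretsmanager")

# The keys match those the AWS Glue connection expects; values are placeholders.
secretsmanager.put_secret_value(
    SecretId="blog-glue-snowflake-credentials",
    SecretString=json.dumps(
        {
            "sfUser": "snowflakeUser",
            "sfPassword": "snowflakePassword",
            "sfWarehouse": "snowflakeWarehouse",
        }
    ),
)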

Create the AWS Glue connection for Snowflake

An AWS Glue connection is an AWS Glue Data Catalog object that stores login credentials, URI strings, VPC information, and more for a particular data store. AWS Glue crawlers, jobs, and development endpoints use connections in order to access certain types of data stores. To create an AWS Glue connection to Snowflake, complete the following steps:

  1. On the AWS Glue console, in the navigation pane, under Data catalog, choose Connections.
  2. Choose Create connection.
  3. For Data sources, search for and select Snowflake.
  4. Choose Next.


  5. For Snowflake URL, enter https://<privatelink-account-url>.

To obtain the Snowflake PrivateLink account URL, refer to the parameters obtained in the prerequisites.

  6. For AWS Secret, choose the secret blog-glue-snowflake-credentials.
  7. For VPC, choose the VpcId value obtained from the CloudFormation stack output.
  8. For Subnet, choose the PrivateSubnet1Id value obtained from the CloudFormation stack output.
  9. For Security groups, choose the GlueSecurityGroupId value obtained from the CloudFormation stack output.
  10. Choose Next.


  11. In the Connection Properties section, for Name, enter glue-snowflake-connection.
  12. Choose Next.


  13. Choose Create connection. You can verify the result with the sketch that follows.
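
To confirm the connection was created with the expected network placement, you can inspect it through the AWS Glue API. A minimal boto3 sketch:

import boto3

glue = boto3.client("glue")

# Look up the connection created above and print its network settings.
connection = glue.get_connection(Name="glue-snowflake-connection")["Connection"]
requirements = connection.get("PhysicalConnectionRequirements", {})

print("Type:", connection.get("ConnectionType"))
print("Subnet:", requirements.get("SubnetId"))
print("Security groups:", requirements.get("SecurityGroupIdList"))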

Create an AWS Glue job

You’re now ready to define the AWS Glue job using the Snowflake connection. To create an AWS Glue job to read from Snowflake, complete the following steps:

  1. On the AWS Glue console, under ETL jobs in the navigation pane, choose Visual ETL.


  2. Choose the Job details tab.
  3. For Name, enter a name, for example, Pricing Summary Report Job.
  4. For Description, enter a meaningful description for the job.
  5. For IAM Role, choose the role that has access to the target S3 location where the job is writing to and the source location from where it’s loading the Snowflake data, and also to run the AWS Glue job. You can find this role in your CloudFormation stack output, named blog-glue-snowflake-GlueServiceRole-*.
  6. Use the default options for Type, Glue version, Language, Worker type, Number of workers, Number of retries, and Job timeout.
  7. For Job bookmark, choose Disable.
  8. Choose Save to save the job.


  9. On the Visual tab, choose Add nodes.

  10. For Sources, choose Snowflake.

  11. Choose Data source – Snowflake in the AWS Glue Studio canvas.
  12. For Name, enter Snowflake_Pricing_Summary.
  13. For Snowflake connection, choose glue-snowflake-connection.
  14. For Snowflake source, select Enter a custom query.
  15. For Database, enter snowflake_sample_data.
  16. For Snowflake query, add the following Snowflake query:
SELECT l_returnflag
    , l_linestatus
    , Sum(l_quantity) AS sum_qty
    , Sum(l_extendedprice) AS sum_base_price
    , Sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price
    , Sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge
    , Avg(l_quantity) AS avg_qty
    , Avg(l_extendedprice) AS avg_price
    , Avg(l_discount) AS avg_disc
    , Count(*) AS count_order
FROM tpch_sf1.lineitem
WHERE l_shipdate <= Dateadd(day, - 90, To_date('1998-12-01'))
GROUP BY l_returnflag
    , l_linestatus
ORDER BY l_returnflag
    , l_linestatus;

The Pricing Summary Report provides a summary pricing report for all line items shipped as of a given date. The date is within 60–120 days of the greatest ship date contained in the database. The query lists totals for extended price, discounted extended price, discounted extended price plus tax, average quantity, average extended price, and average discount. These aggregates are grouped by RETURNFLAG and LINESTATUS, and listed in ascending order of RETURNFLAG and LINESTATUS. A count of the number of line items in each group is included.

  17. For Custom Snowflake properties, specify Key as sfSchema and Value as tpch_sf1.
  18. Choose Save.


Next, you add the destination as an S3 bucket.

  1. On the Visual tab, choose Add nodes.
  2. For Targets, choose Amazon S3.

  3. Choose Data target – S3 bucket in the AWS Glue Studio canvas.
  4. For Name, enter S3_Pricing_Summary.
  5. For Node parents, select Snowflake_Pricing_Summary.
  6. For Format, select Parquet.
  7. For S3 Target Location, enter s3://<YourBucketName>/pricing_summary_report/ (use the name of your bucket).
  8. For Data Catalog update options, select Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions.
  9. For Database, choose db_blog_glue_snowflake.
  10. For Table name, enter tb_pricing_summary.
  11. Choose Save.
  12. Choose Run to run the job, and monitor its status on the Runs tab.

You have successfully completed the steps to create an AWS Glue job that reads data from Snowflake and loads the results into an S3 bucket using a secure connectivity pattern. Eventually, if you want to transform the data before loading it into Amazon S3, you can use the AWS Glue transformations available in AWS Glue Studio. Using AWS Glue transformations is crucial when creating an AWS Glue job because they enable efficient data cleansing, enrichment, and restructuring, making sure the data is in the desired format and quality for downstream processes. Refer to Editing AWS Glue managed data transform nodes for more information.
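
If you author the same job in script mode instead of the visual editor, the code follows a familiar shape: read from Snowflake through the connection, then write Parquet to Amazon S3 while updating the Data Catalog. The following is a simplified sketch, not the exact code AWS Glue Studio generates; the Snowflake option names (connectionName, sfDatabase, sfSchema, query) follow the Spark Snowflake connector conventions used elsewhere in this post, so verify them against the AWS Glue documentation for your Glue version.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from Snowflake through the AWS Glue connection created earlier.
# Use the full Pricing Summary query from this post; it is abbreviated here.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="snowflake",
    connection_options={
        "connectionName": "glue-snowflake-connection",
        "sfDatabase": "snowflake_sample_data",
        "sfSchema": "tpch_sf1",
        "query": "SELECT l_returnflag, l_linestatus, Sum(l_quantity) AS sum_qty "
                 "FROM tpch_sf1.lineitem GROUP BY l_returnflag, l_linestatus",
    },
    transformation_ctx="Snowflake_Pricing_Summary",
)

# Write Parquet to S3 and create/update the table in the Data Catalog.
sink = glue_context.getSink(
    path="s3://<YourBucketName>/pricing_summary_report/",
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    enableUpdateCatalog=True,
    transformation_ctx="S3_Pricing_Summary",
)
sink.setCatalogInfo(
    catalogDatabase="db_blog_glue_snowflake", catalogTableName="tb_pricing_summary"
)
sink.setFormat("glueparquet")
sink.writeFrame(source)

job.commit()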

Validate the results

After the job is complete, you can validate the output of the ETL job run in Athena, a serverless interactive analytics service. To validate the output, complete the following steps:

  1. On the Athena console, choose Launch query editor.
  2. For Workgroup, choose blog-workgroup.
  3. If the message “All queries run in the Workgroup, blog-workgroup, will use the following settings:” is displayed, choose Acknowledge.
  4. For Database, choose db_blog_glue_snowflake.
  5. For Query, enter the following statement:
SELECT l_returnflag
    , l_linestatus
    , sum_qty
    , sum_base_price
FROM db_blog_glue_snowflake.tb_pricing_summary

  6. Choose Run.

You have successfully validated your data for the AWS Glue job Pricing Summary Report Job.
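
You can also run the same validation query programmatically. The following boto3 sketch starts the query in the blog-workgroup workgroup and polls until it finishes; result handling is kept minimal.

import time

import boto3

athena = boto3.client("athena")

# Start the validation query in the workgroup created by the stack.
execution_id = athena.start_query_execution(
    QueryString=(
        "SELECT l_returnflag, l_linestatus, sum_qty, sum_base_price "
        "FROM db_blog_glue_snowflake.tb_pricing_summary"
    ),
    WorkGroup="blog-workgroup",
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

print("Query state:", state)
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    print(f"Returned {len(rows) - 1} rows")  # the first row is the header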

Clean up

To clean up your resources, complete the following tasks:

  1. Delete the AWS Glue job Pricing Summary Report Job.
  2. Delete the AWS Glue connection glue-snowflake-connection.
  3. Stop any AWS Glue interactive sessions.
  4. Delete content from the S3 bucket blog-glue-snowflake-*.
  5. Delete the CloudFormation stack blog-glue-snowflake.

Conclusion

Using the native Snowflake connector in AWS Glue provides an efficient and secure way to integrate data from Snowflake into your data pipelines on AWS. By following the steps outlined in this post, you can establish a private connectivity channel between AWS Glue and your Snowflake account using PrivateLink, Amazon VPC, security groups, and Secrets Manager.

This architecture allows you to read data from and write data to Snowflake tables directly from AWS Glue jobs running on Spark. The secure connectivity pattern prevents data transfers over the public internet, enhancing data privacy and security.

Combining AWS data integration services like AWS Glue with data platforms like Snowflake allows you to build scalable, secure data lakes and pipelines to power analytics, BI, data science, and ML use cases.

In summary, the native Snowflake connector and private connectivity model outlined here provide a performant, secure way to include Snowflake data in AWS big data workflows. This unlocks scalable analytics while maintaining data governance, compliance, and access control. For more information on AWS Glue, visit AWS Glue.


About the Authors

Caio Sgaraboto Montovani is a Sr. Specialist Solutions Architect, Data Lake and AI/ML within AWS Professional Services, developing scalable solutions according to customer needs. His vast experience has helped customers in different industries such as life sciences and healthcare, retail, banking, and aviation build solutions in data analytics, machine learning, and generative AI. He is passionate about rock and roll and cooking, and loves to spend time with his family.

Kartikay Khator is a Solutions Architect within Global Life Sciences at AWS, where he dedicates his efforts to developing innovative and scalable solutions that cater to the evolving needs of customers. His expertise lies in harnessing the capabilities of AWS analytics services. Extending beyond his professional pursuits, he finds joy and fulfillment in the world of running and hiking. Having already completed two marathons, he is currently preparing for his next marathon challenge.

Navnit Shukla, an AWS Specialist Solutions Architect specializing in Analytics, is passionate about helping clients uncover valuable insights from their data. Leveraging his expertise, he develops inventive solutions that empower businesses to make informed, data-driven decisions. Notably, Navnit is the accomplished author of the book “Data Wrangling on AWS,” showcasing his expertise in the field.

Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest Amazon MWAA and AWS Glue features and news!

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped technology companies design and implement data analytics solutions and products.
