Tuesday, September 16, 2025

Break down data silos and seamlessly query Iceberg tables in Amazon SageMaker from Snowflake


Organizations often struggle to unify their data ecosystems across multiple platforms and services. The connectivity between Amazon SageMaker and Snowflake's AI Data Cloud offers a powerful solution to this challenge, allowing businesses to take advantage of the strengths of both environments while maintaining a cohesive data strategy.

In this post, we demonstrate how you can break down data silos and enhance your analytical capabilities by querying Apache Iceberg tables in the lakehouse architecture of SageMaker directly from Snowflake. With this capability, you can access and analyze data stored in Amazon Simple Storage Service (Amazon S3) through the AWS Glue Data Catalog using an AWS Glue Iceberg REST endpoint, all secured by AWS Lake Formation, without the need for complex extract, transform, and load (ETL) processes or data duplication. You can also automate table discovery and refresh using Snowflake catalog-linked databases for Iceberg. In the following sections, we show how to set up this integration so Snowflake users can seamlessly query and analyze data stored in AWS, thereby improving data accessibility, reducing redundancy, and enabling more comprehensive analytics across your entire data ecosystem.

Business use cases and key benefits

The capability to query Iceberg tables in SageMaker from Snowflake delivers significant value across multiple industries:

  • Financial services – Enhance fraud detection through unified analysis of transaction data and customer behavior patterns
  • Healthcare – Improve patient outcomes through integrated access to clinical, claims, and research data
  • Retail – Improve customer retention rates by connecting sales, inventory, and customer behavior data for personalized experiences
  • Manufacturing – Boost production efficiency through unified sensor and operational data analytics
  • Telecommunications – Reduce customer churn with comprehensive analysis of network performance and customer usage data

Key benefits of this capability include:

  • Accelerated decision-making – Reduce time to insight through integrated data access across platforms
  • Cost optimization – Lower costs by querying data directly in storage without the need for ingestion
  • Improved data fidelity – Reduce data inconsistencies by establishing a single source of truth
  • Enhanced collaboration – Improve cross-functional productivity through simplified data sharing between data scientists and analysts

By using the lakehouse architecture of SageMaker with Snowflake's serverless and zero-tuning computational power, you can break down data silos, enabling comprehensive analytics and democratizing data access. This integration supports a modern data architecture that prioritizes flexibility, security, and analytical performance, ultimately driving faster, more informed decision-making across the enterprise.

Solution overview

The following diagram shows the architecture for catalog integration between Snowflake and Iceberg tables in the lakehouse.

Catalog integration to query Iceberg tables in S3 bucket using Iceberg REST Catalog (IRC) with credential vending

The workflow consists of the following components:

  • Data storage and management:
    • Amazon S3 serves as the primary storage layer, hosting the Iceberg table data
    • The Data Catalog maintains the metadata for these tables
    • Lake Formation provides credential vending
  • Authentication flow:
    • Snowflake initiates queries using a catalog integration configuration
    • Lake Formation vends temporary credentials through AWS Security Token Service (AWS STS)
    • These credentials are automatically refreshed based on the configured refresh interval
  • Query flow:
    • Snowflake users submit queries against the mounted Iceberg tables
    • The AWS Glue Iceberg REST endpoint processes these requests
    • Query execution uses Snowflake's compute resources while reading directly from Amazon S3
    • Results are returned to Snowflake users while maintaining all security controls

There are four patterns to query Iceberg tables in SageMaker from Snowflake:

  • Iceberg tables in an S3 bucket using an AWS Glue Iceberg REST endpoint and Snowflake Iceberg REST catalog integration, with credential vending from Lake Formation
  • Iceberg tables in an S3 bucket using an AWS Glue Iceberg REST endpoint and Snowflake Iceberg REST catalog integration, using Snowflake external volumes to Amazon S3 data storage
  • Iceberg tables in an S3 bucket using AWS Glue API catalog integration, also using Snowflake external volumes to Amazon S3
  • Amazon S3 Tables using Iceberg REST catalog integration with credential vending from Lake Formation

In this post, we implement the first of these four access patterns, using catalog integration for the AWS Glue Iceberg REST endpoint with Signature Version 4 (SigV4) authentication in Snowflake.
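SigV4 is the standard AWS request-signing scheme: the caller derives a signing key from its secret key through a chain of HMAC-SHA256 operations scoped to a date, Region, and service, then signs each request with it. Snowflake performs this signing internally when it calls the Glue Iceberg REST endpoint; the sketch below shows only the standard key-derivation step for illustration, with placeholder inputs.

```python
import hashlib
import hmac

def sigv4_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive a SigV4 signing key via the standard HMAC-SHA256 chain."""
    def _hmac(key: bytes, msg: str) -> bytes:
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")

# Placeholder inputs -- in this solution, Snowflake derives the key from the
# temporary credentials vended by Lake Formation; you never do this by hand.
key = sigv4_signing_key("<temporary-secret-key>", "20250916", "us-east-1", "glue")
print(len(key))  # HMAC-SHA256 output is 32 bytes
```

Because the key is scoped to a date, Region, and service, a leaked signing key cannot be reused against other services or on other days.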

Prerequisites

You must have the following prerequisites:

The solution takes approximately 30–45 minutes to set up. Cost varies based on data volume and query frequency. Use the AWS Pricing Calculator for specific estimates.

Create an IAM role for Snowflake

To create an IAM role for Snowflake, you first create a policy for the role:

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. Choose the JSON editor and enter the following policy (provide your AWS Region and account ID), then choose Next.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowGlueCatalogTableAccess",
            "Effect": "Allow",
            "Action": [
                "glue:GetCatalog",
                "glue:GetCatalogs",
                "glue:GetPartitions",
                "glue:GetPartition",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:UpdateTable"
            ],
            "Resource": [
                "arn:aws:glue:<region>:<account-id>:catalog",
                "arn:aws:glue:<region>:<account-id>:database/iceberg_db",
                "arn:aws:glue:<region>:<account-id>:table/iceberg_db/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess"
            ],
            "Resource": "*"
        }
    ]
}

  4. Enter iceberg-table-access as the policy name.
  5. Choose Create policy.
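If you manage IAM in code rather than through the console, the same policy document can be generated programmatically. This is a minimal sketch, assuming the same Region and account ID placeholders as the JSON above; the helper name is hypothetical.

```python
import json

def iceberg_access_policy(region: str, account_id: str, database: str = "iceberg_db") -> str:
    """Render the iceberg-table-access policy document for a given Region/account."""
    arn_prefix = f"arn:aws:glue:{region}:{account_id}"
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowGlueCatalogTableAccess",
                "Effect": "Allow",
                "Action": [
                    "glue:GetCatalog", "glue:GetCatalogs",
                    "glue:GetPartitions", "glue:GetPartition",
                    "glue:GetDatabase", "glue:GetDatabases",
                    "glue:GetTable", "glue:GetTables", "glue:UpdateTable",
                ],
                "Resource": [
                    f"{arn_prefix}:catalog",
                    f"{arn_prefix}:database/{database}",
                    f"{arn_prefix}:table/{database}/*",
                ],
            },
            {
                # Lake Formation credential vending requires this action
                "Effect": "Allow",
                "Action": ["lakeformation:GetDataAccess"],
                "Resource": "*",
            },
        ],
    }
    return json.dumps(policy, indent=4)

print(iceberg_access_policy("us-east-1", "123456789012"))
```

The rendered document can then be passed to your infrastructure-as-code tool of choice instead of being pasted into the console editor.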

Now you can create the role and attach the policy you created.

  1. Choose Roles in the navigation pane.
  2. Choose Create role.
  3. Choose AWS account.
  4. Under Options, select Require external ID and enter an external ID of your choice.
  5. Choose Next.
  6. Choose the policy you created (iceberg-table-access).
  7. Enter snowflake_access_role as the role name.
  8. Choose Create role.

Configure Lake Formation access controls

To configure your Lake Formation access controls, first set up the application integration:

  1. Sign in to the Lake Formation console as a data lake administrator.
  2. Choose Administration in the navigation pane.
  3. Select Application integration settings.
  4. Enable Allow external engines to access data in Amazon S3 locations with full table access.
  5. Choose Save.

Now you can grant permissions to the IAM role.

  1. Choose Data permissions in the navigation pane.
  2. Choose Grant.
  3. Configure the following settings:
    1. For Principals, select IAM users and roles and choose snowflake_access_role.
    2. For Resources, select Named Data Catalog resources.
    3. For Catalog, choose your AWS account ID.
    4. For Database, choose iceberg_db.
    5. For Table, choose customer.
    6. For Permissions, select SUPER.
  4. Choose Grant.

SUPER access is required for mounting the Iceberg table in Amazon S3 as a Snowflake table.
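The console grant above can also be scripted with the AWS CLI's `aws lakeformation grant-permissions` command. The sketch below only builds and prints the command rather than running it, since it needs data lake administrator credentials; the helper name is hypothetical, and it assumes the console's SUPER permission maps to ALL in the Lake Formation API.

```python
import json
import shlex

def grant_super_command(account_id: str, role_arn: str,
                        database: str = "iceberg_db", table: str = "customer") -> str:
    """Build (but do not execute) the CLI equivalent of the console grant."""
    resource = {
        "Table": {
            "CatalogId": account_id,
            "DatabaseName": database,
            "Name": table,
        }
    }
    args = [
        "aws", "lakeformation", "grant-permissions",
        "--principal", f"DataLakePrincipalIdentifier={role_arn}",
        # Assumption: SUPER in the console corresponds to ALL in the API
        "--permissions", "ALL",
        "--resource", json.dumps(resource),
    ]
    return " ".join(shlex.quote(a) for a in args)

print(grant_super_command("123456789012",
                          "arn:aws:iam::123456789012:role/snowflake_access_role"))
```

Printing the command first lets you review exactly what will be granted before running it against your account.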

Register the S3 data lake location

Complete the following steps to register the S3 data lake location:

  1. As data lake administrator on the Lake Formation console, choose Data lake locations in the navigation pane.
  2. Choose Register location.
  3. Configure the following:
    1. For S3 path, enter the S3 path to the bucket where you'll store your data.
    2. For IAM role, choose LakeFormationLocationRegistrationRole.
    3. For Permission mode, choose Lake Formation.
  4. Choose Register location.

Set up the Iceberg REST integration in Snowflake

Complete the following steps to set up the Iceberg REST integration in Snowflake:

  1. Log in to Snowflake as an admin user.
  2. Execute the following SQL command (provide your Region, account ID, and the external ID that you provided during IAM role creation):
CREATE OR REPLACE CATALOG INTEGRATION glue_irc_catalog_int
CATALOG_SOURCE = ICEBERG_REST
TABLE_FORMAT = ICEBERG
CATALOG_NAMESPACE = 'iceberg_db'
REST_CONFIG = (
    CATALOG_URI = 'https://glue.<region>.amazonaws.com/iceberg'
    CATALOG_API_TYPE = AWS_GLUE
    CATALOG_NAME = '<account-id>'
    ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS
)
REST_AUTHENTICATION = (
    TYPE = SIGV4
    SIGV4_IAM_ROLE = 'arn:aws:iam::<account-id>:role/snowflake_access_role'
    SIGV4_SIGNING_REGION = '<region>'
    SIGV4_EXTERNAL_ID = '<external-id>'
)
REFRESH_INTERVAL_SECONDS = 120
ENABLED = TRUE;
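If you maintain several environments (dev, test, prod) in different Regions or accounts, it can help to template this statement rather than edit it by hand each time. A minimal sketch, assuming the same parameter placeholders as the statement above; the helper name is hypothetical.

```python
def catalog_integration_sql(region: str, account_id: str, external_id: str,
                            name: str = "glue_irc_catalog_int",
                            namespace: str = "iceberg_db") -> str:
    """Render the CREATE CATALOG INTEGRATION statement for one environment."""
    return f"""\
CREATE OR REPLACE CATALOG INTEGRATION {name}
CATALOG_SOURCE = ICEBERG_REST
TABLE_FORMAT = ICEBERG
CATALOG_NAMESPACE = '{namespace}'
REST_CONFIG = (
    CATALOG_URI = 'https://glue.{region}.amazonaws.com/iceberg'
    CATALOG_API_TYPE = AWS_GLUE
    CATALOG_NAME = '{account_id}'
    ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS
)
REST_AUTHENTICATION = (
    TYPE = SIGV4
    SIGV4_IAM_ROLE = 'arn:aws:iam::{account_id}:role/snowflake_access_role'
    SIGV4_SIGNING_REGION = '{region}'
    SIGV4_EXTERNAL_ID = '{external_id}'
)
REFRESH_INTERVAL_SECONDS = 120
ENABLED = TRUE;"""

print(catalog_integration_sql("us-east-1", "123456789012", "my-external-id"))
```

The rendered statement can then be executed through your Snowflake client or deployment tooling of choice.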

  3. Execute the following SQL command and retrieve the value for API_AWS_IAM_USER_ARN:

DESCRIBE CATALOG INTEGRATION glue_irc_catalog_int;

  4. On the IAM console, update the trust relationship for snowflake_access_role with the value for API_AWS_IAM_USER_ARN:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "<API_AWS_IAM_USER_ARN>"
                ]
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": [
                        "<external-id>"
                    ]
                }
            }
        }
    ]
}
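Because the trust policy must be updated again whenever the catalog integration is recreated (the API_AWS_IAM_USER_ARN value can change), it is convenient to render it from the two inputs. A minimal sketch; the helper name and example values are hypothetical.

```python
import json

def snowflake_trust_policy(api_aws_iam_user_arn: str, external_id: str) -> str:
    """Render the trust policy for snowflake_access_role from the values
    returned by DESCRIBE CATALOG INTEGRATION."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {"AWS": [api_aws_iam_user_arn]},
                "Action": "sts:AssumeRole",
                # The external ID condition prevents the confused-deputy problem
                "Condition": {
                    "StringEquals": {"sts:ExternalId": [external_id]}
                },
            }
        ],
    }
    return json.dumps(policy, indent=4)

print(snowflake_trust_policy("arn:aws:iam::111122223333:user/example-snowflake-user",
                             "my-external-id"))
```

The output can be pasted into the role's trust relationship editor, or applied with `aws iam update-assume-role-policy`.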

  5. Verify the catalog integration:

SELECT SYSTEM$VERIFY_CATALOG_INTEGRATION('glue_irc_catalog_int');

  6. Mount the S3 table as a Snowflake table:
CREATE OR REPLACE ICEBERG TABLE s3iceberg_customer
 CATALOG = 'glue_irc_catalog_int'
 CATALOG_NAMESPACE = 'iceberg_db'
 CATALOG_TABLE_NAME = 'customer'
 AUTO_REFRESH = TRUE;

Query the Iceberg table from Snowflake

To test the configuration, log in to Snowflake as an admin user and run the following sample query:

SELECT * FROM s3iceberg_customer LIMIT 10;

Clean up

To clean up your resources, complete the following steps:

  1. Delete the database and table in AWS Glue.
  2. Drop the Iceberg table and catalog integration in Snowflake:
DROP ICEBERG TABLE s3iceberg_customer;
DROP CATALOG INTEGRATION glue_irc_catalog_int;

Make sure all resources are properly cleaned up to avoid unexpected charges.

Conclusion

In this post, we demonstrated how to establish a secure and efficient connection between your Snowflake environment and SageMaker to query Iceberg tables in Amazon S3. This capability can help your organization maintain a single source of truth while letting teams use their preferred analytics tools, ultimately breaking down data silos and enhancing collaborative analysis.

To further explore and implement this solution in your environment, consider the following resources:

  • Technical documentation:
  • Related blog posts:

These resources can help you implement and optimize this integration pattern for your specific use case. As you begin this journey, remember to start small, validate your architecture with test data, and gradually scale your implementation based on your organization's needs.


About the authors

Nidhi Gupta

Nidhi is a Senior Partner Solutions Architect at AWS, specializing in data and analytics. She helps customers and partners build and optimize Snowflake workloads on AWS. Nidhi has extensive experience leading production releases and deployments, with a focus on data, AI, ML, generative AI, and advanced analytics.

Andries Engelbrecht

Andries is a Principal Partner Solutions Engineer at Snowflake working with AWS. He supports product and service integrations, as well as the development of joint solutions with AWS. Andries has over 25 years of experience in the field of data and analytics.
