Thursday, September 18, 2025

Automate data lineage in Amazon SageMaker using AWS Glue Crawlers supported data sources


The next generation of Amazon SageMaker is the center for all your data, analytics, and AI. Bringing together widely adopted Amazon Web Services (AWS) machine learning (ML) and analytics capabilities, it delivers an integrated experience for analytics and AI with unified access to all your data. From Amazon SageMaker Unified Studio, a single data and AI development environment, you can access your data and use a suite of powerful tools for data processing, SQL analytics, model development, training and inference, and generative AI development.

With data lineage, now part of Amazon SageMaker Catalog, you can centralize lineage metadata of your data assets in a single place. You can track the flow of data over time, gaining a clear understanding of where it originated, how it has changed, and how it is used across the business. By providing this level of transparency, data lineage helps data consumers gain trust that the data is correct and compliant for their use cases. With data lineage captured at the table, column, and job level, data producers can conduct impact analysis of changes in their data pipelines and respond to data issues when needed, for example, when a column in the resulting dataset is missing the quality required by the business.

Data lineage is a powerful tool that can transform how organizations understand and manage their data flows. In this post, we explore its real-world impact through the lens of an ecommerce company striving to boost their bottom line.

To illustrate this practical application, we walk you through how you can use the prebuilt integration between SageMaker Catalog and AWS Glue crawlers to automatically capture lineage for data assets stored in Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB. Using this workflow, you can capture lineage automatically from additional data sources using AWS Glue crawlers. Refer to the Data lineage support matrix in the SageMaker Unified Studio User Guide for supported sources. We also use SageMaker Unified Studio to navigate these data assets and learn about their origin, transformations, and dependencies, thanks to the lineage metadata captured using the AWS Glue crawlers.

Key features of the SageMaker Catalog lineage graph

In SageMaker Unified Studio, you can explore and discover the data assets of your organization suited for your use case. As you dive into these data assets, you can learn more about their business context, schema, quality, and lineage. When you decide to work with a subset of these assets, you can subscribe to them in a self-service fashion and start working with them. For more detail, visit Data discovery, subscription, and consumption in the SageMaker Unified Studio User Guide.

SageMaker Unified Studio provides a visual lineage graph that shows how a data asset has evolved from its source through transformations to its final state. This helps data scientists, engineers, and analysts answer key questions such as:

  • Where did this data come from?
  • What transformations has it gone through?
  • Which downstream assets will be impacted by a change?

With this level of visibility, teams can perform faster impact analysis, find the root cause of data quality issues, and make sure models are built on trusted data. It also supports better collaboration so users can confidently use and share data across the organization. The following screenshot shows how SageMaker Unified Studio visualizes data lineage, making it easy to trace data flow and understand dependencies.

  • Column-level lineage – You can expand column-level lineage when available in dataset nodes. This automatically shows relationships with upstream or downstream dataset nodes if source column information is available.
  • Column search – If the dataset has more than 10 columns, the node presents pagination to navigate to columns not initially presented. To quickly view a particular column, you can search on the dataset node, which then lists only the searched column.
  • Details pane – Each lineage node captures and displays the following details:
    • Every dataset node has three tabs: LINEAGE, SCHEMA, and HISTORY. The HISTORY tab lists the different versions of lineage events captured for that node.
    • The job node has a details pane to display job details with the tabs Job info and History. The details pane also captures queries or expressions run as part of the job.
  • View dataset nodes only – If you want to filter out the job nodes, you can choose the open view control icon in the graph viewer and toggle the Display dataset nodes only option, which removes all the job nodes from the graph and lets you navigate only the dataset nodes.
  • Version tabs – All lineage nodes in Amazon DataZone data lineage have versioning, captured as history, based on the lineage events captured. You can view lineage at a specific timestamp, which opens a new tab on the lineage page to help compare or contrast between the different timestamps.

You can try some of these features as you explore the data assets of this post. To learn more about data lineage in SageMaker, we encourage you to dive deep into Data lineage in Amazon SageMaker Unified Studio.

Solution overview

Consider a scenario where an ecommerce company aims to optimize conversion rates and enhance customer experience by gaining deeper insights into the customer journey. They need to connect the dots between user interactions and actual purchases, but with data scattered across multiple sources, where do they begin? This is where data lineage becomes invaluable. To perform their analysis, they need data from two primary sources:

  • Clickstream data stored in Amazon S3 (in JSON or Parquet format)
  • Transactional order data stored as items in Amazon DynamoDB

To make these datasets discoverable across the enterprise, you need to:

  1. Create a project in SageMaker Unified Studio that will be used to source and manage the datasets
  2. Enable data lineage capture in the SageMaker Unified Studio project
  3. Set up the resources for this use case, which include an AWS Glue data source (set up in SageMaker Unified Studio) and an AWS Glue crawler (set up in AWS Glue)
  4. Run the AWS Glue crawler to catalog the datasets in AWS Glue Data Catalog
  5. Source the metadata of the data assets into the SageMaker Catalog by running the data source
  6. Use SageMaker Unified Studio to navigate through the lineage of the data assets and visualize their origin
  7. Understand how schema evolution is captured in the data asset’s lineage

Prerequisites

To complete the steps in this post, you need a SageMaker Unified Studio domain already deployed in your AWS account. To get started quickly in a testing environment, we suggest creating your SageMaker domain using the quick setup option as explained in Create an Amazon SageMaker Unified Studio domain – quick setup.

Solution steps

To capture data lineage for AWS Glue tables managed with AWS Glue crawlers using SageMaker Unified Studio, complete the steps in the following sections.

Set up a SageMaker project with SQL capability

In SageMaker Unified Studio, a project profile defines an uber template for projects in your Amazon SageMaker unified domain. By setting up a project with the right tooling (project profile), you provision resources you can use to work with data, which might include cataloging it in SageMaker, transforming it into new data assets, analyzing it to drive business value, or even using it for ML or AI applications.

To demonstrate data lineage effectively, we use the SageMaker SQL analytics project profile for a streamlined setup. Although this profile offers comprehensive data analytics capabilities, we focus specifically on two key components:

  • AWS Glue database – A lakehouse for storing and managing technical metadata
  • Data source job – Automatically collects and tracks metadata into SageMaker Catalog

We’ve chosen this profile to bypass complex manual configurations so we can focus on the core concepts of data lineage.

To create a new project in your SageMaker domain using the SQL analytics project profile, follow the steps detailed in SQL analytics project profile. Keep all default configurations when creating the project.

After creating your project in SageMaker Unified Studio, you unlock data lineage capabilities that make tracking and understanding your data flows intuitive. Through the data sourcing feature, you can track how data moves from source to the AWS Glue database. This visibility becomes particularly valuable when debugging data issues: you can quickly trace data back to its source, understand how changes impact downstream processes, and identify affected analyses or reports. Next, populate the AWS Glue database with sample data to observe these features in action and see how they can streamline your data operations.

For further guidance on how to access the details of the new SageMaker project, refer to Get project details. After you access the data source details, in the Database name field, take note of the AWS Glue database name associated with the SageMaker project.

Enable data lineage capture in the SageMaker project’s data source

To enable lineage capture, follow these steps:

  1. Expand the Actions menu, then choose Edit data source.
  2. Go to the connections and select Import data lineage to configure lineage capture from the source, as shown in the following screenshot.
  3. Make other changes to the data source fields as desired, then choose Save.

Enabling lineage makes sure the data source job captures lineage in the next run.

Deploy resources for the use case

Follow these steps:

  1. To deploy the resources required for this post, download the AWS CloudFormation template from amazon-datazone-examples in the AWS Samples GitHub repository. Deploy it in your AWS account.

For further guidance on how to deploy a CloudFormation stack, refer to Create a stack from the CloudFormation console. You need to provide a Stack name and the GlueDatabaseName, which is the name of the AWS Glue database associated with the project of your SageMaker domain, as shown in the following screenshot.

  2. Choose Next.

The template deploys the following resources:

  • An S3 bucket with a sample file of clickstream data. The bucket name and location of the file follow the path pattern s3://ecomm-analytics-<ACCOUNT_ID>-<REGION>/clickstream/<YYYY>/<MM>/<DD>/data.json. The file contains a sample record with the following structure:
{
    "session_id": "abc123",
    "user_id": "u789",
    "event_type": "product_view",
    "product_id": "prod456",
    "timestamp": "2025-06-04T09:23:12Z"
}

  • A DynamoDB table with a sample item of order data (transactions). The table is named OrderTransactionTable. The sample item has the following structure:
{
    "order_id": "ord789",
    "user_id": "u789",
    "product_id": "prod456",
    "order_total": 79.99,
    "order_timestamp": "2025-06-04T09:27:10Z"
}

  • An AWS Glue crawler configured to crawl the S3 bucket and DynamoDB table deployed as part of the stack and store the metadata in the AWS Glue database associated with the SageMaker project. You can access the crawler’s details on the AWS console, as shown in the following screenshot.
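The template's contents are not shown in this post, so the following Python sketch only illustrates what a crawler covering both sources could look like if you defined it yourself. It builds the keyword arguments for the AWS Glue CreateCrawler API (`create_crawler` in boto3). The crawler name and role name are taken from elsewhere in this post; the helper names, the account ID, and the database name are hypothetical, and the actual template may configure the crawler differently.

```python
from datetime import date


def clickstream_day_prefix(account_id: str, region: str, day: date) -> str:
    """Build the S3 prefix for one day of clickstream data, following the post's path pattern."""
    bucket = f"ecomm-analytics-{account_id}-{region}"
    return f"s3://{bucket}/clickstream/{day:%Y}/{day:%m}/{day:%d}/"


def crawler_request(account_id: str, region: str, glue_database: str) -> dict:
    """Keyword arguments for glue.create_crawler (hypothetical helper, not from the template)."""
    return {
        "Name": "OrderTransactionsCrawler",
        "Role": f"arn:aws:iam::{account_id}:role/glue-crawler-role",
        # Tables discovered by the crawler are written to this database.
        "DatabaseName": glue_database,
        "Targets": {
            # One S3 target for the clickstream prefix ...
            "S3Targets": [{"Path": f"s3://ecomm-analytics-{account_id}-{region}/clickstream/"}],
            # ... and one DynamoDB target, identified by table name.
            "DynamoDBTargets": [{"Path": "OrderTransactionTable"}],
        },
    }
```

With boto3 installed and credentials configured, the request could be submitted with `boto3.client("glue").create_crawler(**crawler_request(...))`. This is only needed if you are not using the provided stack.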

Run the AWS Glue crawler

The AWS Glue crawler deployed in the previous step lets you capture metadata from the two data sources, Amazon S3 and DynamoDB, and store it in AWS Glue Data Catalog, specifically in the database associated with the SageMaker project. After the metadata is stored, it is available to SageMaker.

Before running the crawler, you need to grant AWS Lake Formation permissions to the IAM role that the AWS Glue crawler uses to interact with your data source and target AWS Glue database. The following command grants the permissions needed for the crawler to store metadata in the AWS Glue database of the SageMaker project.

To invoke this command, we recommend using AWS CloudShell on the AWS console as explained in AWS CloudShell Concepts. Update the <REGION>, <ACCOUNT_ID>, and <GLUE_DATABASE_NAME> placeholders with the correct values for your AWS Region, AWS account ID, and name of the AWS Glue database associated with the SageMaker project.

aws lakeformation grant-permissions \
  --region <REGION> \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<ACCOUNT_ID>:role/glue-crawler-role \
  --permissions CREATE_TABLE \
  --resource '{ "Database": { "Name": "<GLUE_DATABASE_NAME>" } }'
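If you prefer to script this step, the same grant can be expressed in Python. The sketch below builds the arguments for the Lake Formation GrantPermissions API (`grant_permissions` in boto3). The helper name and the example values are hypothetical; the role name `glue-crawler-role` comes from the command above.

```python
def grant_request(account_id: str, glue_database: str, role_name: str = "glue-crawler-role") -> dict:
    """Keyword arguments for lakeformation.grant_permissions (hypothetical helper)."""
    return {
        # The IAM role that the crawler assumes.
        "Principal": {
            "DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:role/{role_name}"
        },
        # Database-level resource: the Glue database of the SageMaker project.
        "Resource": {"Database": {"Name": glue_database}},
        # CREATE_TABLE lets the crawler register new tables in that database.
        "Permissions": ["CREATE_TABLE"],
    }
```

Assuming boto3 is available, the call would be `boto3.client("lakeformation", region_name="<REGION>").grant_permissions(**grant_request("<ACCOUNT_ID>", "<GLUE_DATABASE_NAME>"))`.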
  

Next, run the AWS Glue crawler on the AWS console. After the crawler successfully finishes, two new tables, clickstream and ordertransactiontable, are created in the AWS Glue database associated with the SageMaker project. Refer to Viewing crawler results and details to learn more about AWS Glue crawler results.
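As an alternative to the console, a crawler run can be started and monitored programmatically. The following is a minimal sketch, assuming a client that behaves like `boto3.client("glue")`. The function name, polling interval, and timeout are illustrative choices, not part of the post's solution.

```python
import time


def run_crawler_and_wait(glue, name: str, poll_seconds: float = 15.0,
                         timeout_seconds: float = 900.0) -> str:
    """Start a Glue crawler and wait until it returns to the READY state.

    `glue` is expected to behave like boto3.client("glue").
    Returns the status of the last crawl reported by Glue (for example "SUCCEEDED").
    """
    glue.start_crawler(Name=name)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = glue.get_crawler(Name=name)["Crawler"]
        # A crawler moves through RUNNING and STOPPING before going back to READY.
        if crawler["State"] == "READY":
            return crawler.get("LastCrawl", {}).get("Status", "UNKNOWN")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Crawler {name} did not finish within {timeout_seconds} seconds")
```

After a successful run, `glue.get_tables(DatabaseName="<GLUE_DATABASE_NAME>")` should list the two new tables.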

Source metadata from the AWS Glue database into SageMaker

To source metadata from data assets in the AWS Glue database, including their lineage, into SageMaker, use the data source that was deployed as part of the SageMaker project creation.

  1. To run the data source, go to the data source details page.
  2. Choose Run. (Data sources can also be scheduled to run; for this demonstration, we trigger a manual run.)
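The manual run above can also be triggered through the Amazon DataZone API, which backs SageMaker Catalog. The sketch below assumes a client that behaves like `boto3.client("datazone")` and uses its `start_data_source_run` and `get_data_source_run` operations as we understand them. The function name, the identifiers you pass in, and the set of in-progress statuses are assumptions to verify against the API reference.

```python
import time

# Statuses treated as "still in progress" (assumption; check the DataZone API reference).
IN_PROGRESS = ("REQUESTED", "RUNNING")


def run_data_source_and_wait(datazone, domain_id: str, data_source_id: str,
                             poll_seconds: float = 10.0,
                             timeout_seconds: float = 900.0) -> str:
    """Start a data source run and poll until it leaves the in-progress statuses."""
    run = datazone.start_data_source_run(
        domainIdentifier=domain_id, dataSourceIdentifier=data_source_id
    )
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = datazone.get_data_source_run(
            domainIdentifier=domain_id, identifier=run["id"]
        )["status"]
        if status not in IN_PROGRESS:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Data source run {run['id']} did not finish in time")
```

The domain ID and data source ID are shown on the project and data source details pages of SageMaker Unified Studio.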

After the data source run is complete, metadata from both data assets in the AWS Glue database is imported into the SageMaker domain as the project’s inventory assets. You can find the details of the data source run from within SageMaker Unified Studio, which include:

  • The data assets from the AWS Glue database that were ingested into SageMaker.
  • The status of the data lineage import for each data asset, which includes an event ID for traceability. This lineage event ID can be used to debug inconsistencies in the resulting lineage graph. You can use the GetLineageEvent API to retrieve the raw payload of the lineage event.
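As a rough illustration of the GetLineageEvent call mentioned above, the following sketch reads the raw payload of a lineage event and parses it as JSON. It assumes a client that behaves like `boto3.client("datazone")`, that the response exposes the payload as a readable stream under the `event` key, and that the payload is a JSON document. The function name is hypothetical.

```python
import json


def fetch_lineage_event(datazone, domain_id: str, event_id: str) -> dict:
    """Retrieve a lineage event and return its payload as a dictionary."""
    response = datazone.get_lineage_event(
        domainIdentifier=domain_id, identifier=event_id
    )
    # The payload is assumed to be a stream of JSON bytes.
    raw = response["event"].read()
    return json.loads(raw)
```

Comparing the parsed payload with what the lineage graph displays can help narrow down whether an inconsistency comes from the event itself or from how it was processed.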

Visualizing the data lineage graph of the data assets in SageMaker Unified Studio

With SageMaker Unified Studio, you have a single place to manage and discover data assets. When accessing a data asset published in the SageMaker central catalog or in your project’s own inventory, you can dive into the asset’s metadata, which includes its schema, business description, custom metadata forms, quality, lineage, and more. To visualize the lineage graph of each data asset of this post, follow these steps:

  1. In SageMaker Unified Studio, navigate to the Assets section of the SageMaker project details page and choose INVENTORY.
  2. Select the asset that you want to explore. You can also access the asset directly from the data source run by selecting the asset name.
  3. To view the lineage graph of the data asset up to its origin, shown in the following screenshots, choose the LINEAGE tab.
    • For the clickstream table (sourced from S3)

    • For the order transactions table (sourced from DynamoDB)

With lineage, you can now confirm that the data originated from sources such as Amazon S3 and Amazon DynamoDB and understand how it has been transformed along the way. Thanks to this end-to-end visibility, you can trust the data, make informed decisions, and support compliance with confidence. The lineage graph captures essential metadata that forms the foundation of lineage tracking.

  • This includes table schemas, column definitions, and their data types.
  • Column-level lineage becomes particularly powerful in this context. Imagine your clickstream’s AWS Glue table powers an Amazon QuickSight dashboard analyzing customer purchase patterns, and you notice discrepancies in your revenue reports. With column lineage, you can immediately trace the source of those columns.
  • This granular visibility not only accelerates debugging but also proves invaluable during schema changes, as we show in the following section by changing the source schema.
  • The crawler details, such as crawlerRunId (present in the source identifier of the lineage node) and the crawler start and end times, can be used to debug which crawler runs updated the table.

Understanding your data asset’s schema evolution through lineage in SageMaker Unified Studio

Consider the case where the order transactions source in DynamoDB is updated with new information. Because this source powers an Amazon QuickSight report for the customer using the AWS Glue database table, it’s important for consumers to know what changes in the data pipeline updated the report.

  1. Edit the DynamoDB table item with additional attributes to learn how the lineage graph can be used to view historical updates:
{
    "order_id": "ord789",
    "user_id": "u789",
    "product_id": "prod456",
    "order_total": 79.99,
    "order_timestamp": "2025-06-04T09:27:10Z",
    "customerSegment": "new-customer",
    "conversionSource": "primeDayEmailCampaign"
}

  2. Run the OrderTransactionsCrawler AWS Glue crawler again on the AWS console. After completion, you’ll notice that it updated the ordertransactiontable AWS Glue table, as shown in the following screenshot.

  3. Run the data source associated with the project in SageMaker Unified Studio again to import the latest metadata into the SageMaker Catalog. After completion, you’ll notice the data source updated the ordertransactiontable data asset in the SageMaker Catalog, as shown in the following screenshot.
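The item edit in the first step can be done on the DynamoDB console, or scripted. The sketch below prepares the updated item for the boto3 DynamoDB resource API, which expects numbers as `Decimal` rather than `float`. The helper name is hypothetical, and the table's key schema is not stated in this post, so the sketch assumes the key attributes are among the fields shown.

```python
from decimal import Decimal


def to_dynamodb_item(document: dict) -> dict:
    """Convert float values to Decimal so the boto3 DynamoDB resource API accepts them."""
    return {
        key: Decimal(str(value)) if isinstance(value, float) else value
        for key, value in document.items()
    }


# The original sample item plus the two new attributes.
updated_order = to_dynamodb_item({
    "order_id": "ord789",
    "user_id": "u789",
    "product_id": "prod456",
    "order_total": 79.99,
    "order_timestamp": "2025-06-04T09:27:10Z",
    "customerSegment": "new-customer",
    "conversionSource": "primeDayEmailCampaign",
})
```

Assuming boto3 is available, `boto3.resource("dynamodb").Table("OrderTransactionTable").put_item(Item=updated_order)` overwrites the existing item that has the same primary key.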

This section explores how lineage can be used to track the updates.

Navigate to the ordertransactiontable data asset in SageMaker Catalog by selecting it from the data source run and choose the LINEAGE tab, as shown in the following screenshot.

Notice how the new columns are available in the lineage graph. A new crawler run ID is present as the source identifier of the crawler lineage node. The HISTORY tab shows multiple crawler runs. You can navigate to inspect the state of the system during the first run.
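The comparison that the HISTORY tab supports visually can also be approximated outside the UI. The following sketch is a helper for illustration only, not part of SageMaker: it diffs two column lists shaped like the `Columns` entries that AWS Glue returns in a table's `StorageDescriptor` (each a dictionary with `Name` and `Type`).

```python
def diff_columns(previous: list, current: list) -> dict:
    """Report columns added, removed, or retyped between two schema versions."""
    before = {column["Name"]: column["Type"] for column in previous}
    after = {column["Name"]: column["Type"] for column in current}
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "retyped": sorted(
            name for name in before.keys() & after.keys() if before[name] != after[name]
        ),
    }
```

The two column lists could come from successive entries of the Glue `get_table_versions` response for ordertransactiontable. Column names in the catalog may differ in case from the attribute names in the source item.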

Cleanup

After you’re done, we recommend cleaning up the resources created for this post to avoid unintended charges:

  1. Delete the inventory assets that were cataloged in the SageMaker project’s inventory, as explained in Delete an Amazon SageMaker Unified Studio asset.
  2. Delete the SageMaker project that was created as part of this post, as explained in Delete a project.
  3. Delete the CloudFormation stack that was deployed as part of this post, as explained in Delete a stack from the CloudFormation console.
  4. The S3 bucket created as part of the CloudFormation stack remains after the stack’s deletion because it contains a data file. Empty and delete the bucket, as explained in Deleting a general purpose bucket.
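The last cleanup step can be scripted as well. This is a minimal sketch, assuming an object that behaves like `boto3.resource("s3")` and a bucket without versioning enabled; a versioned bucket would also need its object versions removed. The function name is hypothetical.

```python
def empty_and_delete_bucket(s3_resource, bucket_name: str) -> None:
    """Delete every object in the bucket, then delete the bucket itself."""
    bucket = s3_resource.Bucket(bucket_name)
    # A bucket must be empty before it can be deleted.
    bucket.objects.all().delete()
    bucket.delete()
```

For the bucket in this post, the name follows the pattern ecomm-analytics-<ACCOUNT_ID>-<REGION>.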

Conclusion

In this post, you explored the data lineage capabilities of Amazon SageMaker, specifically when working with AWS Glue crawlers. You learned how to set up an AWS Glue crawler to infer metadata from data assets in multiple sources such as Amazon S3 and DynamoDB and store it in the AWS Glue Data Catalog. You also imported this metadata, including data lineage, into Amazon SageMaker through the data source capability of a SageMaker project. Finally, you explored the resulting lineage graph of data assets in SageMaker Unified Studio and saw some of the functionality available to trace their origin, understand how columns are transformed, and see what the impact looks like when changing any step of the pipeline. We encourage you to now test the capabilities you explored in this post with your own data. By following the pattern presented in this post, many customers have been able to achieve governance of their data lake and lakehouse platforms on top of Amazon SageMaker with data lineage and more.


About the authors

Mohit Dawar is a Senior Software Engineer at Amazon Web Services (AWS) working on Amazon DataZone. Over the past 3 years, he has led efforts around the core metadata catalog, generative AI–powered metadata curation, and lineage visualization. He enjoys working on large-scale distributed systems, experimenting with AI to improve user experience, and building tools that make data governance feel effortless. Connect with him on LinkedIn: Mohit Dawar.

Jose Romero is a Senior Solutions Architect for Startups at Amazon Web Services (AWS) based in Austin, TX, US. He is passionate about helping customers architect modern platforms at scale for data, AI, and ML. As a former senior architect in AWS Professional Services, he enjoys building and sharing solutions for common complex problems so that customers can accelerate their cloud journey and adopt best practices. Connect with him on LinkedIn: Jose Romero.
