
Use the AWS CDK with the Data Solutions Framework to provision and manage Amazon Redshift Serverless


In February 2024, we announced the release of the Data Solutions Framework (DSF), an opinionated open source framework for building data solutions on AWS. DSF is built using the AWS Cloud Development Kit (AWS CDK) to package infrastructure components into L3 AWS CDK constructs on top of AWS services. L3 constructs are implementations of common technical patterns and create multiple resources that are configured to work with one another.

In this post, we demonstrate how to use the AWS CDK and DSF to create a multi-data warehouse platform based on Amazon Redshift Serverless. DSF simplifies the provisioning of Redshift Serverless, initialization and cataloging of data, and data sharing between different data warehouse deployments. Using a programmatic approach with the AWS CDK and DSF allows you to apply GitOps principles to your analytics workloads and realize the following benefits:

  • You can deploy using continuous integration and delivery (CI/CD) pipelines, including the definitions of Redshift objects (databases, tables, shares, and so on)
  • You can roll out changes consistently across multiple environments
  • You can bootstrap data warehouses (table creation, ingestion of data, and so on) using code and use version control to simplify the setup of testing environments
  • You can test changes before deployment using AWS CDK built-in testing capabilities (see the sketch after this list)
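As an illustration, the following is a minimal sketch of such a unit test using the aws_cdk.assertions module. The module path and stack class name (DsfRedshiftBlogStack) are assumptions that match the project initialized later in this post; the assertion only checks that the stack synthesizes and contains at least one IAM role.

import aws_cdk as cdk
from aws_cdk.assertions import Template

# Assumed module path and stack class, matching the dsf-redshift-blog
# project initialized later in this post.
from dsf_redshift_blog.dsf_redshift_blog_stack import DsfRedshiftBlogStack

def test_stack_synthesizes():
    app = cdk.App()
    stack = DsfRedshiftBlogStack(app, "TestStack")
    template = Template.from_stack(stack)

    # The synthesized template should contain at least one IAM role
    # (for example, the data lake access role created for Amazon Redshift).
    assert len(template.find_resources("AWS::IAM::Role")) >= 1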

In addition, DSF's Redshift Serverless L3 constructs provide a number of built-in capabilities that can accelerate development while helping you follow best practices. For example:

  • Running extract, transform, and load (ETL) jobs to and from Amazon Redshift is more straightforward because an AWS Glue connection resource is automatically created and configured. This means data engineers don't have to configure this resource and can use it right away with their AWS Glue ETL jobs.
  • Similarly, for discovery of data within Amazon Redshift, DSF provides a convenient way to configure an AWS Glue crawler to populate the AWS Glue Data Catalog for ease of discovery as well as ease of referencing tables when creating ETL jobs. The configured AWS Glue crawler uses an AWS Identity and Access Management (IAM) role that follows least privilege.
  • Sharing data between Redshift data warehouses is a common approach to improve collaboration between lines of business without duplicating data. DSF provides convenient methods for the end-to-end flow for both data producer and consumer.

Solution overview

The solution demonstrates a common pattern where a data warehouse is used as a serving layer for business intelligence (BI) workloads on top of data lake data. The source data is stored in Amazon Simple Storage Service (Amazon S3) buckets, then ingested into a Redshift producer data warehouse to create materialized views and aggregate data, and finally shared with a Redshift consumer running BI queries from the end-users. The following diagram illustrates the high-level architecture.

[Architecture diagram: data lake ingestion into a Redshift Serverless producer, shared with a Redshift Serverless consumer for BI queries]

In this post, we use Python for the example code. DSF also supports TypeScript.

Prerequisites

Because we're using the AWS CDK, complete the steps in Getting Started with the AWS CDK before you implement the solution.

Initialize the project and provision a Redshift Serverless namespace and workgroup

Let's start by initializing the project and including DSF as a dependency. You can run this code in your local terminal, or you can use AWS Cloud9:

mkdir dsf-redshift-blog && cd dsf-redshift-blog
cdk init --language python
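
Next, add DSF as a dependency of the project. A minimal way to do this, assuming the PyPI package name cdklabs.aws-data-solutions-framework, is to append it to the generated requirements.txt and install it into the project's virtual environment:

# Activate the virtual environment created by cdk init (create it first if needed)
python3 -m venv .venv && source .venv/bin/activate

# Add the DSF package (assumed PyPI name) and install all dependencies
echo "cdklabs.aws-data-solutions-framework" >> requirements.txt
pip install -r requirements.txt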

Open the project folder in your IDE and complete the following steps:

  1. Open the app.py file.
  2. In this file, make sure to uncomment the first env line, as shown in the example after these steps. This configures the AWS CDK environment depending on the AWS profile used during the deployment.
  3. Add a configuration flag in the cdk.context.json file at the root of the project (if it doesn't exist, create the file):
    {  
        "@data-solutions-framework-on-aws/removeDataOnDestroy": true 
    }

Setting the @data-solutions-framework-on-aws/removeDataOnDestroy configuration flag to true makes sure that resources that have the removal_policy parameter set to RemovalPolicy.DESTROY are destroyed when the AWS CDK stack is deleted. This is a guardrail DSF uses to avoid accidentally deleting data.
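
For reference, the env line mentioned in step 2 comes from the app.py template that cdk init generates; once uncommented, it typically looks like the following (the stack class name reflects the project name):

import os
import aws_cdk as cdk
from dsf_redshift_blog.dsf_redshift_blog_stack import DsfRedshiftBlogStack

app = cdk.App()
DsfRedshiftBlogStack(app, "DsfRedshiftBlogStack",
    # Use the account and Region implied by the current AWS CLI profile.
    env=cdk.Environment(account=os.getenv('CDK_DEFAULT_ACCOUNT'),
                        region=os.getenv('CDK_DEFAULT_REGION')),
    )
app.synth()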

Now that the project is configured, you can start adding resources to the stack.

  4. Navigate to the dsf_redshift_blog folder and open the dsf_redshift_blog_stack.py file.

This is where we configure the resources to be deployed.

  5. To get started building the end-to-end demo, add the following import statements at the top of the file, which allows you to start defining resources from both the AWS CDK core library and DSF:
    from aws_cdk import (
        RemovalPolicy,
        Stack
    )
    
    from aws_cdk.aws_s3 import Bucket
    from aws_cdk.aws_iam import Role, ServicePrincipal
    from constructs import Construct
    from cdklabs import aws_data_solutions_framework as dsf

We use several DSF-specific constructs to build the demo:

  • DataLakeStorage – This creates three S3 buckets, named Bronze, Silver, and Gold, to represent the different data layers.
  • S3DataCopy – This manages the copying of data from one bucket to another.
  • RedshiftServerlessNamespace – This creates a Redshift Serverless namespace where database objects and users are stored.
  • RedshiftServerlessWorkgroup – This creates a Redshift Serverless workgroup that contains compute- and network-related configurations for the data warehouse. This is also the entry point for several convenient functionalities that DSF provides, such as cataloging of Redshift tables, running SQL statements as part of the AWS CDK (such as creating tables, data ingestion, merging of tables, and more), and sharing datasets across different Redshift clusters without moving data.
  6. Now that you've imported the libraries, create a set of S3 buckets following the medallion architecture best practices with bronze, silver, and gold data layers.

The high-level definitions of each layer are as follows:

  • Bronze represents raw data; this is where data from various source systems lands. No schema is required.
  • Silver is cleaned and potentially augmented data. The schema is enforced in this layer.
  • Gold is data that's further refined and aggregated to serve a specific business need.

Using the DataLakeStorage construct, you can create these three S3 buckets with the following best practices:

  • Encryption at rest through AWS Key Management Service (AWS KMS) is turned on
  • SSL is enforced
  • The use of S3 bucket keys is turned on
  • There's a default S3 lifecycle rule defined to delete incomplete multipart uploads after 1 day
    data_lake = dsf.storage.DataLakeStorage(self,
        'DataLake',
        removal_policy=RemovalPolicy.DESTROY)

  7. After you create the S3 buckets, copy over the data using the S3DataCopy construct. For this demo, we land the data in the Silver bucket because it's already cleaned:
    source_bucket = Bucket.from_bucket_name(self, 
        'SourceBucket', 
        bucket_name="redshift-immersionday-labs")
    
    data_copy = dsf.utils.S3DataCopy(self,
        'SourceData', 
        source_bucket=source_bucket, 
        source_bucket_prefix='data/amazon-reviews/', 
        source_bucket_region='us-west-2', 
        target_bucket=data_lake.silver_bucket, 
        target_bucket_prefix='silver/amazon-reviews/')

  8. For Amazon Redshift to ingest the data in Amazon S3, it needs an IAM role with the correct permissions. This role will be associated with the Redshift Serverless namespace that you create next.
    lake_role = Role(self, 
        'LakeRole', 
        assumed_by=ServicePrincipal('redshift.amazonaws.com'))
    
    data_lake.silver_bucket.grant_read(lake_role)

  9. To provision Redshift Serverless, configure two resources: a namespace and a workgroup. DSF provides L3 constructs for both:
    1. RedshiftServerlessNamespace
    2. RedshiftServerlessWorkgroup

    Both constructs follow security best practices, including:

    • The default virtual private cloud (VPC) uses private subnets (with public access disabled).
    • Data is encrypted at rest through AWS KMS with automatic key rotation.
    • Admin credentials are stored in AWS Secrets Manager with automatic rotation managed by Amazon Redshift.
    • A default AWS Glue connection is automatically created using private connectivity. This can be used by AWS Glue crawlers as well as AWS Glue ETL jobs to connect to Amazon Redshift.

    The RedshiftServerlessWorkgroup construct is the main entry point for other capabilities, such as integration with the AWS Glue Data Catalog, the Redshift Data API, and the Data Sharing API.

  10. In the following example, use the defaults provided by the construct and associate the IAM role that you created earlier to give Amazon Redshift access to the data lake for data ingestion:
      namespace = dsf.consumption.RedshiftServerlessNamespace(self, 
          'Namespace', 
          db_name="defaultdb", 
          title="producer", 
          removal_policy=RemovalPolicy.DESTROY, 
          default_iam_role=lake_role)
      
      workgroup = dsf.consumption.RedshiftServerlessWorkgroup(self, 
          'Workgroup', 
          title="producer", 
          namespace=namespace, 
          removal_policy=RemovalPolicy.DESTROY)

Create tables and ingest data

To create a table, you can use the runCustomSQL method in the RedshiftServerlessWorkgroup construct. This method allows you to run arbitrary SQL statements when the resource is being created (such as create table or create materialized view) and when it's being deleted (such as drop table or drop materialized view).

Add the following code after the RedshiftServerlessWorkgroup instantiation:

create_amazon_reviews_table = workgroup.run_custom_sql('CreateAmazonReviewsTable', 
    database_name="defaultdb", 
    sql="CREATE TABLE amazon_reviews (market character various(16383) ENCODE lzo, customer_id character various(16383) ENCODE lzo, review_id character various(16383) ENCODE lzo, product_id character various(16383) ENCODE lzo, product_parent character various(16383) ENCODE lzo, product_title character various(16383) ENCODE lzo, star_rating integer ENCODE az64, helpful_votes integer ENCODE az64, total_votes integer ENCODE az64, vine character various(16383) ENCODE lzo, verified_purchase character various(16383) ENCODE lzo, review_headline character various(max) ENCODE lzo, review_body character various(max) ENCODE lzo, review_date date ENCODE az64, yr integer ENCODE az64) DISTSTYLE AUTO;", 
    delete_sql="drop desk amazon_reviews")

load_amazon_reviews_data = workgroup.ingest_data('amazon_reviews_ingest_data', 
    'defaultdb', 
    'amazon_reviews', 
    data_lake.silver_bucket, 
    'silver/amazon-reviews/', 
    'FORMAT parquet')

load_amazon_reviews_data.node.add_dependency(create_amazon_reviews_table)
load_amazon_reviews_data.node.add_dependency(data_copy)

Given the asynchronous nature of some of the resource creation, we also enforce dependencies between certain resources; otherwise, the AWS CDK would try to create them in parallel to accelerate the deployment. The preceding dependency statements establish the following:

  • Before you load the data, the S3 data copy is complete, so the data exists in the source bucket of the ingestion
  • Before you load the data, the target table has been created in the Redshift namespace

Bootstrapping example (materialized views)

The workgroup.run_custom_sql() method provides flexibility in how you can bootstrap your Redshift data warehouse using the AWS CDK. For example, you can create a materialized view to improve query performance by pre-aggregating data from the Amazon reviews:

materialized_view = workgroup.run_custom_sql('MvProductAnalysis',
    database_name="defaultdb",
    sql=f'''CREATE MATERIALIZED VIEW mv_product_analysis AS SELECT review_date, product_title, COUNT(1) AS review_total, SUM(star_rating) AS rating FROM amazon_reviews WHERE marketplace='US' GROUP BY 1,2;''',
    delete_sql="drop materialized view mv_product_analysis")

materialized_view.node.add_dependency(load_amazon_reviews_data)

Catalog tables in Amazon Redshift

The deployment of RedshiftServerlessWorkgroup automatically creates an AWS Glue connection resource that can be used by AWS Glue crawlers and AWS Glue ETL jobs. This is directly exposed from the workgroup construct through the glue_connection property. Using this connection, the workgroup construct exposes a convenient method to catalog the tables inside the associated Redshift Serverless namespace. The following is example code:

workgroup.catalog_tables('DefaultDBCatalog', 'mv_product_analysis')

This single line of code creates a database in the Data Catalog named mv_product_analysis and the associated crawler with the IAM role and network configuration already set up. By default, it crawls all the tables inside the public schema in the default database indicated when the Redshift Serverless namespace was created. To override this, the third parameter in the catalogTables method allows you to define a pattern of what to crawl (see the JDBC data store in the include path).
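
For example, a call that narrows the crawl scope could look like the following; the third argument here is an illustrative assumption using the AWS Glue JDBC include-path syntax (database/schema/table, with % as a wildcard):

workgroup.catalog_tables('DefaultDBCatalog', 'mv_product_analysis', 'defaultdb/public/amazon_%')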

You can run the crawler using the AWS Glue console, or invoke it using the SDK, the AWS Command Line Interface (AWS CLI), or the AWS CDK using AwsCustomResource.
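
The following is a minimal sketch of the AWS CDK option using AwsCustomResource from aws-cdk-lib; the crawler name is a placeholder to replace with the name of the crawler created by catalog_tables:

from aws_cdk.custom_resources import (
    AwsCustomResource,
    AwsCustomResourcePolicy,
    AwsSdkCall,
    PhysicalResourceId,
)

# Placeholder: replace with the name of the crawler created by catalog_tables.
crawler_name = '<CRAWLER_NAME>'

AwsCustomResource(self, 'StartRedshiftCrawler',
    on_create=AwsSdkCall(
        service='Glue',
        action='startCrawler',
        parameters={'Name': crawler_name},
        physical_resource_id=PhysicalResourceId.of('StartRedshiftCrawler')),
    policy=AwsCustomResourcePolicy.from_sdk_calls(
        resources=AwsCustomResourcePolicy.ANY_RESOURCE))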

Data sharing

DSF supports Redshift data sharing on both sides (producers and consumers) as well as same-account and cross-account scenarios. Let's create another Redshift Serverless namespace and workgroup to demonstrate the interaction:

namespace2 = dsf.consumption.RedshiftServerlessNamespace(self, 
    "Namespace2", 
    db_name="defaultdb", 
    title="shopper", 
    default_iam_role=lake_role, 
    removal_policy=RemovalPolicy.DESTROY)

workgroup2 = dsf.consumption.RedshiftServerlessWorkgroup(self, 
    "Workgroup2", 
    title="shopper", 
    namespace=namespace2, 
    removal_policy=RemovalPolicy.DESTROY)

For producers

For producers, complete the following steps:

  1. Create the new share and populate the share with the schema or tables:
    data_share = workgroup.create_share('DataSharing', 
        'defaultdb', 
        'defaultdbshare', 
        'public', ['mv_product_analysis'])
    
    data_share.new_share_custom_resource.node.add_dependency(materialized_view)
  2. Create access grants:
    • To grant to a cluster in the same account:
      share_grant = workgroup.grant_access_to_share("GrantToSameAccount", 
          data_share, 
          namespace2.namespace_id)
      
      share_grant.resource.node.add_dependency(data_share.new_share_custom_resource)
      share_grant.resource.node.add_dependency(namespace2)
    • To grant to a different account:
      workgroup.grant_access_to_share('GrantToDifferentAccount', 
          tpcdsShare, 
          None, 
          '<ACCOUNT_ID_OF_CONSUMER>', 
          True)

The last parameter in the grant_access_to_share method allows you to automatically authorize cross-account access on the data share. Omitting this parameter would default to no authorization, which means a Redshift administrator needs to authorize the cross-account share using either the AWS CLI, SDK, or Amazon Redshift console.

For consumers

For the same account share, to create the database from the share, use the following code:

create_db_from_share = workgroup2.create_database_from_share("CreateDatabaseFromShare", 
    "advertising and marketing", 
    data_share.data_share_name, 
    data_share.producer_namespace)

create_db_from_share.resource.node.add_dependency(share_grant.resource)
create_db_from_share.resource.node.add_dependency(workgroup2)

For cross-account grants, the syntax is similar, but you need to indicate the producer account ID:

consumerWorkgroup.create_database_from_share('CreateCrossAccountDatabaseFromShare', 
    'tpcds', 
    <PRODUCER_SHARE_NAME>, 
    <PRODUCER_NAMESPACE_ID>, 
    <PRODUCER_ACCOUNT_ID>)

To see the full working example, follow the instructions in the accompanying GitHub repository.

Deploy the resources using the AWS CDK

To deploy the resources, run the following code:
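
cdk deploy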

You can review the resources created, as shown in the following screenshot.

Confirm the changes for the deployment to start. Wait a few minutes for the project to be deployed; you can keep track of the deployment using the AWS CLI or the AWS CloudFormation console.

When the deployment is complete, you should see two Redshift workgroups (one producer and one consumer).

Using Amazon Redshift Query Editor v2, you can log in to the producer Redshift workgroup using Secrets Manager, as shown in the following screenshot.

[Screenshot: Producer workgroup login in Query Editor v2]

After you log in, you can see the tables and views that you created using DSF in the defaultdb database.

[Screenshot: Tables and views in Query Editor v2]

Log in to the consumer Redshift workgroup to see the shared dataset from the producer Redshift workgroup under the marketing database.

Clean up

You can run cdk destroy in your local terminal to delete the stack. Because you marked the constructs with RemovalPolicy.DESTROY and configured DSF to remove data on destroy, running cdk destroy or deleting the stack from the AWS CloudFormation console will clean up the provisioned resources.

Conclusion

In this post, we demonstrated how to use the AWS CDK along with DSF to manage Redshift Serverless as code. Codifying the deployment of resources helps provide consistency across multiple environments. Aside from infrastructure, DSF also provides capabilities to bootstrap Amazon Redshift (table creation, ingestion of data, and more) and manage objects, all from the AWS CDK. This means that changes can be version controlled, reviewed, and even unit tested.

In addition to Redshift Serverless, DSF supports other AWS services, such as Amazon Athena, Amazon EMR, and many more. Our roadmap is publicly available, and we look forward to your feature requests, contributions, and feedback.

You can get started using DSF by following our quick start guide.


About the authors


Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.
Vincent Gromakowski is an Analytics Specialist Solutions Architect at AWS where he enjoys solving customers' analytics, NoSQL, and streaming challenges. He has strong expertise in distributed data processing engines and resource orchestration platforms.
