Amazon Redshift has established itself as a highly scalable, fully managed cloud data warehouse trusted by tens of thousands of customers for its superior price-performance and advanced data analytics capabilities. Driven primarily by customer feedback, the product roadmap for Amazon Redshift is designed to make sure the service continually evolves to meet the ever-changing needs of its users.
Over the years, this customer-centric approach has led to the introduction of groundbreaking features such as zero-ETL, data sharing, streaming ingestion, data lake integration, Amazon Redshift ML, Amazon Q generative SQL, and transactional data lake capabilities. The latest innovation in Amazon Redshift data sharing capabilities further enhances the service's flexibility and collaboration potential.
Amazon Redshift now allows the secure sharing of data lake tables (also known as external tables or Amazon Redshift Spectrum tables) that are managed in the AWS Glue Data Catalog, as well as Redshift views referencing those data lake tables. This empowers data analytics to span the full breadth of shareable data, allowing you to seamlessly share native tables and data lake tables across warehouses, accounts, and AWS Regions, without the overhead of physically moving data or recreating security policies for data lake tables and Redshift views on each warehouse.
By using granular access controls, data sharing in Amazon Redshift helps data owners maintain tight governance over who can access the shared information. In this post, we explore powerful use cases that demonstrate how you can enhance cross-team and cross-organizational collaboration, reduce overhead, and unlock new insights by using this new data sharing functionality.
Overview of Amazon Redshift data sharing
Amazon Redshift data sharing allows you to securely share your data with other Redshift warehouses without having to copy or move it.
Data shared between warehouses doesn't need to be physically copied or moved; instead, the data stays in the original Redshift warehouse, and access is granted to other authorized users as part of a one-time setup. Data sharing provides granular access control, allowing you to control which specific tables or views are shared, and which users or services can access the shared data.
Because consumers access the shared data in place, they always see the latest state of the shared data. Data sharing even allows for the automatic sharing of new tables created after the datashare was established.
You can share data across different Redshift warehouses within or across AWS accounts, and you can also share data across AWS Regions. This lets you share data with partners, subsidiaries, or other parts of your organization, and enables the powerful workload isolation use case, as shown in the following diagram. With the seamless integration of Amazon Redshift with AWS Data Exchange, data can also be monetized and shared publicly, and public datasets such as census data can be added to a Redshift warehouse in just a few steps.
The data sharing capabilities in Amazon Redshift also enable the implementation of a data mesh architecture, as shown in the following diagram. This helps democratize data across the organization by reducing barriers to accessing and using data across different business units and teams. For datasets with multiple authors, Amazon Redshift data sharing supports both read and write use cases (write in preview at the time of writing). This enables the creation of 360-degree datasets, such as a customer dataset that receives contributions from multiple Redshift warehouses across different business units in the organization.
Overview of Redshift Spectrum and data lake tables
In the modern data organization, the data lake has emerged as a centralized repository: a single source of truth where all data across the organization ultimately resides at some point in its lifecycle. Redshift Spectrum enables seamless integration between the Redshift data warehouse and customers' data lakes, as shown in the following diagram. With Redshift Spectrum, you can run SQL queries directly against data stored in Amazon Simple Storage Service (Amazon S3), without having to first load that data into a Redshift warehouse. This allows you to maintain a comprehensive view of your data while optimizing for cost-efficiency.
Redshift Spectrum supports a variety of open file formats, including Parquet, ORC, JSON, and CSV, as well as open table formats such as Apache Iceberg, all stored in Amazon S3. It runs these queries using a dedicated fleet of high-performance servers with low-latency connections to the S3 data lake. Data lake tables can be added to a Redshift warehouse either automatically through the Data Catalog, in the Amazon Redshift Query Editor, or manually using SQL commands.
From a user experience standpoint, there is little difference between querying a local Redshift table and a data lake table. SQL queries can be reused verbatim to perform the same aggregations and transformations on data residing in the data lake, as shown in the following examples. Additionally, by using columnar file formats like Parquet and pushing down query predicates, you can achieve further performance improvements.
The following SQL is a sample query against native Redshift tables:
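For illustration, assume a simple revenue roll-up over hypothetical orders and customers tables (table names, columns, and filter values are placeholders, not from the original post):

```sql
-- Monthly revenue for one region, computed from native Redshift tables
SELECT DATE_TRUNC('month', o.order_date) AS order_month,
       SUM(o.total_price)                AS monthly_revenue
FROM public.orders o
JOIN public.customers c ON c.customer_id = o.customer_id
WHERE c.region = 'EMEA'
GROUP BY 1
ORDER BY 1;
```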
The following SQL is the same query, but against data lake tables:
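Only the schema qualifier changes; here a hypothetical external schema named datalake_schema maps the same logical tables stored in Amazon S3:

```sql
-- The same aggregation, now reading data in the data lake through
-- Redshift Spectrum (datalake_schema is an external schema)
SELECT DATE_TRUNC('month', o.order_date) AS order_month,
       SUM(o.total_price)                AS monthly_revenue
FROM datalake_schema.orders o
JOIN datalake_schema.customers c ON c.customer_id = o.customer_id
WHERE c.region = 'EMEA'
GROUP BY 1
ORDER BY 1;
```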
To maintain strong data governance, Redshift Spectrum integrates with AWS Lake Formation, enabling the consistent application of security policies and access controls across both the Redshift data warehouse and the S3 data lake. When Lake Formation is used, Redshift producer warehouses first share their data with Lake Formation rather than directly with other Redshift consumer warehouses, and the data lake administrator grants fine-grained permissions for Redshift consumer warehouses to access the shared data. For more information, see Centrally manage access and permissions for Amazon Redshift data sharing with AWS Lake Formation.
In the past, however, sharing data lake tables across Redshift warehouses presented challenges. It wasn't possible to do so without mounting the data lake tables on each individual Redshift warehouse and then recreating the associated security policies.
This barrier has now been removed with the introduction of data sharing support for data lake tables. You can now share data lake tables just like any other table, using the built-in data sharing capabilities of Amazon Redshift. By combining the power of Redshift Spectrum data lake integration with the flexibility of Amazon Redshift data sharing, organizations can unlock new levels of cross-team collaboration and insights, while maintaining strong data governance and security controls.
For more information about Redshift Spectrum, see Getting started with Amazon Redshift Spectrum.
Solution overview
In this post, we describe how to add data lake tables or views to a Redshift datashare, covering two key use cases:
- Adding a late-binding view or materialized view that references a data lake table to a producer datashare
- Adding a data lake table directly to a producer datashare
The first use case provides greater flexibility and convenience. Consumers can query the shared view without having to configure fine-grained permissions. The configuration, such as defining permissions on data stored in Amazon S3 with Lake Formation, is already handled on the producer side. You only need to add the view to the producer datashare once, making it a convenient option for both the producer and the consumer.
An additional benefit of this approach is that you can add views to a datashare that join data lake tables with native Redshift tables. When these views are shared, you can keep the trusted business logic solely on the producer side.
Alternatively, you can add data lake tables directly to a datashare. In this case, consumers can query the data lake tables directly or join them with their own local tables, allowing them to add their own conditional logic as needed.
Add a view that references a data lake table to a Redshift datashare
When you create data lake tables that you intend to add to a datashare, the recommended and most common approach is to add a view to the datashare that references one or more data lake tables. There are three high-level steps involved:
- Add the Redshift view's schema (the local schema) to the Redshift datashare.
- Add the Redshift view (the local view) to the Redshift datashare.
- Add the Redshift external schemas (for the tables referenced by the Redshift view) to the Redshift datashare.
The following diagram illustrates the full workflow.
The workflow consists of the following steps:
- Create a data lake table on the datashare producer. For more information on creating Redshift Spectrum objects, see External schemas for Amazon Redshift Spectrum. Data lake tables to be shared can include Lake Formation registered tables and Data Catalog tables, and if you use the Redshift Query Editor, these tables are automatically mounted.
- Create a view on the producer that references the data lake table you created.
- Create a datashare, if one doesn't already exist, and add objects to your datashare, including the view you created that references the data lake table. For more information, see Creating datashares and adding objects (preview).
- Add the external schema of the base Redshift table to the datashare (this applies to both local base tables and data lake tables). You don't have to add the data lake table itself to the datashare.
- On the consumer, the administrator makes the view available to consumer database users.
- Consumer database users can write queries to retrieve data from the shared view and join it with other tables and views on the consumer.
After these steps are complete, consumer database users with access to the datashare views can reference them in their SQL queries. The following SQL queries are examples for achieving the preceding steps.
Create a data lake table on the producer warehouse:
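The following sketch assumes a Glue database named datalake_db, an external schema named datalake_schema, and placeholder values for the S3 location and IAM role ARN:

```sql
-- Map the Glue Data Catalog database into Redshift as an external schema
CREATE EXTERNAL SCHEMA IF NOT EXISTS datalake_schema
FROM DATA CATALOG
DATABASE 'datalake_db'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftSpectrumRole';

-- Define a data lake table over Parquet files in Amazon S3
CREATE EXTERNAL TABLE datalake_schema.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    total_price DECIMAL(12,2)
)
STORED AS PARQUET
LOCATION 's3://example-datalake-bucket/orders/';
```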
Create a view on the producer warehouse:
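For example, a late-binding view over the hypothetical data lake table created above (WITH NO SCHEMA BINDING is required for views that reference external tables):

```sql
-- Late-binding view on the producer that encapsulates the business logic
CREATE VIEW public.orders_summary_view AS
SELECT order_date,
       SUM(total_price) AS daily_revenue
FROM datalake_schema.orders
GROUP BY order_date
WITH NO SCHEMA BINDING;
```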
Add the view to the datashare on the producer warehouse:
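Assuming a datashare named lake_share and the objects from the previous steps (the consumer namespace GUID is a placeholder):

```sql
-- Create the datashare and add the view's local schema, the view itself,
-- and the external schema of the data lake table the view references
CREATE DATASHARE lake_share;
ALTER DATASHARE lake_share ADD SCHEMA public;
ALTER DATASHARE lake_share ADD TABLE public.orders_summary_view;
ALTER DATASHARE lake_share ADD SCHEMA datalake_schema;

-- Authorize the consumer warehouse (namespace GUID shown is a placeholder)
GRANT USAGE ON DATASHARE lake_share
TO NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';
```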
Create a database from the datashare on the consumer warehouse and grant permissions for the view:
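On the consumer, the administrator surfaces the datashare as a database and grants access to it; the database, user, and namespace names below are placeholders:

```sql
-- Create a local database from the shared datashare
CREATE DATABASE lake_share_db
FROM DATASHARE lake_share OF NAMESPACE 'ffffffff-1111-2222-3333-444444444444';

-- Allow a consumer database user to use the shared objects
GRANT USAGE ON DATABASE lake_share_db TO analyst_user;

-- Consumer users query the shared view with three-part notation
SELECT * FROM lake_share_db.public.orders_summary_view LIMIT 10;
```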
Add a data lake table directly to a Redshift datashare
Adding a data lake table to a datashare is similar to adding a view. This approach works well when consumers want the raw data from the data lake table and want to write their own queries and join it with tables in their own data warehouse. There are two high-level steps involved:
- Add the Redshift external schemas (of the data lake tables to be shared) to the Redshift datashare.
- Add the data lake table (the Redshift external table) to the Redshift datashare.
The following diagram illustrates the full workflow.
The workflow consists of the following steps:
- Create a data lake table on the datashare producer.
- Add objects to your datashare, including the data lake table you created. In this case, there is no abstraction layer over the table.
- On the consumer, the administrator makes the table available.
- Consumer database users can write queries to retrieve data from the shared table and join it with other tables and views on the consumer.
The following SQL queries are examples for achieving the preceding producer steps.
Create a data lake table on the producer warehouse:
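A sketch with placeholder names; the external schema statement can be skipped if the schema already exists on the producer:

```sql
-- External schema over the Glue Data Catalog (same pattern as before)
CREATE EXTERNAL SCHEMA IF NOT EXISTS datalake_schema
FROM DATA CATALOG
DATABASE 'datalake_db'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftSpectrumRole';

-- Data lake table over CSV files in Amazon S3
CREATE EXTERNAL TABLE datalake_schema.customers (
    customer_id   BIGINT,
    customer_name VARCHAR(100),
    region        VARCHAR(20)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://example-datalake-bucket/customers/';
```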
Add the data lake schema and table directly to the datashare on the producer warehouse:
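Assuming a datashare named lake_table_share (the consumer namespace GUID is again a placeholder):

```sql
CREATE DATASHARE lake_table_share;

-- Add the external schema and the data lake table itself to the datashare
ALTER DATASHARE lake_table_share ADD SCHEMA datalake_schema;
ALTER DATASHARE lake_table_share ADD TABLE datalake_schema.customers;

GRANT USAGE ON DATASHARE lake_table_share
TO NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';
```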
Create a database from the datashare on the consumer warehouse and grant permissions for the table:
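A consumer-side sketch; the local support_tickets table, the user, and the namespace GUID are illustrative:

```sql
-- Surface the datashare as a local database on the consumer
CREATE DATABASE lake_table_share_db
FROM DATASHARE lake_table_share OF NAMESPACE 'ffffffff-1111-2222-3333-444444444444';

GRANT USAGE ON DATABASE lake_table_share_db TO analyst_user;

-- Join the shared data lake table with a local consumer table
SELECT c.region,
       COUNT(*) AS open_tickets
FROM lake_table_share_db.datalake_schema.customers c
JOIN public.support_tickets t ON t.customer_id = c.customer_id
GROUP BY c.region;
```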
Security considerations for sharing data lake tables and views
Data lake tables are stored outside of Amazon Redshift, in the data lake, and may not be owned by the Redshift warehouse, but they are still referenced within Amazon Redshift. This setup requires special security considerations. Data lake tables operate under the security and governance of both Amazon Redshift and the data lake. For Lake Formation registered tables specifically, the Amazon S3 resources are secured by Lake Formation and made available to consumers using the provided credentials.
The owner of the data in the data lake tables might want to impose restrictions on which external objects can be added to a datashare. To give data owners more control over whether warehouse users can share data lake tables, you can use session tags in AWS Identity and Access Management (IAM). These tags provide additional context about the user running the queries. For more details on tagging resources, refer to Tags for AWS Identity and Access Management resources.
Audit considerations for sharing data lake tables and views
When sharing data lake objects through a datashare, there are specific logging considerations to keep in mind:
- Access controls – You can also use CloudTrail log data together with IAM policies to control access to shared tables, covering both Redshift datashare producers and consumers. The CloudTrail logs record details about who accesses shared tables. The identifiers in the log data are available in the `ExternalId` field under the `AssumeRole` CloudTrail logs. The data owner can configure additional limitations on data access in an IAM policy through actions. For more information about defining data access through policies, see Access to AWS accounts owned by third parties.
- Centralized access – Amazon S3 resources such as data lake tables can be registered and centrally managed with Lake Formation. After they are registered with Lake Formation, Amazon S3 resources are secured and governed by the associated Lake Formation policies and made available using the credentials provided by Lake Formation.
Billing considerations for sharing data lake tables and views
The billing model for Redshift Spectrum differs between Amazon Redshift provisioned and serverless warehouses. For provisioned warehouses, Redshift Spectrum queries (queries involving data lake tables) are billed based on the amount of data scanned during query execution. For serverless warehouses, data lake queries are billed the same as non-data-lake queries. Storage for data lake tables is always billed to the AWS account associated with the Amazon S3 data.
For datashares involving data lake tables, the costs of storing and scanning data lake objects in a datashare are attributed as follows:
- When a consumer queries shared objects from a data lake, the cost of scanning is billed to the consumer:
- When the consumer is a provisioned warehouse, Amazon Redshift uses Redshift Spectrum to scan the Amazon S3 data. Therefore, the Redshift Spectrum cost is billed to the consumer account.
- When the consumer is an Amazon Redshift Serverless workgroup, there is no separate charge for data lake queries.
- Amazon S3 costs for storage and operations, such as listing buckets, are billed to the account that owns each S3 bucket.
For detailed information on Redshift Spectrum billing, refer to Amazon Redshift pricing and Billing for storage.
Conclusion
In this post, we explored how the enhanced data sharing capabilities of Amazon Redshift, including support for sharing data lake tables and Redshift views that reference those data lake tables, empower organizations to unlock the full potential of their data by bringing the full breadth of data assets into scope for advanced analytics. Organizations can now seamlessly share native tables and data lake tables across warehouses, accounts, and Regions.
We outlined the steps to securely share data lake tables and views that reference those data lake tables across Redshift warehouses, even those in separate AWS accounts or Regions. Additionally, we covered some considerations and best practices to keep in mind when using this new feature.
Sharing data lake tables and views through Amazon Redshift data sharing supports the modern, data-driven organization's goal of democratizing data access in a secure, scalable, and efficient manner. By eliminating the need for physical data movement or duplication, this capability reduces overhead and enables seamless cross-team and cross-organizational collaboration. Unleashing the full potential of your data analytics to span the full breadth of your native tables and data lake tables is only a few steps away.
For more information on Amazon Redshift data sharing and how it can benefit your organization, refer to the following resources:
You can also reach out to your AWS technical account manager or AWS account Solutions Architect. They will be happy to provide additional guidance and support.
About the Authors
Mohammed Alkateb is an Engineering Manager at Amazon Redshift. Prior to joining Amazon, Mohammed had 12 years of industry experience in query optimization and database internals as an individual contributor and engineering manager. Mohammed has 18 US patents, and he has publications in the research and industrial tracks of premier database conferences including EDBT, ICDE, SIGMOD, and VLDB. Mohammed holds a PhD in Computer Science from the University of Vermont, and MSc and BSc degrees in Information Systems from Cairo University.
Ramchandra Anil Kulkarni is a software development engineer who has been with Amazon Redshift for over 4 years. He is driven to develop database innovations that serve AWS customers globally. Kulkarni's long-standing tenure and dedication to the Amazon Redshift service demonstrate his deep expertise and commitment to delivering cutting-edge database solutions that empower AWS customers worldwide.
Mark Lyons is a Principal Product Manager on the Amazon Redshift team. He works at the intersection of data lakes and data warehouses. Prior to joining AWS, Mark held product leadership roles with Dremio and Vertica. He is passionate about data analytics and empowering customers to change the world with their data.
Asser Moustafa is a Principal Worldwide Specialist Solutions Architect at AWS, based in Dallas, Texas. He partners with customers worldwide, advising them on all facets of their data architectures, migrations, and strategic data visions to help organizations adopt cloud-based solutions, maximize the value of their data assets, modernize legacy infrastructures, and implement cutting-edge capabilities like machine learning and advanced analytics. Prior to joining AWS, Asser held various data and analytics leadership roles, completing an MBA from New York University and an MS in Computer Science from Columbia University in New York. He is passionate about empowering organizations to become truly data-driven and unlock the transformative potential of their data.