This post is written in collaboration with Philipp Karg from BMW Group.
Companies increasingly require scalable, cost-efficient architectures to process and transform large datasets. At the BMW Group, our Cloud Efficiency Analytics (CLEA) team has developed a FinOps solution to optimize costs across over 10,000 cloud accounts. While enabling organization-wide efficiency, the team also applied these principles to the data architecture, making sure that CLEA itself operates frugally. After evaluating various tools, we built a serverless data transformation pipeline using Amazon Athena and dbt.
This post explores our journey, from the initial challenges to our current architecture, and details the steps we took to achieve a highly efficient, serverless data transformation setup.
Challenges: Starting from a rigid and costly setup
In our early stages, we encountered several inefficiencies that made scaling difficult. We were managing complex schemas with wide tables that required significant maintenance effort. Initially, we used Terraform to create tables and views in Athena, allowing us to manage our data infrastructure as code (IaC) and automate deployments through continuous integration and delivery (CI/CD) pipelines. However, this approach slowed us down when changing data models or dealing with schema changes, requiring high development effort.
As our solution grew, we faced challenges with query performance and costs. Each query scanned large amounts of raw data, resulting in increased processing time and higher Athena costs. We used views to provide a clean abstraction layer, but this masked underlying complexity, because seemingly simple queries against these views scanned large volumes of raw data, and our partitioning strategy wasn't optimized for these access patterns. As our datasets grew, the lack of modularity in our data design increased complexity, making scalability and maintenance increasingly difficult. We needed a solution for pre-aggregating, computing, and storing the query results of computationally intensive transformations. The absence of robust testing and lineage features made it challenging to identify the root causes of data inconsistencies when they occurred.
As part of our business intelligence (BI) solution, we used Amazon QuickSight to build our dashboards, providing visual insights into our cloud cost data. However, our initial data architecture led to challenges. We were building dashboards on top of large, wide datasets, with some hitting the QuickSight per-dataset SPICE limit of 1 TB. Additionally, during SPICE ingestion, our largest datasets required 4–5 hours of processing time because they performed full scans each time, often scanning over a terabyte of data. This architecture wasn't helping us be more agile and fast while scaling up. The long processing times and storage limitations hindered our ability to provide timely insights and expand our analytics capabilities.
To address these issues, we enhanced the data architecture with AWS Lambda, AWS Step Functions, AWS Glue, and dbt. This tool stack significantly improved our development agility, empowering us to quickly modify and introduce new data models. At the same time, we improved our overall data processing efficiency with incremental loads and better schema management.
Solution overview
Our current architecture consists of a serverless and modular pipeline coordinated by GitHub Actions workflows. We chose Athena as our primary query engine for several strategic reasons: it aligns well with our team's SQL expertise, excels at querying Parquet data directly in our data lake, and alleviates the need for dedicated compute resources. This makes Athena an ideal match for CLEA's architecture, where we process around 300 GB daily from a data lake of 15 TB, with our largest dataset containing 50 billion rows across up to 400 columns. Athena's ability to efficiently query large-scale Parquet data, combined with its serverless nature, enables us to focus on writing efficient transformations rather than managing infrastructure.
The following diagram illustrates the solution architecture.
Using this architecture, we've streamlined our data transformation process with dbt. In dbt, a data model represents a single SQL transformation that creates either a table or a view, essentially a building block of our data transformation pipeline. Our implementation includes around 400 such models, 50 data sources, and around 100 data tests. This setup enables seamless updates, whether creating new models, updating schemas, or modifying views, triggered simply by creating a pull request in our source code repository, with the rest handled automatically.
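As a minimal sketch, a dbt model of this kind can look like the following; the source, table, and column names here are hypothetical illustrations, not CLEA's actual models:

```sql
-- models/prepared/prepared_cost_records.sql (hypothetical names throughout)
-- A dbt model is a single SELECT statement; dbt issues the Athena DDL to
-- materialize it as a view or table and registers it in the Glue Data Catalog.
{{ config(materialized='view') }}

select
    account_id,
    service,
    cast(usage_date as date) as usage_date,
    sum(unblended_cost)      as daily_cost
from {{ source('billing', 'raw_cost_records') }}
group by 1, 2, 3
```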
Our workflow automation includes the following features:
- Pull request – When we create a pull request, it's deployed to our testing environment first. After passing validation and being approved and merged, it's deployed to production using GitHub workflows.
- Cron scheduler – For nightly runs or multiple daily runs to reduce data latency, we use scheduled GitHub workflows. This setup allows us to configure specific models with different update strategies based on data needs. We can set models to update incrementally (processing only new or changed data), as views (querying without materializing data), or as full loads (completely refreshing the data); see the sketch after this list for an incremental example. This flexibility optimizes processing time and resource usage. We can target only specific folders, such as the source, prepared, or semantic layers, and run dbt tests afterward to validate model quality.
- On demand – When adding new columns or changing business logic, we need to update historical data to maintain consistency. For this, we use a backfill process, a custom GitHub workflow created by our team. The workflow allows us to select specific models, include their upstream dependencies, and set parameters like start and end dates. This makes sure changes are applied accurately across the entire historical dataset, maintaining data consistency and integrity.
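To make these update strategies concrete, the following is a simplified sketch of an incremental model as supported by the dbt-athena adapter. The model name, columns, and three-day window are illustrative assumptions, not our production configuration:

```sql
-- models/semantic/daily_account_costs.sql (illustrative)
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partitioned_by=['usage_date']
) }}

select
    account_id,
    sum(daily_cost) as total_cost,
    usage_date  -- Athena requires partition columns last in the select list
from {{ ref('prepared_cost_records') }}
{% if is_incremental() %}
  -- on scheduled incremental runs, only recent partitions are rewritten;
  -- a full refresh (dbt run --full-refresh) rebuilds the entire table
  where usage_date >= date_add('day', -3, current_date)
{% endif %}
group by account_id, usage_date
```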
Our pipeline is organized into three main stages (Source, Prepared, and Semantic), each serving a specific purpose in our data transformation journey. The Source stage maintains raw data in its original form. The Prepared stage cleanses and standardizes this data, handling tasks like deduplication and data type conversions. The Semantic stage transforms this prepared data into business-ready models aligned with our analytical needs. An additional QuickSight step handles visualization requirements. To achieve low cost and high performance, we use dbt models and SQL code to manage all transformations and schema changes. By implementing incremental processing strategies, our models process only new or changed data rather than reprocessing the entire dataset with each run.
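For example, a Prepared-stage model often reduces to a deduplication and type-casting pattern like this sketch (again with hypothetical names):

```sql
-- models/prepared/prepared_usage_events.sql (hypothetical)
{{ config(materialized='view') }}

with ranked as (
    select
        *,
        -- keep only the latest ingested copy of each source record
        row_number() over (
            partition by event_id
            order by ingested_at desc
        ) as rn
    from {{ source('raw', 'usage_events') }}
)
select
    event_id,
    cast(event_time as timestamp) as event_time,
    cast(cost as double)          as cost
from ranked
where rn = 1
```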
The Semantic stage (not to be confused with dbt's semantic layer feature) introduces business logic, transforming data into aggregated datasets that are directly consumable by BMW's Cloud Data Hub, internal CLEA dashboards, data APIs, or the In-Console Cloud Assistant (ICCA) chatbot. The QuickSight step further optimizes data by selecting only the necessary columns through a column-level lineage solution and setting a dynamic date filter with a sliding window to ingest only relevant hot data into SPICE, avoiding unused data in dashboards or reports.
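A QuickSight-facing view of this kind might look like the following sketch, where the 90-day window and all names are assumptions for illustration:

```sql
-- models/quicksight/qs_daily_costs.sql (hypothetical)
-- Thin view for SPICE ingestion: only the columns the dashboards need,
-- restricted to a sliding window of recent "hot" data.
{{ config(materialized='view') }}

select
    account_id,
    service,
    usage_date,
    total_cost
from {{ ref('daily_account_costs') }}
where usage_date >= date_add('day', -90, current_date)
```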
This approach aligns with the BMW Group's broader data strategy, which includes streamlining data access using AWS Lake Formation for fine-grained access control.
Overall, as a high-level structure, we've fully automated schema changes, data updates, and testing through GitHub pull requests and dbt commands. This approach enables controlled deployment with robust version control and change management. Continuous testing and monitoring workflows uphold data accuracy, reliability, and quality across transformations, supporting efficient, collaborative model iteration.
Key benefits of the dbt-Athena architecture
To design and manage dbt models effectively, we use a multi-layered approach combined with cost and performance optimizations. In this section, we discuss how our approach has yielded significant benefits in five key areas.
SQL-based, developer-friendly environment
Our team already had strong SQL skills, so dbt's SQL-centric approach was a natural fit. Instead of learning a new language or framework, developers could immediately start writing transformations using familiar SQL syntax with dbt. This familiarity aligns well with the SQL interface of Athena and, combined with dbt's added functionality, has increased our team's productivity.
Behind the scenes, dbt automatically handles synchronization between Amazon Simple Storage Service (Amazon S3), the AWS Glue Data Catalog, and our models. When we need to change a model's materialization type (for example, from a view to a table), it's as simple as updating a configuration parameter rather than rewriting code. This flexibility has reduced our development time dramatically, allowing us to focus on building better data models rather than managing infrastructure.
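For instance, promoting a hypothetical model from a view to a materialized table is a one-parameter change, and the SELECT below it stays untouched:

```sql
-- models/semantic/service_costs.sql (illustrative)
{{ config(materialized='table') }}  -- previously: materialized='view'

select
    service,
    sum(total_cost) as total_cost
from {{ ref('daily_account_costs') }}
group by service
```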
Agility in modeling and deployment
Documentation is crucial for any data platform's success. We use dbt's built-in documentation capabilities by publishing them to GitHub Pages, which creates an accessible, searchable repository of our data models. This documentation includes table schemas, relationships between models, and usage examples, enabling team members to understand how models interconnect and how to use them effectively.
We use dbt's built-in testing capabilities to implement comprehensive data quality checks. These include schema tests that verify column uniqueness, referential integrity, and null constraints, as well as custom SQL tests that validate business logic and data consistency. The testing framework runs automatically on every pull request, validating data transformations at each step of our pipeline. Additionally, dbt's dependency graph provides a visual representation of how our models interconnect, helping us understand the upstream and downstream impacts of any changes before we implement them. When stakeholders need to modify models, they can submit changes through pull requests, which, after they're approved and merged, automatically trigger the necessary data transformations through our CI/CD pipeline. This streamlined process enabled us to create new data products within days instead of weeks, and reduced ongoing maintenance work by catching issues early in the development cycle.
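A custom SQL test in dbt is simply a query that returns the offending rows; dbt fails the test if any row comes back. A minimal sketch, using the same hypothetical model and columns as the earlier examples:

```sql
-- tests/assert_no_negative_costs.sql (hypothetical singular test)
-- dbt treats every returned row as a test failure
select
    account_id,
    usage_date,
    total_cost
from {{ ref('daily_account_costs') }}
where total_cost < 0
```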
Athena workgroup separation
We use Athena workgroups to isolate different query patterns based on their execution triggers and purposes. Each workgroup has its own configuration and metric reporting, allowing us to monitor and optimize each one individually. The dbt workgroup handles our scheduled nightly transformations and on-demand updates triggered by pull requests through our Source, Prepared, and Semantic stages. The dbt-test workgroup executes automated data quality checks during pull request validation and nightly runs. The QuickSight workgroup manages SPICE data ingestion queries, and the Ad-hoc workgroup supports interactive data exploration by our team.
Each workgroup can be configured with specific data usage quotas, enabling teams to implement granular governance policies. This separation provides several benefits: it enables clear cost allocation, provides isolated monitoring of query patterns across different use cases, and helps enforce data governance through custom workgroup settings. Amazon CloudWatch monitoring per workgroup helps us track usage patterns, identify query performance issues, and adjust configurations based on actual needs.
Using QuickSight SPICE
QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine) provides powerful in-memory processing capabilities that we've optimized for our specific use cases. Rather than loading entire tables into SPICE, we create specialized views on top of our materialized semantic models. These views are carefully crafted to include only the necessary columns, relevant metadata joins, and appropriate time filtering so that only recent data is available in dashboards.
We've implemented a hybrid refresh strategy for these SPICE datasets: daily incremental updates keep the data fresh, and weekly full refreshes maintain data consistency. This approach strikes a balance between data freshness and processing efficiency. The result is responsive dashboards that maintain high performance while keeping processing costs under control.
Scalability and cost-efficiency
The serverless architecture of Athena eliminates manual infrastructure management, automatically scaling based on query demand. Because costs are based solely on the amount of data scanned by queries, optimizing queries to scan as little data as possible directly reduces our costs. We use the distributed query execution capabilities of Athena through our dbt model structure, enabling parallel processing across data partitions. By implementing effective partitioning strategies and using the Parquet file format, we minimize the amount of data scanned while maximizing query performance.
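For example, against a table partitioned by date and stored as Parquet, a query like the following sketch scans only the partitions and columns it references, so its cost scales with the filter rather than the table size (table and column names are illustrative):

```sql
-- Athena prunes to the January 2024 partitions and reads only the
-- referenced Parquet columns, instead of scanning the whole table
select
    service,
    sum(total_cost) as monthly_cost
from semantic.daily_account_costs
where usage_date between date '2024-01-01' and date '2024-01-31'
group by service
```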
Our architecture offers flexibility in how we materialize data through views, full tables, and incremental tables. With dbt's incremental models and partitioning strategy, we process only new or changed data instead of entire datasets. This approach has proven highly effective: we've observed significant reductions in data processing volume as well as data scanning, particularly in our QuickSight workgroup.
The effectiveness of these optimizations, implemented at the end of 2023, is visible in the following diagram, which shows costs by Athena workgroup.
The workgroups are illustrated as follows:
- Green (QuickSight): Shows reduced data scanning post-optimization.
- Light blue (Ad-hoc): Varies based on analysis needs.
- Dark blue (dbt): Maintains consistent processing patterns.
- Orange (dbt-test): Shows regular, efficient test execution.
The increased dbt workload costs directly correlate with decreased QuickSight costs, reflecting our architectural shift from using complex views in QuickSight workgroups (which previously masked query complexity but led to repeated computations) to using dbt to materialize these transformations. Although this increased the dbt workload, overall cost-efficiency improved significantly because materialized tables reduced redundant computations in QuickSight. This demonstrates how our optimization strategies successfully handle growing data volumes while achieving a net cost reduction through efficient data materialization patterns.
Conclusion
Our data architecture uses dbt and Athena to provide a scalable, cost-efficient, and flexible framework for building and managing data transformation pipelines. Athena's ability to query data directly in Amazon S3 alleviates the need to move or copy data into a separate data warehouse, and its serverless model, together with dbt's incremental processing, minimizes both operational overhead and processing costs. Given our team's strong SQL expertise, expressing these transformations in SQL through dbt and Athena was a natural choice, enabling rapid model development and deployment. With dbt's automated documentation and lineage, troubleshooting and identifying data issues is simplified, and the system's modularity allows for quick adjustments to meet evolving business needs.
Getting started with this architecture is quick and straightforward: all that's needed is the dbt-core and dbt-athena libraries, and Athena itself requires no setup because it's a fully serverless service with seamless integration with Amazon S3. This architecture is ideal for teams looking to rapidly prototype, test, and deploy data models, optimizing resource usage, accelerating deployment, and providing high-quality, accurate data processing.
For those interested in a managed solution from dbt, see From data lakes to insights: dbt adapter for Amazon Athena now supported in dbt Cloud.
About the Authors
Philipp Karg is a Lead FinOps Engineer at BMW Group with a strong background in data engineering, AI, and FinOps. He focuses on driving cloud efficiency initiatives and fostering a cost-aware culture within the company to leverage the cloud sustainably.
Selman Ay is a Data Architect specializing in end-to-end data solutions, architecture, and AI on AWS. Outside of work, he enjoys playing tennis and engaging in outdoor activities.
Cizer Pereira is a Senior DevOps Architect at AWS Professional Services. He works closely with AWS customers to accelerate their journey to the cloud. He has a deep passion for cloud-based and DevOps solutions, and in his free time, he also enjoys contributing to open source projects.