This post was co-authored by DeNA Co., Ltd. and Amazon Web Services Japan.
DeNA Co., Ltd. (DeNA) engages in a wide variety of businesses, from games and live communities to sports & the community and healthcare & medical, under our mission to delight people beyond their wildest dreams. Among these, the healthcare & medical business handles particularly sensitive data. To comply with its data policies for sensitive data, the healthcare & medical business set the following requirements for its data processing:
- Process data in compliance with data policies – Mask or delete sensitive data as necessary to transform it into anonymized data. Prevent invalid values in categorical data, and process data without any data loss.
- Conduct data quality tests on anonymized data in compliance with data policies – Conduct data quality tests to quickly identify and address data quality issues, maintaining high-quality data at all times.
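The first requirement can be illustrated with a minimal sketch. It is not DeNA's implementation: the column names, the key, and the choice of a keyed SHA-256 hash for masking are all assumptions made for illustration, and whether keyed hashing satisfies a given data policy is a policy decision. The sketch drops one sensitive column, masks another, and produces exactly one output row per input row so that no records are lost.

```python
import hashlib
import hmac

# Hypothetical key and column names, chosen only for illustration.
SECRET_KEY = b"replace-with-a-managed-secret"
DROP_COLUMNS = {"full_name"}   # deleted outright
MASK_COLUMNS = {"patient_id"}  # replaced by a keyed hash


def mask(value: str) -> str:
    """Return a keyed SHA-256 digest so the raw value is not stored."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize(rows: list[dict]) -> list[dict]:
    """Mask or drop sensitive columns; every input row yields one output row."""
    result = []
    for row in rows:
        cleaned = {}
        for column, value in row.items():
            if column in DROP_COLUMNS:
                continue  # delete the sensitive column
            cleaned[column] = mask(str(value)) if column in MASK_COLUMNS else value
        result.append(cleaned)
    return result


records = [
    {"patient_id": "P001", "full_name": "Example Name", "sex": "F"},
    {"patient_id": "P002", "full_name": "Another Name", "sex": "M"},
]
anonymized = anonymize(records)
print(len(records) == len(anonymized))  # row count is preserved
```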
This post introduces a case study in which DeNA combined Amazon Redshift Serverless and dbt (dbt Core) to accelerate data quality tests in its business.
The challenge
Data quality tests require performing 1,300 tests on 10 TB of data monthly. Previously, DeNA ran Python-based batch jobs on Amazon Elastic Compute Cloud (Amazon EC2) to perform these data quality tests. As business and data volume grew over time, DeNA started to face the following challenges:
- Performance – Data quality tests took days to weeks to complete because engineers hadn’t designed the batch jobs to handle large data.
- Cost – Costs increased because of the batch job design, particularly for large datasets. The implementation required loading data into memory for processing. When handling large table data, DeNA needed to use large memory-optimized EC2 instances.
- Maintainability – The batch job implementations varied significantly between engineers, leading to high maintenance overhead, because the required knowledge was siloed among individual engineers.
The change to Redshift Serverless and dbt
To address these challenges, DeNA decided to adopt Redshift Serverless and dbt (an open source data transformation tool) for the following key reasons:
- Scalable and cost-effective processing with Redshift Serverless
- Standardized and maintainable data quality tests with dbt
This decision was made after careful comparison of alternative solutions. DeNA initially considered parallelizing the existing Python-based batch jobs but rejected this approach because of the high maintenance overhead and siloed knowledge associated with the batch jobs. Instead, DeNA decided to use dbt, which it had already been using in its healthcare & medical business, and connect it to an AWS service capable of large-scale distributed processing. dbt provides a SQL-first templating engine for repeatable and extensible data transformations, including a data tests feature, which allows verifying data models and tables against expected rules and conditions using SQL. By using dbt, DeNA could standardize the technical stack, implement data quality tests in maintainable SQL, and connect dbt to a managed service for scalable and cost-effective processing.
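To make the data tests idea concrete: in dbt, a data test is a SQL query that selects the records violating a rule, and the test passes when the query returns no rows. The following minimal sketch imitates that pattern in plain Python. It is not DeNA's code: the table name, the columns, and the use of an in-memory SQLite database as a stand-in for Redshift Serverless are assumptions for illustration only.

```python
import sqlite3

# In-memory SQLite stands in for the data warehouse (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE checkups (checkup_id INTEGER, sex TEXT);
    INSERT INTO checkups VALUES (1, 'F'), (2, 'M'), (3, 'X'), (NULL, 'F');
    """
)

# Each test is a SELECT that returns the rows violating a rule.
# The table and test names are hypothetical.
DATA_TESTS = {
    "not_null_checkup_id": "SELECT * FROM checkups WHERE checkup_id IS NULL",
    "accepted_values_sex": "SELECT * FROM checkups WHERE sex NOT IN ('F', 'M')",
}


def run_data_tests(connection, tests):
    """Run every test query and map test name -> number of failing rows."""
    return {name: len(connection.execute(sql).fetchall()) for name, sql in tests.items()}


results = run_data_tests(conn, DATA_TESTS)
for name, failures in results.items():
    status = "PASS" if failures == 0 else "FAIL"
    print(f"{name}: {status} ({failures} failing rows)")
```

In an actual dbt project, rules like these are typically declared with the built-in `not_null` and `accepted_values` generic tests rather than written out as queries, and the warehouse executes the compiled SQL, so no table data has to be loaded into the memory of the process that runs the tests.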
AWS offers several services that are compatible with dbt, including Amazon Redshift and AWS Glue. DeNA selected Redshift Serverless, primarily because of its serverless nature, optimal cost-performance, and the superior processing performance for structured data typical of a data warehouse service.
Solution overview
DeNA designed the following architecture using AWS serverless services.
The workflow consists of the following high-level steps and key design points:
- The source system stores the target data for the data quality tests in Amazon Simple Storage Service (Amazon S3). When new data files are added, Amazon EventBridge invokes an AWS Step Functions state machine (workflow). To confirm that all files for the target data have been delivered, the source system stores a completion file in Amazon S3.
- dbt runs on Amazon Elastic Container Service (Amazon ECS) using AWS Fargate, an AWS serverless container service. DeNA selected Amazon ECS because it allows running dbt in a serverless, pay-per-use manner, and DeNA had prior experience developing and operating applications using Amazon ECS. To allow the containers to securely access Redshift Serverless, DeNA used the Amazon ECS feature for passing sensitive data to a container: credentials stored in AWS Secrets Manager are passed to the containers through an ECS task execution IAM role.
- DeNA segmented Redshift Serverless into separate workgroups for access control. Operations personnel may need to access the Redshift Serverless database using the Query Editor V2 to investigate issues with data quality tests, while strict access control is maintained. Redshift Serverless allows fine-grained access control to data by using database security features, similar to how the GRANT command is used in database products. However, in this workload, DeNA chose to use AWS Identity and Access Management (IAM) to control access to the workgroups at the IAM level. This allowed DeNA to restrict access to specific Redshift Serverless workgroups based on users’ IAM roles, enabling unified management of authorization through IAM. Additionally, by separating the workgroups, DeNA could adjust Redshift Processing Units (RPUs) individually per workgroup, contributing to cost optimization.
- Amazon ECS sends the dbt execution logs to Amazon CloudWatch Logs for observability. DeNA used metric filters to convert the logs into CloudWatch metrics, then created alarms based on these metrics. When triggered, these alarms invoke AWS Lambda functions using Amazon Simple Notification Service (Amazon SNS). The Lambda functions create result reports of the dbt runs and data quality tests and send them to an internal chat tool. DeNA visualizes the results of data quality tests using the elementary CLI, a dbt-based data observability solution. This workflow enables even non-engineers to track data quality status effectively.
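The completion file in the first step acts as a gate: tests should start only after the source system signals that delivery is finished. The following is a minimal sketch of such a check, not DeNA's actual workflow logic. The marker name `_COMPLETE`, the key layout, and the function name are hypothetical, and the function works on a plain list of object keys (for example, the result of an S3 listing) so that it stays runnable without AWS access.

```python
COMPLETION_FILE = "_COMPLETE"  # hypothetical marker name


def ready_for_tests(object_keys: list[str], prefix: str) -> bool:
    """Return True when the prefix holds the completion marker and at least one data file."""
    under_prefix = [key for key in object_keys if key.startswith(prefix)]
    marker = prefix + COMPLETION_FILE
    has_marker = marker in under_prefix
    has_data = any(key != marker for key in under_prefix)
    return has_marker and has_data


# Hypothetical object keys for one monthly delivery.
partial_delivery = ["2024-06/part-0001.parquet", "2024-06/part-0002.parquet"]
full_delivery = partial_delivery + ["2024-06/_COMPLETE"]

print(ready_for_tests(partial_delivery, "2024-06/"))  # marker missing, so not ready
print(ready_for_tests(full_delivery, "2024-06/"))     # marker and data present
```

A check like this could run as an early state in the workflow, so that file-arrival events received before the marker exists end without starting the dbt task.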
Results
DeNA successfully addressed all the challenges they faced by designing the solution and migrating to a new platform:
- Performance – Improved performance by up to 100 times, reducing processing time from days or weeks to 1–2 hours. One data quality test that previously took 877 minutes now completes in 1 minute, thanks to the large-scale distributed processing capabilities of Redshift Serverless.
- Cost – Reduced costs by 90% with AWS serverless services. Optimized expenses by incurring costs only for data quality tests.
- Maintainability – Standardized the technical stack with dbt, eliminating siloed knowledge from custom programs. dbt’s data tests feature simplified the implementation of data quality tests. The elementary CLI improved the observability of data quality tests for non-engineers. AWS serverless services virtually eliminated the operational overhead of managing the workload infrastructure.
Conclusion
This post demonstrated how DeNA was able to securely and efficiently accelerate their data quality tests by combining Redshift Serverless and dbt. This combination is not only effective for DeNA’s use case but also applicable to various business use cases across different industries.
For more information on the combination of Redshift Serverless and dbt, refer to the following resources:
About the authors
Momota Sasaki is an Engineering Manager at DeSC Healthcare, a subsidiary of DeNA. He joined DeNA in 2021 and was seconded to DeSC Healthcare. Since then, he has been continuously involved in the healthcare business, leading and promoting the development and operation of the data platform.
Kaito Tawara is a Data Engineer at DeSC Healthcare, a subsidiary of DeNA, specializing in enhancing healthcare data platforms. After gaining experience in backend development for web systems and data science, he transitioned to data engineering. He joined DeNA in 2023 and was seconded to DeSC Healthcare. Currently, he works remotely from Nagoya-city, contributing to the enhancement of healthcare data platforms.
Shota Sato is an Analytics Specialist Solutions Architect at AWS Japan, specializing in data analytics solutions powered by AWS for digital native business customers.