Twilio is a customer engagement platform that powers real-time, personalized customer experiences for leading brands through APIs that democratize communications channels like voice, text, chat, and video.
At Twilio, we manage a 20 PB-scale Amazon Simple Storage Service (Amazon S3) data lake that serves the analytics needs of over 1,500 users, processing 2.5 million queries monthly and scanning an average of 85 PB of data. To meet our growing demands for scalability, emerging technology support, and data mesh architecture adoption, we built Odin, a multi-engine query platform that provides an abstraction layer built on top of Presto Gateway.
In this post, we discuss how we designed and built Odin, combining Amazon Athena with open-source Presto to create a flexible, scalable data querying solution.
A growing need for a multi-engine platform
Our data platform has been built on Presto since its inception, but over the years, as we expanded to support multiple business lines and diverse use cases, we began to encounter challenges related to scalability, operational overhead, and cost management. Maintaining the platform through frequent version upgrades also became difficult. These upgrades required significant time to evaluate backwards compatibility, integrate with our existing data ecosystem, and determine optimal configurations across releases.
The administrative burden of upgrades and our commitment to minimizing user disruption caused our Presto version to fall behind. This prevented us from accessing the latest features and optimizations available in later releases. The adoption of Apache Hudi for our transaction-dependent critical workloads created a new requirement that our existing Presto deployment version couldn't support. We needed an up-to-date Presto- or Trino-compatible service to accommodate these use cases while still reducing the operational overhead of maintaining our own query infrastructure.
Building a comprehensive data platform required us to balance multiple competing requirements and business constraints. We needed a solution that could support diverse workload types, from interactive analytics to ETL batch processing, while providing the flexibility to optimize compute resources based on specific use cases. We also wanted to improve cost management and attribution in our shared multi-tenant query platform. Additionally, we needed to make sure that adopting any new technology didn't cause disruption to our users and maintained backward compatibility with existing systems during the transition period.
Selecting Amazon Athena as our modern analytics engine
Our users relied on SQL for interactive analysis, and we wanted to preserve this experience and reuse our existing jobs and application code. This meant we needed a Presto-compatible analytics service to modernize our data platform.
Amazon Athena is a serverless interactive query service built on Presto and Trino that lets you run queries using a familiar ANSI SQL interface. Athena appealed to us due to its compatibility with open-source Trino and its seamless upgrade experience. Athena helps ease the burden of managing a large-scale query infrastructure, and with provisioned capacity, offers predictable and scalable pricing for our largest query workloads. Athena workgroups provided the query and cost management capabilities we needed to efficiently support diverse teams and workload patterns with minimal overhead.
The ability to combine on-demand and dedicated serverless capacity models allows us to optimize workload distribution for our requirements, achieving the flexibility and scalability needed in a managed query environment. To handle latency-sensitive and predictable query workloads, we adopted provisioned capacity for its serverless capacity guarantee and workload concurrency control features. For queries that are ad hoc and more flexible in scheduling, we opted for the cost-efficient multi-tenant on-demand model, which optimizes resource utilization through shared infrastructure. In parallel with migrating workloads to Athena, we also needed a way to support legacy workloads that use custom implementations of Presto features. This requirement drove us to abstract the underlying implementation, allowing us to present users with a unified interface. This gives us the flexibility to future-proof our infrastructure and use the most appropriate compute for each workload and use case.
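As a rough sketch of how this split between capacity models can be expressed against the Athena APIs (the workgroup names, reservation name, DPU count, and Region below are illustrative placeholders, not our actual configuration):

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# On-demand workgroup for ad hoc, schedule-flexible queries (pay per data scanned).
athena.create_work_group(
    Name="adhoc-analytics",  # illustrative name
    Description="Multi-tenant on-demand workgroup for ad hoc queries",
)

# Dedicated workgroup for latency-sensitive, predictable workloads.
athena.create_work_group(
    Name="critical-reports",  # illustrative name
    Description="Workgroup backed by provisioned capacity",
)

# Reserve dedicated DPUs and pin the critical workgroup to the reservation.
athena.create_capacity_reservation(Name="critical-reservation", TargetDpus=24)
athena.put_capacity_assignment_configuration(
    CapacityReservationName="critical-reservation",
    CapacityAssignments=[{"WorkGroupNames": ["critical-reports"]}],
)
```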
The birth of Odin
The following diagram shows Twilio's multi-engine query platform, featuring both Amazon Athena and open-source Presto.

High-level architecture of Odin's query engines
Odin is a Presto-based gateway built on Zuul, an open-source L7 application gateway developed by Netflix. Zuul had already demonstrated its scalability at Twilio, having been successfully adopted by other internal teams. Because end users primarily connect to the platform through a JDBC connector using the Presto driver (which operates through HTTP calls), Zuul's specialization in HTTP call management made it an ideal technical choice for our needs.
Odin functions as a central hub for query processing, using a pluggable design that accommodates various query frameworks for maximum extensibility and flexibility. To interact with the Odin platform, users are initially directed to an Application Load Balancer that sits in front of the Odin instances running on Amazon EC2. The Odin instances handle authentication, routing, and the entire query workflow throughout the query's lifetime. Amazon ElastiCache for Redis handles query monitoring for Athena, and Amazon DynamoDB is responsible for maintaining the query history. Both query engines, Amazon Athena and the Presto clusters running on Amazon EC2, use the AWS Glue Data Catalog as the metastore repository and query data from our Amazon S3-based data lake.
Routing queries to multiple engines
We had a variety of use cases being served by this query platform, so we opted to use Amazon Athena as our primary query engine while continuing to route certain legacy workloads to our Presto clusters. Prior to our architectural redesign, we encountered operational challenges because our end users were tightly bound to specific Presto clusters, which led to inevitable disruptions during maintenance windows. Additionally, users frequently overloaded individual clusters with diverse workloads ranging from lightweight ad hoc analytics to complex data warehousing queries and resource-intensive ETL processes. This prompted us to implement a more sophisticated routing solution, one that was use case focused and not tightly bound to the specific underlying compute.
To enable routing across multiple query engines within the same platform, we developed a query hint mechanism that allows users to specify their intended use case. Users append this hint to the JDBC string via the X-Presto-Extra-Credential header, which Odin's logical routing layer then evaluates alongside several factors, including user identity, query origin, and fallback planning. The system also assesses whether the target resource has sufficient capacity; if not, it reroutes the query to an alternate resource with available capacity. While users provide initial context through their hints, Odin makes the final routing decisions intelligently on the server side. This approach balances user input with centralized orchestration, ensuring consistent performance and resource availability.
For example, a user might specify a connection string like the following when connecting to the Odin platform from a Tableau client (the host name and hint keys shown here are illustrative, not our actual values):
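```
jdbc:presto://odin.example.internal:443/hive/default?SSL=true&extraCredentials=engine:athena;usecase:tableau
```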
The connection string uses the extraCredentials header to signal execution on Athena, where Odin validates query submission details, including the submitting user and tool, before determining the appropriate Athena workgroup for initial routing. Because this Tableau data source and user qualify as "critical queries," the system routes them to a workgroup backed by capacity reservations. However, if that workgroup has too many pending queries in the execution queue, Odin's routing logic automatically redirects to other workgroups with better available resources. When necessary, queries may ultimately route to workgroups running on on-demand capacity. Through this fallback logic, Odin provides built-in load balancing at the routing layer, ensuring optimal utilization across the underlying compute infrastructure.
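The following Python sketch shows the shape of this capacity-aware fallback; the hint values, workgroup lists, Redis key scheme, and queue-depth threshold are all hypothetical simplifications, not Odin's actual rules:

```python
import redis

# Real-time queue depths per workgroup, maintained by Odin's monitoring system.
queue_depths = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Ordered fallback chains per use case: capacity-reservation-backed workgroups
# first, then on-demand workgroups as a last resort. Names are illustrative.
ROUTING_TABLE = {
    "tableau": ["critical-reports", "critical-reports-2", "adhoc-analytics"],
    "etl": ["etl-provisioned", "adhoc-analytics"],
}
MAX_PENDING = 20  # hypothetical per-workgroup queue-depth threshold


def route_query(usecase_hint: str, default: str = "adhoc-analytics") -> str:
    """Pick the first workgroup in the fallback chain with available capacity."""
    for workgroup in ROUTING_TABLE.get(usecase_hint, [default]):
        pending = int(queue_depths.get(f"queue_depth:{workgroup}") or 0)
        if pending < MAX_PENDING:
            return workgroup
    return default  # every preferred workgroup is saturated; fall back to on-demand
```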
Here is an example workflow of how our queries are routed to Athena workgroups:

Once a query has been submitted to a workgroup for execution, Odin also logs the routing decision in our monitoring system, built on Amazon ElastiCache for Redis, so that Odin's routing logic can maintain real-time awareness of queue depths across all Athena workgroups. Additionally, Odin uses Amazon EventBridge to integrate with Amazon Athena to keep track of query state changes and create event-based workflows. Our Redis-based query monitoring system effectively handles edge cases, such as when a JDBC client terminates mid-query. Even during such unexpected interruptions, the platform consistently maintains and updates the accurate state of the query.
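A minimal sketch of what such an event-based workflow can look like, assuming a Lambda function subscribed to Athena's query state change events in EventBridge (the Redis key scheme is our illustrative choice, not Odin's actual one):

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def handle_athena_state_change(event, context):
    """Lambda target for an EventBridge rule matching source=aws.athena,
    detail-type='Athena Query State Change'."""
    detail = event["detail"]
    state = detail["currentState"]
    workgroup = detail["workgroupName"]

    # Decrement the workgroup's queue depth once the query reaches a terminal
    # state, even if the submitting JDBC client disconnected mid-query.
    if state in TERMINAL_STATES:
        r.decr(f"queue_depth:{workgroup}")
        r.hset(f"query:{detail['queryExecutionId']}", "state", state)
```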
Query history
Following successful query routing to either an Athena workgroup or one of our open-source Presto clusters, Odin persists the query identifier and destination endpoint in a query history table in DynamoDB. This design uses a RESTful architecture where initial query submissions operate as POST requests, while subsequent status checks function as GET requests that use DynamoDB as the authoritative lookup mechanism to locate and poll the appropriate execution engine. By centralizing query execution information in DynamoDB rather than maintaining state on individual servers, we've created a truly stateless system where incoming requests can be handled by any Amazon EC2 instance hosting our Odin web service.
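A minimal sketch of this pattern with boto3 (the table name and attribute schema are assumptions for illustration, not our actual design):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
history = dynamodb.Table("odin-query-history")  # illustrative table name


def record_query(query_id: str, engine_endpoint: str, workgroup: str) -> None:
    """POST path: persist the routing decision when a query is submitted."""
    history.put_item(
        Item={
            "query_id": query_id,          # partition key (assumed schema)
            "endpoint": engine_endpoint,   # Athena or a specific Presto cluster
            "workgroup": workgroup,
        }
    )


def lookup_query(query_id: str) -> dict:
    """GET path: any Odin instance can resolve where to poll for status."""
    return history.get_item(Key={"query_id": query_id})["Item"]
```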
Lessons learned
The transition from open-source Presto to Athena required some adaptation time, due to subtle differences in how these query engines operate. Because our Odin framework was built on the Presto driver, we needed to modify our processing approach to ensure compatibility between both systems.
As we began to adopt Athena for more use cases, we noticed a difference in the record counts between Athena and the original Presto queries. We discovered this was because open-source Presto returns results with every page containing a header row, whereas Athena results contain the header row only on the first page, with subsequent pages containing records only. This difference meant that for a 60-page result set, Athena would return 59 fewer rows than open-source Presto. Once we identified this pagination behavior, we updated Odin's result handling logic to correctly interpret and process Athena's format, so that queries return accurate results.
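For illustration, here is a minimal Python sketch of handling that pagination behavior with the GetQueryResults paginator; this is not Odin's actual code, which operates at the Presto driver layer:

```python
import boto3

athena = boto3.client("athena")


def fetch_rows(query_execution_id: str):
    """Yield data rows, skipping the column-header row that Athena returns
    only on the first page of results."""
    paginator = athena.get_paginator("get_query_results")
    first_page = True
    for page in paginator.paginate(QueryExecutionId=query_execution_id):
        rows = page["ResultSet"]["Rows"]
        if first_page:
            rows = rows[1:]  # drop the header row on the first page only
            first_page = False
        for row in rows:
            # NULL cells omit VarCharValue entirely, hence .get()
            yield [col.get("VarCharValue") for col in row["Data"]]
```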
Due to the nature of the Odin platform, most of our interactions with the Athena service are API driven, so we use the ResultSet object from the GetQueryResults API to retrieve query execution data. Using this mechanism, the API returns all data as the VARCHAR data type, even for complex types such as row, map, or array. This created a challenge because Odin uses the Presto driver for query parsing, resulting in a type mismatch between the expected formats and the actual returned data. To address this, we implemented a translation layer within the Odin framework that converts all data types to VARCHAR and handles any downstream implications of this conversion internally.
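A simplified sketch of that idea: rather than reconstructing the original types, the layer reports every column to the Presto driver as VARCHAR, so the declared types match the string values that GetQueryResults actually returns (the dictionary shape below is a simplified stand-in for the Presto wire protocol, not Odin's real implementation):

```python
def varchar_column_signature(results_page: dict) -> list[dict]:
    """Declare every column as VARCHAR in the response handed back through
    the Presto driver, matching the strings GetQueryResults returns."""
    column_info = results_page["ResultSet"]["ResultSetMetadata"]["ColumnInfo"]
    return [{"name": col["Name"], "type": "varchar"} for col in column_info]
```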
These technical adjustments, while initially challenging, highlighted the importance of carefully managing the subtle differences between query execution engines when building a unified data platform.
Scale of Odin and looking ahead
The Odin platform serves over 1,500 users who execute approximately 80,000 queries daily, totaling 2.5 million queries per month. Odin also powers more than 5,000 business intelligence (BI) reports and dashboards for Tableau and Looker. These queries run across our multi-engine landscape: more than 30 Athena workgroups, based on both provisioned capacity and on-demand, and four Presto clusters on EC2 instances with Auto Scaling enabled that run an average of 180 instances each. As Twilio continues to experience rapid growth, the Odin platform has enabled us to mature our technology stack by both upgrading existing compute resources and integrating new technologies, all without disrupting the experience for our end users. While Odin serves as our foundation, we're excited to continue growing this pluggable infrastructure. Our roadmap includes migrating our self-managed open-source Presto implementation to Trino on Amazon EMR, introducing Apache Spark as a compute engine via Amazon EMR Serverless or AWS Glue jobs, and integrating generative AI capabilities to intelligently route queries across Odin's various compute options.
Conclusion
In this post, we shared how we built Odin, our unified multi-engine query platform. By combining AWS services like Amazon Athena, Amazon ElastiCache for Redis, and Amazon DynamoDB with our open-source technology stack, we created a transparent abstraction layer for users. This integration has resulted in a highly available and resilient platform that serves our query processing needs.
By embracing this multi-engine approach, we not only solved our query infrastructure challenges but also established a flexible foundation that will continue to evolve with our data needs, ensuring we can deliver powerful insights at scale regardless of how technology trends shift in the future.
To learn more and get started using Amazon Athena, see the Athena User Guide.
