How AppsFlyer modernized their interactive workload by moving to Amazon Athena and saved 80% of costs

This post is co-written with Nofar Diamant and Matan Safri from AppsFlyer.

AppsFlyer develops a leading measurement solution focused on privacy, which enables marketers to gauge the effectiveness of their marketing activities and integrates them with the broader marketing world, managing a vast volume of 100 billion events every day. AppsFlyer empowers digital marketers to precisely identify and allocate credit to the various consumer interactions that lead up to an app install, using in-depth analytics.

Part of AppsFlyer’s offering is the Audiences Segmentation product, which allows app owners to precisely target and reengage users based on their behavior and demographics. This includes a feature that provides real-time estimation of audience sizes within specific user segments, referred to as the Estimation feature.

To provide users with real-time estimation of audience size, the AppsFlyer team originally used Apache HBase, an open-source distributed database. However, as the workload grew to 23 TB, the HBase architecture needed to be revisited to meet service level agreements (SLAs) for response time and reliability.

This post explores how AppsFlyer modernized their Audiences Segmentation product by using Amazon Athena. Athena is a powerful and versatile serverless query service provided by AWS. It’s designed to make it simple for users to analyze data stored in Amazon Simple Storage Service (Amazon S3) using standard SQL queries.

We dive into the various optimization techniques AppsFlyer employed, such as partition projection, sorting, parallel query runs, and query result reuse. We share the challenges the team faced and the strategies they adopted to unlock the true potential of Athena in a use case with low-latency requirements. Additionally, we discuss the thorough testing, monitoring, and rollout process that resulted in a successful transition to the new Athena architecture.

Audiences Segmentation legacy architecture and modernization drivers

Audience segmentation involves defining targeted audiences in AppsFlyer’s UI, represented by a directed tree structure with set operations and atomic criteria as nodes and leaves, respectively.

The following diagram shows an example of audience segmentation on the AppsFlyer Audiences management console and its translation to the tree structure described, with the two atomic criteria as the leaves and the set operation between them as the node.

Audience segmentation tool and its translation to a tree structure
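The tree structure can be illustrated with a minimal Python sketch. The class names, fields, and the af_login event are hypothetical, and plain Python sets of user IDs stand in for the sketches a real system would use; this is only meant to show how leaves and set-operation nodes compose, not how AppsFlyer implements it.

```python
from dataclasses import dataclass
from typing import Callable, Set, Union


@dataclass(frozen=True)
class Criterion:
    """Leaf: an atomic criterion (hypothetical fields)."""
    app_id: str
    event_name: str


@dataclass(frozen=True)
class SetOp:
    """Inner node: a set operation applied to two subtrees."""
    op: str  # "union", "intersection", or "difference"
    left: "Node"
    right: "Node"


Node = Union[Criterion, SetOp]


def evaluate(node: Node, resolve: Callable[[Criterion], Set[str]]) -> Set[str]:
    """Walk the tree bottom-up; `resolve` maps a leaf to its set of users."""
    if isinstance(node, Criterion):
        return resolve(node)
    left = evaluate(node.left, resolve)
    right = evaluate(node.right, resolve)
    if node.op == "union":
        return left | right
    if node.op == "intersection":
        return left & right
    if node.op == "difference":
        return left - right
    raise ValueError(f"unknown set operation: {node.op}")


# Two atomic criteria as leaves, one set operation as the node.
purchased = Criterion("com.demo", "af_purchase")
logged_in = Criterion("com.demo", "af_login")
users = {
    purchased: {"u1", "u2", "u3"},
    logged_in: {"u2", "u3", "u4"},
}
tree = SetOp("intersection", purchased, logged_in)
audience = evaluate(tree, lambda leaf: users[leaf])
```

In the real product each leaf resolves to a sketch rather than an exact set, so the same traversal yields an estimated audience size instead of an exact one.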

To provide users with real-time estimation of audience size, the AppsFlyer team used a framework called Theta Sketches, which is an efficient data structure for counting distinct elements. These sketches enhance scalability and analytical capabilities, and they were originally stored in the HBase database.
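To give a feel for how such a sketch estimates distinct counts, here is a simplified "k minimum values" sketch in plain Python. It is a teaching stand-in for the idea behind Theta Sketches, not the Theta Sketches framework itself; the sketch size K, the SHA-256 hash, and all function names are assumptions made for this example.

```python
import hashlib
from typing import Iterable, List

K = 1024  # assumed sketch size: larger K means lower error, more storage


def _hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_sketch(items: Iterable[str], k: int = K) -> List[float]:
    """Keep only the k smallest distinct hash values, sorted ascending."""
    return sorted({_hash01(item) for item in items})[:k]


def union(a: List[float], b: List[float], k: int = K) -> List[float]:
    """Merge two sketches: the k smallest hashes seen by either one."""
    return sorted(set(a) | set(b))[:k]


def estimate(sketch: List[float], k: int = K) -> float:
    """Estimate the number of distinct items behind a sketch."""
    if len(sketch) < k:
        # Fewer than k distinct hashes were seen, so the count is exact.
        return float(len(sketch))
    # If the k-th smallest of n uniform hashes is theta, n is about (k - 1) / theta.
    return (k - 1) / sketch[-1]


# Two overlapping days of users, merged without revisiting the raw events.
day1 = [f"user-{i}" for i in range(6000)]
day2 = [f"user-{i}" for i in range(4000, 10000)]
merged = union(build_sketch(day1), build_sketch(day2))
approx_distinct_users = estimate(merged)
```

The useful property is that merging two sketches gives the same result as sketching the combined data, which is what makes set operations over pre-computed sketches possible.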

HBase is an open source, distributed, columnar database, designed to handle large volumes of data across commodity hardware with horizontal scalability.

Original data structure

In this post, we focus on the events table, the largest table originally stored in HBase. The table had the schema date | app-id | event-name | event-value | sketch and was partitioned by date and app-id.

The following diagram shows the high-level original architecture of the AppsFlyer Estimations system.

High level architecture of the Estimations system

The architecture featured an Airflow ETL process that initiates jobs to create sketch files from the source dataset, followed by the import of these files into HBase. Users could then use an API service to query HBase and retrieve estimations of user counts according to the audience segment criteria set up in the UI.

To learn more about the previous HBase architecture, see Applied Probability – Counting Large Set of Unstructured Events with Theta Sketches.

Over time, the workload exceeded the size for which the HBase implementation was originally designed, reaching a storage size of 23 TB. It became apparent that in order to meet AppsFlyer’s SLA for response time and reliability, the HBase architecture needed to be revisited.

As previously mentioned, the use case involves daily customer interactions with the UI. This requires adherence to a standard UI SLA: quick response times and the capacity to handle a substantial number of daily requests, while accommodating the current data volume and potential future expansion.

Additionally, due to the high cost of operating and maintaining HBase, the aim was to find an alternative that is managed, simple, and cost-effective, and that wouldn’t significantly complicate the existing system architecture.

Following thorough team discussions and consultations with AWS experts, the team concluded that a solution using Amazon S3 and Athena stood out as the most cost-effective and simple choice. The primary concern was query latency, and the team was particularly careful to avoid any adverse effects on the overall customer experience.

The following diagram illustrates the new architecture using Athena. Notice that import-..-sketches-to-hbase and HBase were omitted, and Athena was added to query data in Amazon S3.

High level architecture of the Estimations system using Athena

Schema design and partition projection for performance enhancement

In this section, we discuss the process of schema design in the new architecture and the different performance optimization techniques that the team used, including partition projection.

Merging data for partition reduction

To evaluate whether Athena could be used to support Audiences Segmentation, an initial proof of concept was conducted. The scope was limited to events arriving from three app-ids (approximately 3 GB of data) partitioned by app-id and by date, using the same partitioning schema that was used in the HBase implementation. As the team scaled up to include the entire dataset with 10,000 app-ids for a 1-month time range (reaching approximately 150 GB of data), the team started to see more slow queries, especially for queries that spanned significant time ranges. The team dived deep and discovered that Athena spent significant time in the query planning stage due to the large number of partitions (7.3 million) that it loaded from the AWS Glue Data Catalog (for more information about using Athena with AWS Glue, see Integration with AWS Glue).

This led the team to examine partition indexing. Athena partition indexes provide a way to create metadata indexes on partition columns, allowing Athena to prune the data scan at the partition level, which can reduce the amount of data that needs to be read from Amazon S3. Partition indexing shortened the time of partition discovery in the query planning stage, but the improvement wasn’t substantial enough to meet the required query latency SLA.

As an alternative to partition indexing, the team evaluated a strategy to reduce the number of partitions by reducing data granularity from daily to monthly. This method consolidated daily data into monthly aggregates by merging day-level sketches into monthly composite sketches using the Theta Sketches union capability. For example, for a month of data, instead of having 30 rows per month, the team united those rows into a single row, effectively slashing the row count by 97%.

This method decreased the time needed for the partition discovery phase, which initially required approximately 10–15 seconds, by 30%, and it also reduced the amount of data that had to be scanned. However, the resulting latency still fell short of the goals set by the UI’s responsiveness standards.

Additionally, the merging process inadvertently compromised the precision of the data, leading to the exploration of other solutions.

Partition projection as an enhancement multiplier

At this point, the team decided to explore partition projection in Athena.

Partition projection in Athena allows you to improve query efficiency by projecting the metadata of your partitions. It virtually generates and discovers partitions as needed, without the need for the partitions to be explicitly defined in the database catalog beforehand.

This feature is particularly useful when dealing with large numbers of partitions, or when partitions are created rapidly, as in the case of streaming data.

As we explained earlier, in this particular use case, each leaf is an access pattern that is translated into a query that must contain a date range, app-id, and event-name. This led the team to define the projection columns by using the date type for the date range and the injected type for app-id and event-name.

Rather than scanning and loading all partition metadata from the catalog, Athena can generate the partitions to query using configured rules and values from the query. This avoids the need to load and filter partitions from the catalog by generating them on the fly.

The projection process helped avoid performance issues caused by a high number of partitions, eliminating the latency from partition discovery during query runs.

Because partition projection eliminated the dependency between the number of partitions and query runtime, the team could experiment with an additional partition: event-name. Partitioning by three columns (date, app-id, and event-name) reduced the amount of scanned data, resulting in a 10% improvement in query performance compared to partition projection with data partitioned only by date and app-id.

The following diagram illustrates the high-level data flow of sketch file creation. However, writing sketches into Amazon S3 (write-events-estimation-sketches) with three partition fields caused the process to run twice as long compared to the original architecture, due to an increased number of sketch files (writing 20 times more sketch files to Amazon S3).

High level data flow of Sketch file creation

This prompted the team to drop the event-name partition and compromise on two partitions: date and app-id, resulting in the following partition structure:

s3://bucket/table_root/date=${day}/app_id=${app_id}
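As a rough illustration of how a table over this layout might be declared with partition projection, the following Python helper assembles a CREATE EXTERNAL TABLE statement. The table name, column types, and the start of the date range are assumptions for this example and are not taken from AppsFlyer's setup. The projection.* and storage.location.template keys are Athena table properties; the placeholders in the template are written with the partition column names, which is why ${date} appears here instead of ${day}.

```python
from typing import Dict


def projection_properties(table_root: str, first_day: str) -> Dict[str, str]:
    """Table properties enabling projection for `date` (date type) and `app_id` (injected)."""
    root = table_root.rstrip("/")
    return {
        "projection.enabled": "true",
        "projection.date.type": "date",
        "projection.date.format": "yyyy-MM-dd",
        "projection.date.range": f"{first_day},NOW",  # assumed start of retained data
        "projection.app_id.type": "injected",
        "storage.location.template": root + "/date=${date}/app_id=${app_id}",
    }


def create_table_ddl(table: str, table_root: str, first_day: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement (column names and types are assumed)."""
    root = table_root.rstrip("/")
    props = projection_properties(root, first_day)
    prop_lines = ",\n".join(f"  '{key}' = '{value}'" for key, value in props.items())
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        "  event_name string,\n"
        "  event_value string,\n"
        "  sketch binary\n"
        ")\n"
        "PARTITIONED BY (`date` string, app_id string)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{root}/'\n"
        f"TBLPROPERTIES (\n{prop_lines}\n)"
    )


ddl = create_table_ddl("events_sketches", "s3://bucket/table_root", "2022-01-01")
```

With the injected type, each query has to supply the app_id value in its WHERE clause, which matches the access pattern described above where every leaf query names a specific app-id.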

Using Parquet file format

In the new architecture, the team used the Parquet file format. Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. Each Parquet file contains metadata, such as the minimum and maximum values of columns, that allows the query engine to skip loading unneeded data. This optimization reduces the amount of data that needs to be scanned, because Athena can skip or quickly navigate through sections of the Parquet file that are irrelevant to the query. As a result, query performance improves significantly.

Parquet is particularly effective when querying sorted fields, because it allows Athena to apply predicate pushdown optimization and quickly identify and access the relevant data segments. To learn more about this capability in the Parquet file format, see Understanding columnar storage formats.

Recognizing this advantage, the team decided to sort by event-name to enhance query performance, achieving a 10% improvement compared to non-sorted data. Initially, they tried partitioning by event-name to optimize performance, but this approach increased writing time to Amazon S3. Sorting delivered a query time improvement without the ingestion overhead.
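The effect of sorting on min/max-based skipping can be seen in a toy simulation. This is plain Python rather than Parquet itself: fixed-size chunks stand in for the sections of a file that carry min/max metadata, and the event names other than af_purchase, the row counts, and the chunk size are made up for illustration.

```python
from typing import List, Tuple


def chunk_stats(values: List[str], chunk_size: int) -> List[Tuple[str, str]]:
    """Split a column into fixed-size chunks and record (min, max) per chunk."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    return [(min(chunk), max(chunk)) for chunk in chunks]


def chunks_to_read(stats: List[Tuple[str, str]], wanted: str) -> int:
    """Count chunks whose [min, max] range could contain `wanted`."""
    return sum(1 for low, high in stats if low <= wanted <= high)


# Hypothetical event_name column: four event names interleaved, 1,000 rows.
events = ["af_purchase", "af_login", "af_level", "af_share"] * 250

unsorted_stats = chunk_stats(events, chunk_size=100)
sorted_stats = chunk_stats(sorted(events), chunk_size=100)

unsorted_reads = chunks_to_read(unsorted_stats, "af_purchase")
sorted_reads = chunks_to_read(sorted_stats, "af_purchase")
```

In the unsorted column every chunk spans the full range of event names, so none can be ruled out; after sorting, only the chunks around the af_purchase rows overlap the filter value and the rest can be skipped.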

Query optimization and parallel queries

The team discovered that performance could be improved further by running parallel queries. Instead of a single query over a long window of time, multiple queries were run over shorter windows. Even though this increased the complexity of the solution, it improved performance by about 20% on average.

For instance, consider a scenario where a user requests the estimated size of app com.demo and event af_purchase between April 2024 and the end of June 2024 (as illustrated earlier, the segmentation is defined by the user and then translated to an atomic leaf, which is then broken down into multiple queries depending on the date range). The following diagram illustrates the process of breaking down the initial 3-month query into two separate queries of up to 60 days each, running them concurrently, and then merging the results.

Splitting query by date range
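The splitting step can be sketched in Python as follows. The helper names are hypothetical, the run_query callable is a placeholder for whatever submits a query to Athena and returns its result, and plain sets stand in for the sketches that would be merged in practice.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta
from typing import Callable, List, Set, Tuple

Window = Tuple[date, date]


def split_date_range(start: date, end: date, max_days: int = 60) -> List[Window]:
    """Split the inclusive range [start, end] into consecutive windows of at most max_days days."""
    windows: List[Window] = []
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=max_days - 1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


def estimate_in_parallel(
    start: date,
    end: date,
    run_query: Callable[[date, date], Set[str]],
    max_days: int = 60,
) -> Set[str]:
    """Run one query per window concurrently and merge the partial results."""
    windows = split_date_range(start, end, max_days)
    merged: Set[str] = set()
    with ThreadPoolExecutor(max_workers=max(1, len(windows))) as pool:
        for partial in pool.map(lambda w: run_query(w[0], w[1]), windows):
            merged |= partial
    return merged


# The example from the text: April 1, 2024 through June 30, 2024.
windows = split_date_range(date(2024, 4, 1), date(2024, 6, 30))
```

For this range the helper produces two windows, April 1 to May 30 (60 days) and May 31 to June 30 (31 days), which are then queried concurrently.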

Reducing results set size

In analyzing performance bottlenecks, examining the different types and properties of the queries, and analyzing the different stages of the query run, it became clear that specific queries were slow in fetching query results. This problem wasn’t rooted in the actual query run, but in data transfer from Amazon S3 at the GetQueryResults phase, due to query results containing a large number of rows (a single result can contain millions of rows).

The initial approach of handling multiple key-value permutations in a single sketch inflated the number of rows considerably. To overcome this, the team introduced a new event-attr-key field to separate sketches into distinct key-value pairs.

The final schema looked as follows:

This refactoring resulted in a drastic reduction of result rows, which significantly expedited the GetQueryResults process, improving overall query runtime by 90%.

Athena query results reuse

To address a common use case in the Audiences Segmentation GUI where users often make subtle adjustments to their queries, such as adjusting filters or slightly altering time windows, the team used the Athena query results reuse feature. This feature improves query performance and reduces costs by caching and reusing the results of previous queries. It plays a pivotal role, particularly when taking into account the recent improvements involving the splitting of date ranges. The ability to reuse and swiftly retrieve results means that these minor, yet frequent, modifications no longer require a full query reprocessing.

As a result, the latency of repeated query runs was reduced by up to 80%, enhancing the user experience by providing faster insights. This optimization not only accelerates data retrieval but also significantly reduces costs because there’s no need to rescan data for every minor change.
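As a minimal sketch of how result reuse might be requested per query, the following helper builds the arguments for the Athena StartQueryExecution API. The helper name, database name, output location, query text, and the 60-minute maximum age are assumptions for illustration; the post does not state which values AppsFlyer used.

```python
from typing import Any, Dict


def start_query_kwargs(
    sql: str,
    database: str,
    output_location: str,
    max_age_minutes: int = 60,
) -> Dict[str, Any]:
    """Arguments for StartQueryExecution with result reuse enabled."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                # Reuse a stored result only if it is at most this old.
                "MaxAgeInMinutes": max_age_minutes,
            }
        },
    }


kwargs = start_query_kwargs(
    sql="SELECT sketch FROM events_sketches WHERE app_id = 'com.demo'",  # hypothetical query
    database="estimations",                        # hypothetical database name
    output_location="s3://bucket/athena-results/",  # hypothetical results location
)
```

With boto3 installed, these arguments could be passed as boto3.client("athena").start_query_execution(**kwargs). Because reuse depends on a matching query string, sub-queries generated for fixed date windows are more likely to hit a stored result than a single query whose range shifts with every adjustment.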

Solution rollout: Testing and monitoring

In this section, we discuss the process of rolling out the new architecture, including testing and monitoring.

Solving Amazon S3 slowdown errors

During the solution testing phase, the team developed an automation process designed to assess the different audiences within the system, using the data organized within the newly implemented schema. The methodology involved a comparative analysis of results obtained from HBase against those derived from Athena.

While running these tests, the team examined the accuracy of the estimations retrieved and also the latency change.

In this testing phase, the team encountered some failures when running many concurrent queries at once. These failures were caused by Amazon S3 throttling due to too many GET requests to the same prefix produced by concurrent Athena queries.

To handle the throttling (slowdown errors), the team added a retry mechanism for query runs with an exponential back-off strategy (wait time increases exponentially with a random offset to prevent concurrent retries).
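A retry loop of this kind could look like the following Python sketch. The ThrottledError exception, the delay values, and the attempt limit are assumptions chosen for illustration, not AppsFlyer's actual settings; run_query stands in for whatever submits the Athena query.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical exception standing in for a query that failed on an S3 slowdown error."""


def run_with_backoff(
    run_query: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `run_query` on throttling, waiting exponentially longer each time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return run_query()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Exponential wait: base_delay, 2x, 4x, ... capped at max_delay.
            wait = min(max_delay, base_delay * 2 ** attempt)
            # Random offset so concurrent callers do not all retry at the same instant.
            sleep(wait + random.uniform(0, base_delay))
    raise RuntimeError("unreachable")
```

The sleep parameter is injectable so the behavior can be checked in tests without actually waiting.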

Rollout preparations

At first, the team initiated a 1-month backfilling process as a cost-conscious approach, prioritizing accuracy validation before committing to a comprehensive 2-year backfill.

The backfilling process included running the Spark job (write-events-estimation-sketches) over the desired time range. The job read from the data warehouse, created sketches from the data, and wrote them to files in the specific schema that the team defined. Additionally, because the team used partition projection, they could skip the process of updating the Data Catalog with every partition being added.

This step-by-step approach allowed them to confirm the correctness of their solution before proceeding with the entire historical dataset.

With confidence in the accuracy achieved during the initial phase, the team systematically expanded the backfilling process to encompass the full 2-year timeframe, assuring a thorough and reliable implementation.

Before the official release of the updated solution, a robust monitoring strategy was implemented to safeguard stability. Key monitors were configured to assess critical aspects, such as query and API latency, error rates, and API availability.

After the data was stored in Amazon S3 as Parquet files, the following rollout process was designed:

Keep both HBase and Athena writing processes running, stop reading from HBase, and start reading from Athena.
Stop writing to HBase.
Sunset HBase.

Improvements and optimizations with Athena

The migration from HBase to Athena, using partition projection and optimized data structures, has not only resulted in a 10% improvement in query performance, but has also significantly boosted overall system stability by scanning only the necessary data partitions. In addition, the transition to a serverless model with Athena has achieved an impressive 80% reduction in monthly costs compared to the previous setup. This is due to eliminating infrastructure management expenses and aligning costs directly with usage, thereby positioning the organization for more efficient operations, improved data analysis, and superior business outcomes.

The following table summarizes the improvements and the optimizations implemented by the team.

Area of Improvement	Action Taken	Measured Improvement
Athena partition projection	Partition projection over the large number of partitions, avoiding limiting the number of partitions; partition by `event_name` and `app_id`	Hundreds of percent improvement in query performance. This was the most significant improvement, which allowed the solution to be feasible.
Partitioning and sorting	Partitioning by `app_id` and sorting `event_name` with daily granularity	100% improvement in jobs calculating the sketches. 5% latency in query performance.
Time range queries	Splitting long time range queries into multiple queries running in parallel	20% improvement in query performance.
Reducing results set size	Schema refactoring	90% improvement in overall query time.
Query result reuse	Supporting Athena query results reuse	80% improvement in queries run more than once in the given time.

Conclusion

In this post, we showed how Athena became the main component of the AppsFlyer Audiences Segmentation offering. We explored various optimization techniques such as data merging, partition projection, schema redesign, parallel queries, the Parquet file format, and query result reuse.

We hope our experience provides valuable insights to enhance the performance of your Athena-based applications. Additionally, we recommend checking out Athena performance best practices for further guidance.

About the Authors

Nofar Diamant is a software team lead at AppsFlyer with a current focus on fraud protection. Before diving into this realm, she led the Retargeting team at AppsFlyer, which is the subject of this post. In her spare time, Nofar enjoys sports and is passionate about mentoring women in technology. She is dedicated to shifting the industry’s gender demographics by increasing the presence of women in engineering roles and encouraging them to succeed.

Matan Safri is a backend developer specializing in big data in the Retargeting team at AppsFlyer. Before joining AppsFlyer, Matan was a backend developer in the IDF and completed an MSc in electrical engineering, majoring in computers, at BGU University. In his spare time, he enjoys wave surfing, yoga, traveling, and playing the guitar.

Michael Pelts is a Principal Solutions Architect at AWS. In this position, he works with major AWS customers, assisting them in developing innovative cloud-based solutions. Michael enjoys the creativity and problem-solving involved in building effective cloud architectures. He also likes sharing his extensive experience in SaaS, analytics, and other domains, empowering customers to elevate their cloud expertise.

Orgad Kimchi is a Senior Technical Account Manager at Amazon Web Services. He serves as the customer’s advocate and assists his customers in achieving cloud operational excellence, focusing on architecture and AI/ML in alignment with their business goals.