How UiPath Built a Scalable Real-Time ETL Pipeline on Databricks


Delivering on the promise of real-time agentic automation requires a fast, reliable, and scalable data foundation. At UiPath, we needed a modern streaming architecture to underpin products like Maestro and Insights, enabling near real-time visibility into agentic automation metrics as they unfold. That journey led us to unify batch and streaming on Azure Databricks using Apache Spark™ Structured Streaming, enabling cost-efficient, low-latency analytics that support agentic decision-making across the enterprise.

This blog details the technical approach, trade-offs, and impact of these improvements.

With Databricks-based streaming, we have achieved sub-minute event-to-warehouse latency while delivering a simplified architecture and future-proof scalability, setting the new standard for event-driven data processing across UiPath.

Why Streaming Matters for UiPath Maestro and UiPath Insights

At UiPath, products like Maestro and Insights rely heavily on timely, reliable data. Maestro acts as the orchestration layer for our agentic automation platform, coordinating AI agents, robots, and humans based on real-time events. Whether it's reacting to a system trigger, executing a long-running workflow, or including a human-in-the-loop step, Maestro depends on fast, accurate signal processing to make the right decisions.

UiPath Insights, which powers monitoring and analytics across these automations, adds another layer of demand: capturing key metrics and behavioral signals in near real time to surface trends, calculate ROI, and support issue detection.

Delivering these kinds of outcomes – reactive orchestration and real-time observability – requires a data pipeline architecture that is not only low-latency, but also scalable, reliable, and maintainable. That need is what led us to rethink our streaming architecture on Azure Databricks.

Building the Streaming Data Foundation

Delivering on the promise of powerful analytics and real-time monitoring requires a foundation of scalable, reliable data pipelines. Over the past few years, we have developed and expanded several pipelines to support new product features and respond to evolving business requirements. Now, we have the opportunity to evaluate how we can optimize these pipelines to not only save costs, but also achieve better scalability and an at-least-once delivery guarantee to support data from new services like Maestro.

Previous architecture

While our previous setup (shown above) worked well for our customers, it also revealed areas for improvement:

  1. The batching pipeline introduced up to 30 minutes of latency and relied on a complex infrastructure.
  2. The real-time pipeline delivered faster data but came at a higher cost.
  3. For Robotlogs, our largest dataset, we maintained separate ingestion and storage paths for historical and real-time processing, resulting in duplication and inefficiency.
  4. To support the new ETL pipeline for UiPath Maestro, a new UiPath product, we would need to achieve an at-least-once delivery guarantee.

To address these challenges, we undertook a major architectural overhaul. We merged the batch and real-time ingestion processes for Robotlogs into a single pipeline, and re-architected the real-time ingestion pipeline to be more cost-efficient and scalable.

Why Spark Structured Streaming on Databricks?

As we set out to simplify and modernize our pipeline architecture, we needed a framework that could handle both high-throughput batch workloads and low-latency real-time data without introducing operational overhead. Spark Structured Streaming (SSS) on Azure Databricks was a natural fit.

Built on top of Spark SQL and Spark Core, Structured Streaming treats real-time data as an unbounded table, allowing us to reuse familiar Spark batch constructs while gaining the benefits of a fault-tolerant, scalable streaming engine. This unified programming model reduced complexity and accelerated development.
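As a minimal illustration of that unified model (the table and column names here are placeholders, not our production code), the same DataFrame transformation can be applied to a bounded table with spark.read and to an unbounded stream with spark.readStream:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()  # already provided in Databricks notebooks

# One transformation, written once against the DataFrame API.
def drop_malformed(df: DataFrame) -> DataFrame:
    return df.filter(col("eventId").isNotNull())

batch_df = drop_malformed(spark.read.table("raw_events"))         # bounded table
stream_df = drop_malformed(spark.readStream.table("raw_events"))  # same code, unbounded table
```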

We had already used Spark Structured Streaming to build our Real-time Alerts feature, which relies on stateful stream processing in Databricks. Now, we are expanding its use to build our next generation of real-time ingestion pipelines, enabling us to achieve low latency, scalability, cost efficiency, and at-least-once delivery guarantees.

The Next Generation of Real-time Ingestion

Our new architecture, shown below, dramatically simplifies the data ingestion process by consolidating previously separate components into a unified, scalable pipeline using Spark Structured Streaming on Databricks:

At the core of this new design is a set of streaming jobs that read directly from event sources. These jobs perform parsing, filtering, and flattening, and, most critically, join each event with reference data to enrich it before writing to our data warehouse.
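A condensed PySpark sketch of that shape is shown below. The broker, topic, schema, and table names are placeholders and the enrichment is reduced to a single reference join; the structure (read, parse, flatten, filter, join) follows the description above, with the write, trigger, and delivery details covered in later sections.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Illustrative event schema; real payloads carry more fields.
event_schema = StructType([
    StructField("eventId", StringType()),
    StructField("tenantId", StringType()),
    StructField("eventType", StringType()),
    StructField("occurredAt", TimestampType()),
])

# Reference data used to enrich each event (e.g., tenant metadata).
reference_df = spark.read.table("reference.tenants")

# Read directly from the event source (shown here as a Kafka-compatible bus).
raw_stream = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "<broker>")
        .option("subscribe", "platform-events")
        .load()
)

enriched_events = (
    raw_stream
        # Parse the JSON payload and flatten it into top-level columns.
        .withColumn("payload", from_json(col("value").cast("string"), event_schema))
        .select("payload.*")
        # Filter out malformed records.
        .filter(col("eventId").isNotNull())
        # Join with reference data to enrich the event before it is written.
        .join(reference_df, on="tenantId", how="left")
)
```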

We orchestrate these jobs using Databricks Lakeflow Jobs, which helps manage retries and job recovery in case of transient failures. This streamlined setup improves both developer productivity and system reliability.

The benefits of this new architecture include:

  • Cost efficiency: Saves COGS by reducing infrastructure complexity and compute usage
  • Low latency: Ingestion latency averages around one minute, with the flexibility to reduce it further
  • Future-proof scalability: Throughput is proportional to the number of cores, and we can scale out horizontally as traffic grows
  • No data loss: Spark does the heavy lifting of failure recovery, supporting at-least-once delivery
    • With downstream sink deduplication planned for future development, we will be able to achieve exactly-once delivery
  • Fast development cycle thanks to the Spark DataFrame API
  • Simple and unified architecture

Low Latency

Our streaming job currently runs in micro-batch mode with a one-minute trigger interval. This means that from the moment an event is published to our Event Bus, it typically lands in our data warehouse in about 27 seconds at the median, with 95% of records arriving within 51 seconds and 99% within 72 seconds.

Structured Streaming offers configurable trigger settings, which could bring the latency down to a few seconds. For now, we've chosen the one-minute trigger as the right balance between cost and performance, with the flexibility to lower it in the future if requirements change.
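Continuing the sketch from the previous section, the trigger interval is a single line on the write side, so lowering it is a configuration change rather than a redesign (the table name and checkpoint path are again placeholders):

```python
query = (
    enriched_events.writeStream
        .trigger(processingTime="1 minute")  # current balance of cost and latency
        .option("checkpointLocation", "/checkpoints/enriched_events")
        .toTable("warehouse.enriched_events")
)
# A tighter interval, e.g. .trigger(processingTime="5 seconds"), would cut latency
# further at the cost of more frequent, smaller micro-batches.
```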

Scalability

Spark divides large data workloads into partitions, fully utilizing the worker/executor CPU cores. Each Structured Streaming job is split into stages, which are further divided into tasks, each of which runs on a single core. This level of parallelization allows us to fully utilize our Spark cluster and scale efficiently with growing data volumes.

Thanks to optimizations like in-memory processing, Catalyst query planning, whole-stage code generation, and vectorized execution, we process around 40,000 events per second in scalability validation. If traffic increases, we can scale out simply by increasing partition counts on the source Event Bus and adding more worker nodes, ensuring future-proof scalability with minimal engineering effort.
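Most of that scaling is operational: add partitions on the Event Bus and nodes to the cluster. On the job side, the sketch below shows the kind of source options that help keep every core busy and bound the work per micro-batch; it assumes a Kafka-compatible reader and placeholder values, not our exact connector or settings.

```python
raw_stream = (
    spark.readStream  # spark: the active SparkSession from the earlier sketch
        .format("kafka")
        .option("kafka.bootstrap.servers", "<broker>")
        .option("subscribe", "platform-events")
        # Bound how many records a single micro-batch pulls, so a backlog is
        # drained in predictable chunks instead of one oversized batch.
        .option("maxOffsetsPerTrigger", "2000000")
        # Ask Spark to split the source into at least this many tasks, so all
        # executor cores stay busy even if the topic has fewer partitions.
        .option("minPartitions", "64")
        .load()
)
```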

Delivery Guarantee

Spark Structured Streaming provides exactly-once delivery by default, thanks to its checkpointing system. After each micro-batch, Spark persists the progress (or "epoch") of each source partition as write-ahead logs and the job's application state in the state store. In the event of a failure, the job resumes from the last checkpoint, ensuring no data is lost or skipped.

This is discussed in the original Spark Structured Streaming research paper, which states that achieving exactly-once delivery requires:

  1. The input source to be replayable
  2. The output sink to support idempotent writes

However there’s additionally an implicit third requirement that always goes unstated: the system should have the ability to detect and deal with failures gracefully.

This is where Spark does well: its robust failure recovery mechanisms can detect task failures, executor crashes, and driver issues, and automatically take corrective actions such as retries or restarts.

Note that we currently operate with at-least-once delivery, as our output sink is not yet idempotent. If we need exactly-once delivery in the future, we should be able to achieve it by investing further engineering effort in idempotency.
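If we do pursue exactly-once delivery later, one common pattern is to make the sink idempotent by writing each micro-batch through foreachBatch with a merge on a unique key. The sketch below assumes a Delta table sink and an eventId key; it is one option under consideration, not something we have shipped.

```python
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Merging on a unique event key makes the write idempotent: if a micro-batch
    # is replayed after a failure, rows that already exist are simply skipped.
    target = DeltaTable.forName(spark, "warehouse.enriched_events")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.eventId = s.eventId")
        .whenNotMatchedInsertAll()
        .execute())

query = (
    enriched_events.writeStream
        .foreachBatch(upsert_batch)
        .option("checkpointLocation", "/checkpoints/enriched_events_dedup")
        .start()
)
```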

Raw Data is Better

We’ve got additionally made another enhancements. We’ve got now included and continued a standard rawMessage area throughout all tables. This column shops the unique occasion payload as a uncooked string. To borrow the sushi precept (though we imply a barely totally different factor right here): uncooked knowledge is best.

Raw data significantly simplifies troubleshooting. When something goes wrong, like a missing field or an unexpected value, we can refer directly to the original message and trace the issue without chasing down logs or upstream systems. Without this raw payload, diagnosing data issues becomes much harder and slower.

The downside is a small increase in storage. But thanks to cheap cloud storage and the columnar format of our warehouse, this has minimal cost and no impact on query performance.

Simple and Powerful API

The new implementation takes us less development time. This is largely thanks to the DataFrame API in Spark, which provides a high-level, declarative abstraction over distributed data processing. In the past, using RDDs meant manually reasoning about execution plans, understanding DAGs, and optimizing the order of operations like joins and filters. DataFrames let us focus on what we want to compute, rather than how to compute it. This significantly simplifies the development process.

This has also improved operations. We no longer need to manually rerun failed jobs or trace errors across multiple pipeline components. With a simplified architecture and fewer moving parts, both development and debugging are significantly easier.

Driving Real-Time Analytics Across UiPath

The success of this new architecture has not gone unnoticed. It has quickly become the new standard for real-time event ingestion across UiPath. Beyond its initial implementation for UiPath Maestro and Insights, the pattern has been widely adopted by several new teams and projects for their real-time analytics needs, including those working on cutting-edge initiatives. This widespread adoption is a testament to the architecture's scalability, efficiency, and extensibility, making it easy for new teams to onboard and enabling a new generation of products with powerful real-time analytics capabilities.

For those who’re trying to scale your real-time analytics workloads with out the operational burden, the structure outlined right here presents a confirmed path, powered by Databricks and Spark Structured Streaming and able to assist the subsequent era of AI and agentic methods.

About UiPath
UiPath (NYSE: PATH) is a global leader in agentic automation, empowering enterprises to harness the full potential of AI agents to autonomously execute and optimize complex business processes. The UiPath Platform™ uniquely combines controlled agency, developer flexibility, and seamless integration to help organizations scale agentic automation safely and confidently. Committed to security, governance, and interoperability, UiPath supports enterprises as they transition into a future where automation delivers on the full potential of AI to transform industries.
