This is a guest post co-written with Shashidhar Soppin, Manochandra Menni, and Anchal Kansal from Zeta.
Zeta is a core banking technology provider that enables banks to rapidly launch extensible banking asset and liability products. Zeta’s main products are Olympus and Tachyon. Olympus is a platform as a service (PaaS) that simplifies building and operating cloud-native, secure, and distributed multi-tenant software as a service (SaaS) products. It blends infrastructure as code and GitOps methodologies for efficient and consistent deployment of SaaS products. Its architecture prioritizes strong tenant isolation, real-time event processing, and comprehensive observability, supporting robust API integrations and seamless deployment. Zeta’s Tachyon is a full-stack, cloud-native, API-first digital banking SaaS service delivered through Olympus. The banking services of Tachyon include payment engines (for UPI, credit, debit, and prepaid cards), savings and checking account management, and so on. Tachyon is a modern debit processing product with personal finance management and card controls. It’s designed to increase usage, upsell credit, reduce fraud, and improve customer satisfaction. The Tachyon product offers comprehensive provisioning, payments, and account management APIs and SDKs, enabling seamless integration of financial products into third-party apps without compromising privacy and security. Zeta operates Tachyon as a multi-tenant SaaS product, serving customers who are configured as individual tenants within the system. Zeta’s technology stack is monitored by their Customer Service Navigator (CSN) product, which is part of Olympus.
As a global SaaS provider, Zeta needed a solution capable of monitoring tenants, measuring SLAs, meeting local regulatory requirements, and scaling efficiently with both new tenant onboarding and seasonal usage spikes. Zeta sought a cost-effective, scalable system that would provide a unified “single pane of glass” to monitor the application services, cloud infrastructure, open-source components, and third-party products.
Zeta faced a formidable challenge in orchestrating a cohesive monitoring system across a rapidly expanding multi-tenant environment, diverse domains, and numerous tools. As more tenants joined their system, the complexity grew exponentially, making Zeta’s monitoring solution increasingly difficult to maintain. The primary challenge stemmed from fragmented monitoring tools that made it difficult to quickly identify root causes across interconnected systems, leading to prolonged troubleshooting times and potential service degradation. When users reported issues, such as credit card payment problems, the Site Reliability Engineering (SRE) team had to navigate multiple disparate monitoring tools and siloed data, and the lack of integrated observability resulted in time-consuming manual correlation efforts. This multi-tenant, multi-solution landscape significantly complicated the ability to maintain consistent monitoring standards and service levels. The challenge was further complicated by the complex regulatory landscape, where global expansion required adherence to various local regulations, necessitating a flexible architecture capable of accommodating varying data retention policies and access controls across different jurisdictions. Each new tenant addition multiplied the complexity of balancing the monitoring needs of internal SRE teams and customers, requiring sophisticated data segregation and access management. Additionally, Zeta required comprehensive anomaly detection capabilities across systems, components, infrastructure, and operations, which called for a solution that could scale dynamically while establishing dynamic baselines and identifying subtle patterns that might indicate emerging issues.
As the tenant base continued to grow, the need for a unified, scalable monitoring solution that could streamline these processes, enhance operational visibility, and maintain system integrity became critical.
Zeta’s goal was to streamline their processes and enhance operational visibility across the entire technology landscape. By addressing these challenges, Zeta aimed to create a unified observability solution that would significantly improve incident response times, enhance regulatory compliance posture, and ultimately deliver a more reliable and performant service to their global customer base.
In this post, we explain how Zeta built a more unified monitoring solution using Amazon OpenSearch Service that improved performance, reduced manual processes, and increased end-user satisfaction. Zeta has achieved over an 80% reduction in mean time to resolution (MTTR), with incident response times decreasing from 30+ minutes to under 5 minutes.
Solution overview
Zeta designed and built an observability system, CSN, to deliver comprehensive visibility across the service environment. CSN is part of the Olympus suite of products. CSN serves as the primary interface for the SRE team, offering real-time service health dashboards, infrastructure monitoring, SLA performance analytics, and an admin panel for user management. The system is equipped with single sign-on (SSO) integration and enforces role-based access control (RBAC) to enable secure, granular access. With CSN, SREs can efficiently monitor system health, receive actionable alerts and warnings, and manage operational workflows across critical services.
CSN is powered by OpenSearch Service to provide an integrated solution for DevOps and Site Reliability Engineers to help identify critical events and issues. Zeta chose OpenSearch Service because it offers a fully managed, open-source search and analytics engine that scales effortlessly to handle the increasing number of tenants, associated data growth, and analytics needs. Its seamless integration with AWS services, robust security features, and support for real-time data ingestion and querying make it ideal for powering the CSN dashboards and analytics workloads. The following diagram illustrates the CSN deployment architecture.
The OpenSearch Service domain uses the Multi-AZ with Standby deployment model, following AWS best practices for high availability and fault tolerance. Nodes, including dedicated cluster manager nodes, data nodes, and UltraWarm nodes, are distributed evenly across three Availability Zones in the same AWS Region. Availability Zones 1 and 2 handle active indexing and search traffic, and Availability Zone 3 contains standby nodes that remain passive during normal operations. If an Availability Zone failure occurs, OpenSearch Service automatically promotes standby nodes to active status, maintaining cluster operations with minimal disruption and no need for data redistribution.
The OpenSearch cluster consists of three dedicated cluster manager nodes and a multiple-of-three data node count to maintain quorum and balanced shard allocation. Each index uses at least two replicas, providing redundant copies of data across the Availability Zones. This Multi-AZ with Standby configuration delivers high resilience and rapid failover, supporting continuous service availability and robust disaster recovery for the observability workloads.
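As a rough illustration of the topology described above, the following Python sketch builds a dictionary shaped like the `ClusterConfig` argument of the OpenSearch Service `CreateDomain` API. It is not Zeta’s actual configuration: the node counts, the cluster manager instance type, and the helper name `build_cluster_config` are assumptions made for the example, and the sketch only constructs and validates the dictionary without calling AWS.

```python
def build_cluster_config(data_nodes: int, warm_nodes: int,
                         manager_type: str = "m6g.large.search") -> dict:
    """Return a ClusterConfig-shaped dict for a Multi-AZ with Standby domain.

    manager_type and the node counts are illustrative assumptions.
    """
    # Data nodes must spread evenly over three Availability Zones.
    if data_nodes <= 0 or data_nodes % 3 != 0:
        raise ValueError("data node count must be a positive multiple of three")
    return {
        "InstanceType": "m6g.4xlarge.search",    # hot data nodes
        "InstanceCount": data_nodes,
        "DedicatedMasterEnabled": True,          # dedicated cluster manager nodes
        "DedicatedMasterType": manager_type,
        "DedicatedMasterCount": 3,
        "ZoneAwarenessEnabled": True,
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
        "MultiAZWithStandbyEnabled": True,
        "WarmEnabled": True,                     # UltraWarm tier
        "WarmType": "ultrawarm1.large.search",
        "WarmCount": warm_nodes,
        "ColdStorageOptions": {"Enabled": True},
    }


# Index-level setting matching "at least two replicas" per index.
INDEX_SETTINGS = {"index": {"number_of_replicas": 2}}

if __name__ == "__main__":
    print(build_cluster_config(data_nodes=6, warm_nodes=3))
```

With two replicas, each shard has three copies in total, so one copy can sit in each of the three Availability Zones.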
Data collection and ingestion
The observability strategy centers on a data collection and ingestion pipeline designed to handle the complexity and scale of Zeta’s environment. The architecture, as shown in the following diagram, addresses three critical data types: AWS resource logs, application logs, and distributed traces, with each data type using tailored collection and processing methods optimized for the workloads.
AWS resource logs collection
The infrastructure spans multiple AWS services including Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Relational Database Service (Amazon RDS), Amazon Redshift, Application Load Balancer, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Elastic Compute Cloud (Amazon EC2), and more. Zeta uses Amazon CloudWatch Logs as the primary collection point for AWS service logs, which provides native integration with these services.
AWS services deliver their logs directly to CloudWatch Logs, which are then pulled by Fluentd running on the Amazon EKS cluster for centralized processing. This approach natively captures operational data from the AWS resources, including:
- Database operational logs and audit trails from Amazon RDS instances
- Data warehouse query execution logs from Amazon Redshift
- Application Load Balancer access logs capturing traffic patterns and performance metrics
- Kafka cluster operational logs from Amazon MSK
- AWS API invocation audit trails from AWS CloudTrail
- Container runtime and operating system logs from Amazon EC2
During log collection, personally identifiable information (PII) is filtered out. The solution adheres strictly to PCI-DSS guidelines throughout this process.
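The post does not describe how the PII filter is implemented. As a minimal sketch of the idea, the following Python function masks two common kinds of sensitive values in a log line before it is forwarded. The regular expressions, the placeholder tokens, and the function name `redact` are assumptions for illustration and are not a complete PCI-DSS control.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
# Caveat: this also matches other long digit runs, such as epoch milliseconds.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
# Deliberately simple email pattern, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def redact(line: str) -> str:
    """Replace card-number-like and email-like substrings with placeholders."""
    line = CARD_RE.sub("[REDACTED_PAN]", line)
    return EMAIL_RE.sub("[REDACTED_EMAIL]", line)


if __name__ == "__main__":
    print(redact("card 4111 1111 1111 1111 declined for a.b@example.com"))
```

In a Fluentd pipeline, this kind of rule would normally live in a filter stage so that sensitive values never reach Amazon MSK or OpenSearch Service.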
Zeta uses Amazon MSK as a scalable and reliable backbone for collecting and streaming logs from various sources across the AWS resources. Logs are ingested into Amazon MSK, providing a durable and fault-tolerant buffer that decouples log producers from consumers. This architecture enables real-time log streaming and supports advanced processing pipelines before the logs are routed to OpenSearch Service. Integrating Amazon MSK into the logging workflow improves scalability, resilience, and flexibility, so that high log volumes are efficiently managed without impacting downstream systems. This approach, combined with native AWS integrations, minimizes operational complexity and maintains comprehensive, centralized log visibility across the cloud environment.
Fluentd processes these logs and routes them directly to OpenSearch Service, maintaining the benefits of AWS integration while providing centralized accessibility. This centralized logging approach with built-in buffering capabilities reduces the direct load on OpenSearch Service by batching and optimizing log delivery, helping to prevent potential ingestion bottlenecks during high-volume periods. The approach alleviates the need for custom log shipping agents on AWS resources, reducing operational overhead while maintaining comprehensive coverage of the cloud infrastructure.
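To make the batching idea concrete, the following Python sketch groups log records into newline-delimited payloads in the format accepted by the OpenSearch `_bulk` API, so that many documents are indexed per request. In the architecture described here, Fluentd’s own buffer and output plugin do this work; the function name `to_bulk_payloads` and the batch size are assumptions for the example.

```python
import json
from typing import Iterable, Iterator


def to_bulk_payloads(docs: Iterable[dict], index: str,
                     batch_size: int = 500) -> Iterator[str]:
    """Yield _bulk request bodies containing at most batch_size documents each."""
    lines = []
    count = 0
    for doc in docs:
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
        count += 1
        if count == batch_size:
            # A bulk body must end with a trailing newline.
            yield "\n".join(lines) + "\n"
            lines, count = [], 0
    if lines:
        yield "\n".join(lines) + "\n"


if __name__ == "__main__":
    records = [{"message": f"event {i}"} for i in range(5)]
    for body in to_bulk_payloads(records, index="app-index", batch_size=2):
        print(body)
```

Sending fewer, larger requests is what reduces the per-request overhead on the domain during high-volume periods.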
Application logs processing
For application-level observability, a pipeline using Fluentd is deployed as a Kubernetes DaemonSet. Application microservices running on Amazon EKS generate logs that the Fluentd DaemonSet collects, parses, and enriches with metadata such as pod names, namespaces, and service identifiers. The processed logs then flow through Amazon MSK for reliable, high-throughput message streaming before final processing by Fluentd and indexing in OpenSearch Service.
This Kafka-based approach provides several advantages:
- Decoupling – Producers and consumers operate independently, so that Zeta can scale ingestion and processing separately based on demand.
- Backpressure handling – Kafka’s buffering capabilities manage traffic spikes during peak banking hours, absorbing sudden increases in log volume while maintaining system stability during seasonal usage surges.
- Durability of logs – Message persistence means that no log data is lost during system maintenance or unexpected failures.
The logs then pass through a second Fluentd layer for final processing and routing to OpenSearch Service, where they’re indexed across service-specific indexes (app-index, falco-index, kong-index).
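The enrichment and routing steps can be sketched as follows. This Python example attaches Kubernetes metadata to a record and picks one of the service-specific indexes named above. The `source` labels, the `kubernetes` field layout, and the function name `enrich_and_route` are assumptions for the example; the real pipeline performs these steps in Fluentd configuration.

```python
# Hypothetical mapping from a log source label to the target index.
INDEX_BY_SOURCE = {
    "application": "app-index",
    "falco": "falco-index",
    "kong": "kong-index",
}


def enrich_and_route(record: dict, source: str, pod: str,
                     namespace: str, service: str) -> tuple:
    """Return (index_name, enriched_copy) without mutating the input record."""
    if source not in INDEX_BY_SOURCE:
        raise ValueError(f"unknown log source: {source}")
    enriched = dict(record)
    enriched["kubernetes"] = {
        "pod_name": pod,
        "namespace": namespace,
        "service": service,
    }
    return INDEX_BY_SOURCE[source], enriched


if __name__ == "__main__":
    index, doc = enrich_and_route({"message": "upstream timeout"}, source="kong",
                                  pod="kong-0", namespace="gateway", service="kong")
    print(index, doc)
```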
Distributed trace collection
To address the challenge of correlating issues across Zeta’s microservices architecture, the system uses distributed tracing with Jaeger, an open-source, end-to-end distributed tracing system. Jaeger enables monitoring and troubleshooting of transactions in complex distributed systems by tracking requests as they flow through multiple services. The application services and Kong API Gateway are instrumented with Jaeger client libraries that generate trace data including spans, which represent individual operations within a trace. Each span contains metadata such as operation names, start and end timestamps, tags, and logs that provide context about the operation being performed. The Jaeger Collector aggregates these spans from multiple services, performing validation, indexing, and transformation before forwarding the data.
The traces flow through Amazon MSK for the same reliability benefits as the logging pipeline: durability, decoupling, and backpressure handling during high-volume periods. Jaeger Ingester then consumes traces from Amazon MSK and processes them for storage in the jaeger-index within OpenSearch Service.
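To show how span metadata supports troubleshooting, the following Python sketch models a span with the fields described above and finds the slowest downstream (non-root) operation in a trace. It is a simplified stand-in and not Jaeger’s actual data model; the class, field, and service names are assumptions for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    operation_name: str
    service: str
    start_us: int                       # start timestamp in microseconds
    end_us: int                         # end timestamp in microseconds
    parent_span_id: Optional[str] = None
    tags: Dict[str, str] = field(default_factory=dict)

    @property
    def duration_us(self) -> int:
        return self.end_us - self.start_us


def slowest_downstream_span(spans: List[Span], trace_id: str) -> Optional[Span]:
    """Return the longest non-root span of the trace, or None if there is none."""
    children = [s for s in spans
                if s.trace_id == trace_id and s.parent_span_id is not None]
    return max(children, key=lambda s: s.duration_us, default=None)


if __name__ == "__main__":
    trace = [
        Span("t1", "a", "POST /payments", "kong", 0, 900),
        Span("t1", "b", "authorize", "card-service", 100, 800, parent_span_id="a"),
        Span("t1", "c", "ledger.write", "ledger", 300, 450, parent_span_id="b"),
    ]
    slow = slowest_downstream_span(trace, "t1")
    print(slow.service, slow.operation_name, slow.duration_us)
```

Because every span carries the same trace ID, a slow request reported at the gateway can be followed to the service where most of the time was spent.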
This data collection and ingestion strategy provides full end-to-end visibility and builds an observability system that enables SRE teams to monitor, troubleshoot, and optimize the services across the entire technology stack.
Storage tiering
To manage the log, metric, and trace data at scale (about 3 TB generated daily), the solution implements OpenSearch Service storage tiering to balance performance, retention, and cost. Zeta requires near real-time search and retrieval for at least a week, while retaining logs and traces for up to 10 years. Keeping this data in active clusters would impact search performance and significantly increase costs, so the solution uses the OpenSearch Service hot, UltraWarm, and cold storage tiers to optimize the data lifecycle. The following diagram illustrates storage tiering in OpenSearch Service.
Hot storage is used for the most recent and frequently accessed data, supporting real-time indexing and low-latency queries. This tier relies on high-performance storage attached to standard data nodes, making it ideal for powering live dashboards and analytics where speed is critical. The solution uses AWS Graviton2-powered m6g.4xlarge.search instance types to run the OpenSearch Service domain, which provides up to 40% lower cost compared to x86-based instances. Each hot data node has an attached gp3 EBS volume to store indexes. Zeta maintains data in hot storage for 1 week.
UltraWarm storage serves as a cost-effective layer for older, read-only data that’s queried less frequently but still needs to remain searchable. UltraWarm nodes use Amazon Simple Storage Service (Amazon S3) as the backing store with an integrated caching mechanism, to retain large volumes of data at a fraction of the cost of hot storage while still supporting interactive queries for historical analysis. Zeta uses ultrawarm1.large.search instance types in the UltraWarm storage tier and maintains data in UltraWarm storage for 15 days.
Cold storage is designed for long-term archival of infrequently accessed or compliance-driven data. Data in cold storage is detached from active compute resources and resides in Amazon S3, incurring minimal cost. When historical data needs to be queried, the indexes are attached to the UltraWarm nodes using OpenSearch API calls. This supports extracting historical data for audits, periodic analysis, or forensic investigations without maintaining active compute for the entire retention period, thereby reducing storage cost.
OpenSearch Service automates index transitions between hot, UltraWarm, and cold storage tiers using Index State Management (ISM) policies. ISM policies specify the conditions and actions for each state, such as transitioning based on index age, size, or document count. ISM jobs, which run every 5 to 8 minutes, evaluate the policy and move an index to the next tier when it qualifies for a transition. When indexes reach the UltraWarm threshold, they’re migrated to UltraWarm nodes backed by Amazon S3, which reduces storage costs while keeping data accessible for queries. After the UltraWarm retention period, ISM archives the indexes to cold storage, detaching them from compute resources but allowing reattachment for future queries or compliance needs. This automated lifecycle management reduces operational overhead, optimizes storage costs, and maintains performance for both recent and historical data.
For observability data, new indexes are created in the hot tier, where they remain for 7 days to support fast ingestion and low-latency queries. After this period, ISM transitions these indexes to UltraWarm storage, where they’re retained for an additional 15 days as read-only data, balancing cost with searchability.
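The lifecycle described above can be expressed as an ISM policy document. The following Python sketch builds one with the 7-day hot and 15-day UltraWarm retention periods from this post, using the `warm_migration` and `cold_migration` actions that OpenSearch Service provides. The policy description, the index patterns, the `@timestamp` field name, and the function name `build_ism_policy` are assumptions for the example, not Zeta’s actual policy.

```python
import json


def build_ism_policy(hot_days: int = 7, warm_days: int = 15,
                     timestamp_field: str = "@timestamp",
                     index_patterns=("app-index*", "jaeger-index*")) -> dict:
    """Build a hot -> UltraWarm -> cold ISM policy document."""
    # min_index_age is measured from index creation, so the move to cold
    # happens after the hot and warm retention periods combined.
    return {
        "policy": {
            "description": "Hot to UltraWarm to cold lifecycle (illustrative)",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [],
                    "transitions": [{
                        "state_name": "warm",
                        "conditions": {"min_index_age": f"{hot_days}d"},
                    }],
                },
                {
                    "name": "warm",
                    "actions": [{"warm_migration": {}}],
                    "transitions": [{
                        "state_name": "cold",
                        "conditions": {"min_index_age": f"{hot_days + warm_days}d"},
                    }],
                },
                {
                    "name": "cold",
                    "actions": [{"cold_migration": {"timestamp_field": timestamp_field}}],
                    "transitions": [],
                },
            ],
            # Attach the policy automatically to newly created matching indexes.
            "ism_template": [{"index_patterns": list(index_patterns), "priority": 100}],
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_ism_policy(), indent=2))
```

A document like this would be submitted to the domain with a `PUT` request to `_plugins/_ism/policies/<policy_id>`, where the policy ID is chosen by the operator.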
Security
Security is the most critical part of the architecture. Zeta’s observability system implements multiple layers of security for data confidentiality, integrity, and compliance with banking regulations, and is built using a zero-trust approach following the AWS shared responsibility model for OpenSearch Service:
- Infrastructure security: The OpenSearch Service domain is deployed within a virtual private cloud (VPC) with private subnets, isolating it from direct internet access. Security groups enforce restrictive ingress rules, allowing access only from authorized sources. The OpenSearch Service domain uses encryption at rest through AWS Key Management Service (AWS KMS). Data in transit is secured using TLS 1.3 encryption, so that log data, traces, and search queries remain protected during transmission. Service-to-service communication uses AWS Identity and Access Management (IAM) roles and encrypted connections, alleviating the need for hardcoded credentials.
- Access control and authentication: The solution uses Amazon OpenSearch Service fine-grained access control (FGAC) integrated with IAM, where IAM serves as the authentication provider and FGAC handles authorization by mapping IAM roles to OpenSearch backend roles. This approach helps Zeta control access permissions at the index and document level based on tenant requirements and user responsibilities. The data ingestion pipeline implements end-to-end security with Fluentd authenticating to Amazon MSK using IAM roles over encrypted connections. Amazon MSK clusters use encryption in transit and at rest, protecting log data throughout the streaming pipeline. Kubernetes RBAC policies restrict pod-to-pod communication and limit service account permissions.
- Data privacy and tenant isolation: Each tenant’s data is kept logically separated in OpenSearch Service using a tenant ID. CSN implements tenant-aware authentication and authorization with FGAC, restricting users to their authorized tenants’ dashboards and data. Every API endpoint validates tenant context, so that users can only access data within their authorized scope. Importantly, no customer data is captured in the logs; only system metrics are used to build the monitoring system, adhering to banking security standards and best practices. User actions are audited and logged for compliance purposes, with audit trails maintained in accordance with regulatory requirements.
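One way to express tenant isolation with FGAC is a role that uses document-level security (DLS), so that a user mapped to the role only sees documents carrying their tenant ID. The following Python sketch builds such a role definition in the shape accepted by the OpenSearch security plugin roles API. The field name `tenant_id`, the index patterns, and the function name `tenant_role` are assumptions for the example, since the post does not describe Zeta’s actual role definitions.

```python
import json
from typing import List


def tenant_role(tenant_id: str, index_patterns: List[str]) -> dict:
    """Build a read-only role restricted to one tenant's documents."""
    # DLS expects the query DSL serialized as a JSON string.
    dls_query = {"term": {"tenant_id": tenant_id}}
    return {
        "cluster_permissions": [],
        "index_permissions": [{
            "index_patterns": index_patterns,
            "dls": json.dumps(dls_query),
            "allowed_actions": ["read"],
        }],
    }


if __name__ == "__main__":
    # "bank-a" is a hypothetical tenant identifier.
    print(json.dumps(tenant_role("bank-a", ["app-index*", "kong-index*"]), indent=2))
```

An IAM role used by a tenant’s users would then be mapped to this role as a backend role, which matches the split between IAM authentication and FGAC authorization described above.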
This security framework enables the observability system to meet the security requirements of core banking operations while maintaining operational efficiency and regulatory compliance across global industries.
Customer Service Navigator
CSN gives SREs a powerful diagnostics interface engineered for high-efficiency monitoring, deep analysis, and rapid troubleshooting of system performance across distributed environments. The system ingests and processes telemetry data at sub-minute intervals, providing near-real-time metrics, traces, and logs from critical infrastructure components. Actionable, interactive visualizations (such as heatmaps, anomaly graphs, and dependency maps) help SREs quickly detect SLO breaches and drill down to granular root causes, often within a few minutes of an incident.
The following screenshot shows an example service health dashboard in CSN for an Olympus tenant.
The following screenshot shows an example of the API performance insights dashboard in CSN.
Business and technical benefits
The OpenSearch Service-based CSN system provides the following business and technical benefits:
- Manual effort is reduced through automated Index State Management (ISM) and lifecycle policies, so that Zeta’s teams can focus on innovation
- Automated lifecycle policies facilitate seamless retention and archiving of compliance data, reducing the risk of non-compliance
- The system supports log retention for over 10 years to meet regulatory requirements for Zeta’s banking and financial services customers
- Multiple layers of security, including encryption at rest and in transit, FGAC, and tenant isolation, protect customer data and support Zeta’s zero-trust architecture
- By consolidating logs, traces, and metrics from disparate systems into OpenSearch, SRE teams can correlate events more effectively, thereby reducing troubleshooting efforts and achieving an 80% improvement in MTTR
- Archived logs are stored in Amazon S3, which is designed for 99.999999999% (11 nines) data durability, providing long-term data integrity
- Zstandard compression is being implemented to optimize long-term storage costs
Conclusion
CSN’s advanced correlation engine automatically associates related events across microservices, databases, network layers, and infrastructure, significantly streamlining root cause analysis. Integrated alerting and automated runbooks further reduce response times. Since implementing CSN, Zeta has achieved over an 80% reduction in MTTR, with incident response times decreasing from 30+ minutes to under 5 minutes. The service supports seamless multi-tenant monitoring, processes 3 TB of machine-generated data daily, and is architected for petabyte-scale growth. Additionally, CSN helps Zeta meet regulatory requirements for retaining historical logs over multiple years while keeping storage costs under control. This has significantly improved operational resilience, increased service availability, and empowered teams to proactively resolve issues before they affect end users.
Ready to take your organization’s observability capabilities to the next level? Dive into the technical details of OpenSearch Service in the Amazon OpenSearch Service Developer Guide. Visit our new migration hub page for more prescriptive guidance on moving your workloads to OpenSearch Service.
About the authors
Deepesh Dhapola is a Senior Solutions Architect at AWS India, where he architects high-performance, resilient cloud solutions for financial services and fintech organizations. He specializes in using advanced AI technologies, including generative AI, intelligent agents, and the Model Context Protocol (MCP), to design secure, scalable, and context-aware applications. With deep expertise in machine learning and a keen focus on emerging trends, Deepesh drives digital transformation by integrating cutting-edge AI capabilities to enhance operational efficiency and foster innovation for AWS customers. Beyond his technical pursuits, he enjoys quality time with his family and explores creative culinary techniques.
Shashidhar (Shashi) Soppin is an accomplished Enterprise Architect and cloud transformation leader with over 24 years of experience spanning regulated industries and high-growth technology environments. Currently steering strategic initiatives as Lead Architect at Zeta’s CTO office, Shashidhar has helped build and lead world-class engineering teams, driving innovation in cloud, security, and fintech domains. He has architected secure, scalable platforms, scaling user bases by 10x, enabling complex integrations for a major bank’s migration to Zeta’s platforms, and pioneering Zero Trust frameworks that achieved excellent regulatory compliance. A results-driven executive and former DMTS at Wipro, Shashidhar holds 25+ granted patents and has delivered multi-million dollar enterprise deals across domains including AI/ML. Renowned as a published author (“Essentials of Deep Learning”), frequent industry speaker, and hands-on innovator, he combines technical expertise with business acumen, propelling organizations toward robust, future-ready cloud ecosystems and operational excellence. Prior to Wipro, he also worked at IBM-ISL.
Anchal Kansal is a Lead Site Reliability Engineer at Zeta, where she has spent the past 4 years building and scaling reliable, high-performance systems. With deep expertise in OpenSearch, observability platforms, and large-scale infrastructure, she focuses on ensuring uptime, performance, and operational efficiency. Anchal is passionate about solving complex reliability challenges and sharing practical insights with the engineering community.
Manochandra (Mano) is the Site Reliability Engineering (SRE) expert at Zeta, specializing in data management-oriented systems. With a deep understanding of large-scale distributed architectures, he has extensive experience designing, deploying, and maintaining resilient, production-grade OpenSearch systems. Mano is known for his proactive approach in optimizing infrastructure reliability and performance, as well as his ability to troubleshoot complex operational challenges. His expertise spans implementing automation, monitoring, and incident management best practices, making him a go-to resource for ensuring service availability and scalability at Zeta.
Hitesh Subnani is an FSI Solutions Architect at AWS India, where he works with customers to design and build architectures that deliver business value. He specializes in comprehensive observability and analytics systems, enabling organizations to gain deep insights from operational data. With expertise in search and analytics technologies, Hitesh focuses on scalable monitoring systems, real-time dashboards, and compliance-driven architectures for AWS customers in the financial sector.
Tarun Chakraborty is a Sr. Technical Account Manager (TAM) at AWS India, where he partners with leading banks and fintech organizations to accelerate their cloud transformation journeys. With over 15 years of experience in technology and financial services, he serves as a trusted advisor helping customers leverage AWS’s comprehensive suite of services to drive innovation and achieve their business objectives.