Wednesday, August 27, 2025

How AppZen enhances operational efficiency, scalability, and security with Amazon OpenSearch Serverless


AppZen is a leading provider of AI-driven finance automation solutions. The company's core offering centers on an innovative AI platform designed for modern finance teams, featuring expense management, fraud detection, and autonomous accounts payable solutions. AppZen's technology stack uses computer vision, deep learning, and natural language processing (NLP) to automate financial processes and ensure compliance. With this comprehensive solution approach, AppZen has a well-established enterprise customer base that includes one-third of the Fortune 500 companies.

AppZen hosts all its workloads and application infrastructure on Amazon Web Services (AWS), continuously modernizing its technology stack to effectively operationalize and host its applications. Centralized logging, a critical component of this infrastructure, is essential for monitoring and managing operations across AppZen's diverse workloads. As the company experienced rapid growth, the legacy logging solution struggled to keep pace with expanding needs. Consequently, modernizing this system became one of AppZen's top priorities, prompting a comprehensive overhaul to enhance operational efficiency and scalability.

In this blog post, we show how AppZen modernized its central log analytics solution, moving from Elasticsearch to Amazon OpenSearch Serverless with an optimized architecture that meets the requirements described above.

Challenges with the legacy logging solution

With a growing number of business applications and workloads, AppZen had an increasing need for comprehensive operational analytics using log data across its multi-account organization in AWS Organizations. AppZen's legacy logging solution created several key challenges. It lacked the flexibility and scalability to efficiently index the logs and make them available for real-time analysis, which was crucial for monitoring anomalies, optimizing workloads, and ensuring efficient operations.

The legacy logging solution consisted of a 70-node Elasticsearch cluster (with 30 hot nodes and 40 warm nodes) that struggled to keep up with the growing volume of log data as AppZen's customer base expanded and new mission-critical workloads were added. This led to performance issues and increased operational complexity. Maintaining and managing the self-hosted Elasticsearch cluster required frequent software updates and infrastructure patching, resulting in system downtime, data loss, and added operational overhead for the AppZen CloudOps team.

Migrating the data to a patched node cluster took 7 days, far exceeding industry standards and AppZen's operational requirements. This extended downtime introduced data integrity risk and directly impacted the operational availability of the centralized logging system that teams relied on to troubleshoot critical workloads. The system also suffered frequent data loss that impacted real-time metrics monitoring, dashboarding, and alerting, because its application log-collecting agent, Fluent Bit, lacked essential features such as backoff and retry.

AppZen used an NGINX proxy instance to control authorized user access to data hosted on Elasticsearch. Upgrades and patching of this instance introduced frequent system downtime. All user requests are routed through this proxy layer, where each user's permission boundary is evaluated. This added operational overhead for administrators, who had to manage user and group mappings at the proxy layer.

Solution overview

AppZen re-platformed its central log analytics solution with Amazon OpenSearch Serverless and Amazon OpenSearch Ingestion. Amazon OpenSearch Serverless lets you run OpenSearch in the AWS Cloud, so you can run large workloads without configuring, managing, and scaling OpenSearch clusters. You can ingest, analyze, and visualize your time-series data without infrastructure provisioning. OpenSearch Ingestion is a fully managed data collector that simplifies data processing with built-in capabilities to filter, transform, and enrich your logs before analysis.

This new serverless architecture, shown in the following architecture diagram, is cost-optimized, secure, high-performing, and designed to scale efficiently for future business needs. It serves the following use cases:

  • Centrally monitor business operations and perform data analysis for deep insights
  • Application monitoring and infrastructure troubleshooting

Together, OpenSearch Ingestion and OpenSearch Serverless provide a serverless infrastructure capable of running large workloads without configuring, managing, and scaling the cluster. It provides data resilience with persistent buffers that can support the current 2 TB per day pipeline data ingestion requirement. IAM Identity Center support for OpenSearch Serverless helped manage users and their access centrally, eliminating the need for the NGINX proxy layer.

The architecture diagram also shows how separate ingestion pipelines were deployed. This configuration option improves deployment flexibility based on the workload's throughput and latency requirements. In this architecture, Flow-1 is a push-based data source (such as HTTP and OTel logs) where the workload's Fluent Bit DaemonSet is configured to ingest log messages into the OpenSearch Ingestion pipeline. These messages are retained in the pipeline's persistent buffer to provide data durability. After processing, each message is inserted into OpenSearch Serverless.
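A minimal push-based pipeline of this kind might look like the following sketch. The pipeline name, endpoint path, index, and role ARN are illustrative placeholders, not AppZen's actual configuration; note that persistent buffering is enabled as a pipeline setting at creation time rather than in the YAML body.

```yaml
# Hypothetical sketch of a push-based OpenSearch Ingestion pipeline (Flow-1).
# Fluent Bit posts log batches to the HTTP source; the pipeline's persistent
# buffer (enabled at pipeline creation) holds them until they are indexed.
version: "2"
push-pipeline:
  source:
    http:
      # OpenSearch Ingestion requires the path to start with the pipeline name
      path: "/push-pipeline/logs"
  sink:
    - opensearch:
        hosts: ["https://<AossCollectionUrl>"]
        index: "app-logs"
        index_type: custom
        aws:
          sts_role_arn: "<roleArn>"
          region: "us-east-1"
          serverless: true
```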

Flow-2 is a pull-based data source, such as Amazon Simple Storage Service (Amazon S3), for OpenSearch Ingestion, where the workload's Fluent Bit DaemonSets are configured to sync data to an S3 bucket. Using S3 Event Notifications, notifications for newly created log files are sent to Amazon Simple Queue Service (Amazon SQS). OpenSearch Ingestion consumes each notification and processes the record to insert it into OpenSearch Serverless, delegating data durability to the data source. For both Flow-1 and Flow-2, the OpenSearch Ingestion pipelines are configured with a dead-letter queue that records failed ingestion messages to the S3 source, making them accessible for further analysis.
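The S3-to-SQS notification wiring for this pull-based flow can be sketched in CloudFormation as follows. Bucket and queue names are illustrative, and the SQS queue policy that allows S3 to send messages is omitted for brevity.

```yaml
# Hypothetical CloudFormation sketch: new log files written to the bucket
# trigger notifications on the SQS queue that OpenSearch Ingestion polls.
Resources:
  LogQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: us-s3-k8-log
  LogBucket:
    Type: AWS::S3::Bucket
    Properties:
      NotificationConfiguration:
        QueueConfigurations:
          - Event: "s3:ObjectCreated:*"     # notify on every new log file
            Queue: !GetAtt LogQueue.Arn
```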

AWS logging architecture with ingestion flows to OpenSearch Serverless

For service log analytics, AppZen adopted a pull-based approach, as shown in the following figure, where all service logs published to Amazon CloudWatch are migrated to an S3 bucket for further processing. An AWS Lambda processor is triggered whenever a new message lands in the S3 bucket, and the processed message is then uploaded to the S3 bucket for OpenSearch Ingestion. The following diagram shows the OpenSearch Serverless architecture for the service log analytics pipeline.

A log ingestion architecture for service log analytics

Workloads and infrastructure spread across multiple AWS accounts can securely send logs to the central log analytics platform over a private network using virtual private cloud (VPC) peering and AWS PrivateLink endpoints, as shown in the following figure. Both OpenSearch Ingestion and OpenSearch Serverless are provisioned in the same account and Region, with cross-account ingestion enabled for workloads in the other member accounts of the AWS Organizations organization.

Cross-account AWS logging with secure centralized collection

Migration approach

The migration to OpenSearch Serverless and OpenSearch Ingestion involved performance evaluation and fine-tuning the configuration of the logging stack, followed by migration of production traffic to the new platform. The first step was to configure and benchmark the infrastructure for cost-optimized performance.

Parallel ingestion to benchmark OCU capacity requirements

OpenSearch Ingestion scales elastically to meet throughput requirements during workload spikes. Enabling persistent buffering on ingestion pipelines with push-based data sources provided data durability and reliability. Data ingestion pipelines ingest at a rate of 2 TB per day. Due to AppZen's 90-day retention requirement for its ingested data, at any time there is approximately 200 TB of indexed historical data stored in the OpenSearch Serverless collection. To evaluate performance and costs before deploying to production, data sources were configured to ingest data in parallel into the new OpenSearch Serverless environment alongside the existing setup already running in production with Elasticsearch.

To achieve parallel ingestion, AppZen installed another Fluent Bit DaemonSet configured to ingest into the new pipeline. This was done for two reasons: 1) to avoid interruption due to changes to the existing ingestion flow, and 2) new workflows are much more straightforward when the data preprocessing step is offloaded to OpenSearch Ingestion, eliminating the need for custom Lua scripts in Fluent Bit.
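The second DaemonSet's output can be pointed directly at the OpenSearch Ingestion endpoint. The following is a hedged sketch in Fluent Bit's YAML configuration format; the endpoint host, URI, and Region are placeholders, and `aws_auth` enables the SigV4 request signing that OpenSearch Ingestion requires.

```yaml
# Hypothetical Fluent Bit (YAML config format) output for the parallel-ingestion
# DaemonSet: raw container logs go straight to the OpenSearch Ingestion pipeline,
# which now owns the preprocessing previously done with Lua scripts.
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
  outputs:
    - name: http
      match: "*"
      host: <osis-pipeline-endpoint>   # hypothetical pipeline ingestion endpoint
      port: 443
      uri: /push-pipeline/logs
      format: json
      tls: on
      aws_auth: on                     # SigV4-sign requests to the pipeline
      aws_region: us-east-1
      aws_service: osis
```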

Pipeline configuration

The production pipeline configuration was implemented with different strategies based on data source types. Push-based data sources were configured with the persistent buffer enabled for data durability and a minimum of three OpenSearch Compute Units (OCUs) to provide high availability across three Availability Zones. In contrast, pull-based data sources, which used Amazon S3 as their source, didn't require persistent buffering because of the inherent durability of Amazon S3. Both pipeline types were initially configured with a minimum of 3 OCUs and a maximum of 50 OCUs to establish baseline performance metrics. This setup meant the team could monitor and analyze actual workload patterns and then fine-tune worker configurations for optimal OCU utilization. Through continuous monitoring and adjustment, the pipeline configurations were modified and optimized to efficiently handle both daily average loads and peak traffic periods, providing cost-effective and reliable data processing operations.
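This OCU baseline can be expressed in CloudFormation roughly as follows; the pipeline name is a placeholder and the pipeline body is elided.

```yaml
# Hypothetical CloudFormation sketch of the capacity settings described above:
# 3 OCUs minimum (one per Availability Zone), 50 maximum, with the persistent
# buffer enabled for a push-based pipeline.
Resources:
  PushPipeline:
    Type: AWS::OSIS::Pipeline
    Properties:
      PipelineName: push-pipeline
      MinUnits: 3
      MaxUnits: 50
      BufferOptions:
        PersistentBufferEnabled: true
      PipelineConfigurationBody: |
        # pipeline YAML body goes here
```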

For AppZen's throughput requirement in the pull-based approach, they identified that six Amazon S3 workers in the OpenSearch Ingestion pipelines optimally process 1 OCU at 80% efficiency. Following the best practices recommendation, the pipeline was configured to auto scale at this system.cpu.usage.value metric threshold. With each worker capable of processing 10 messages, AppZen identified a cost-optimized maximum of 50 OCUs for its pipelines, capable of processing up to 3,000 messages in parallel. The pipeline configuration shown below supports its peak throughput requirements.

# This is an OpenSearch Ingestion pipeline configuration for processing Kubernetes logs and sending them to OpenSearch Serverless
# Data flow: S3 -> SQS -> OpenSearch Ingestion -> OpenSearch + S3 Archive
# index_name here is kubernetes.namespace_name or the k8 service name
# If the k8 index name is dev: Service1-dev
# If the k8 index name is non-dev: Service1-allenv
version: "2"
entry-pipeline:
  # Source (S3 + SQS)
  # Reads logs from the S3 bucket via SQS notifications
  # 6 workers process JSON files. Deletes S3 objects after processing
  source:
    s3:
      workers: 6
      notification_type: "sqs"
      codec:
        ndjson:
      compression: "none"
      aws:
        region: "us-east-1"
        sts_role_arn: "<roleArn>"
      acknowledgments: true
      delete_s3_objects_on_read: true
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/********1234/us-s3-k8-log"
        visibility_duplication_protection: true
  # Processing pipeline
  # Timestamp: Adds @timestamp from ingestion time
  # Index naming: Sets index_name from the Kubernetes namespace
  processor:
    - date:
        from_time_received: true
        destination: "@timestamp"
    - add_entries:
        entries:
        - key: "index_name"
          value_expression: "/kubernetes_namespace/name"
          add_when: "/index_name == null"
    - delete_entries:
        with_keys: [ "tmp" ]

    # JSON parsing: Parses nested JSON in the log and message fields
    # Failed JSON parsing is skipped silently
    - parse_json:
        source: /log
        handle_failed_events: 'skip_silently'
    - parse_json:
        source: /message
        handle_failed_events: 'skip_silently'

    # Environment detection: Uses grok patterns to extract the environment from namespace names
    - grok:
        grok_when: 'contains(/index_name, "prod-") or contains(/index_name, "prod-k1-") or contains(/index_name, "prod-k2-")'
        match:
          index_name:
            - '%{WORD:prefix}-%{GREEDYDATA:suffix}-%{INT:ignore}'
            - '%{WORD:prefix}-%{GREEDYDATA:suffix}'
    - add_entries:
        entries:
        - key: "/suffix"
          value_expression: "/index_name"
          add_when: "/suffix == null"
        - key: "/labels/environment"
          value_expression: "/prefix"
          add_when: "/prefix != null"
          overwrite_if_key_exists: true
        - key: "/labels/environment"
          value_expression: "/labels_environment"
          add_when: "/labels_environment != null"
          overwrite_if_key_exists: true
  # Routing logic
  # k8: Normal Kubernetes logs
  # k8-debug: DEBUG-level logs (separate retention)
  # unknown: Logs without proper metadata
  routes:
    - k8: '/kubernetes_namespace/name != null or /data_source == "kubernetes"'
    - k8-debug: '/data_source == "kubernetes" and /levelname == "DEBUG"'
    - unknown: '/kubernetes_namespace/name == null and /suffix == null and /log_group == null'
  # Sinks (3 destinations)
  # S3 archive: All logs stored in S3 with date partitioning
  # OpenSearch (normal): ${suffix}-v4-k8 index for regular logs
  # OpenSearch (debug): ${suffix}-v4-k8-debug index for debug logs
  sink:
    - s3:
        aws:
          region: "us-east-1"
          sts_role_arn: "<roleArn>"
        bucket: <logS3Bucket>
        object_key:
          path_prefix: 'us/${getMetadata("s3-prefix")}/%{yyyy}/%{MM}/%{dd}/'
        codec:
          json:
        compression: "none"
        threshold:
          maximum_size: 20mb
          event_collect_timeout: PT10M
    - opensearch:
        hosts: ["https://<AossDomainUrl>"]
        index: "${/suffix}-v4-k8"
        index_type: custom
        # Max 15 retries for OpenSearch operations
        max_retries: 15
        aws:
          # IAM role that the pipeline assumes to access the collection sink
          sts_role_arn: "<roleArn>"
          region: "us-east-1"
          serverless: true
          serverless_options:
            network_policy_name: "prod-logging-network"
        # Error handling:
        # Dead-letter queue (DLQ) to S3 for failed OpenSearch writes
        dlq:
          s3:
            bucket: "<dlqS3Bucket>"
            key_path_prefix: "/k8/"
            region: "us-east-1"
            sts_role_arn: "<roleArn>"
        routes:
          - k8
    - opensearch:
        hosts: ["https://<AossDomainUrl>"]
        index: "${/suffix}-v4-k8-debug"
        index_type: custom
        max_retries: 15
        aws:
          # IAM role that the pipeline assumes to access the collection sink
          sts_role_arn: "<roleArn>"
          region: "us-east-1"
          serverless: true
          serverless_options:
            network_policy_name: "prod-logging-network"
        dlq:
          s3:
            bucket: "<dlqS3Bucket>"
            key_path_prefix: "/k8-debug/"
            region: "us-east-1"
            sts_role_arn: "<roleArn>"
        routes:
          - k8-debug
    - opensearch:
        hosts: ["https://<AossDomainUrl>"]
        index: "unknown"
        index_type: custom
        max_retries: 15
        aws:
          # IAM role that the pipeline assumes to access the collection sink
          sts_role_arn: "<roleArn>"
          region: "us-east-1"
          serverless: true
          serverless_options:
            network_policy_name: "prod-logging-network"
        dlq:
          s3:
            bucket: "<dlqS3Bucket>"
            key_path_prefix: "/unknown/"
            region: "us-east-1"
            sts_role_arn: "<roleArn>"
        routes:
          - unknown

Indexing strategy

When working with a search engine, understanding index and shard management is crucial. Indexes and their corresponding shards consume memory and CPU resources to maintain metadata. A key challenge emerges when a system has numerous small shards, because it leads to higher resource consumption and operational overhead. In the traditional approach, you typically create indices at the microservice level for each environment (prod, qa, and dev). For example, indices would be named like prod-k1-service or prod-k2-service, where k1 and k2 represent different microservices. With hundreds of services and daily index rotation, this approach results in thousands of indices, making management complex and resource intensive.

When implementing OpenSearch Serverless, you should adopt a consolidated indexing strategy that moves away from microservice-level index creation. Rather than creating individual indices like prod-k1-service and prod-k2-service for each microservice and environment, you should consolidate the data into broader environment-based indices such as prod-service, which contains all service data for the production environment. This consolidation is important because OpenSearch Serverless scales based on resources and has specific limits on the number of shards per OCU. This means that a higher number of small shards leads to higher OCU consumption.
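Such consolidation can be expressed directly in the ingestion pipeline. The following hypothetical processor fragment (the field names are illustrative, not AppZen's actual schema) derives one environment-level index name instead of a per-microservice one:

```yaml
# Hypothetical processor fragment: route all production microservice logs into
# one consolidated "prod-service" index rather than prod-k1-service, prod-k2-service, ...
processor:
  - add_entries:
      entries:
        - key: "index_name"
          format: "${/labels/environment}-service"   # e.g. "prod" -> "prod-service"
          overwrite_if_key_exists: true
```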

However, although this consolidated approach can significantly reduce operational costs and simplify management through built-in data lifecycle policies, it presents a notable challenge for multi-tenant scenarios. Organizations with strict security requirements, where different teams need access to specific indices only, might find this consolidated approach difficult to implement. For such cases, a more granular indexing approach might be necessary to maintain proper access control, even though it can result in higher resource consumption.

By carefully evaluating your security requirements and access control needs, you can choose between a consolidated approach for optimized resource utilization or a more granular approach that better supports fine-grained access control. Both approaches are supported in OpenSearch Serverless, so you can balance resource optimization with security requirements based on your specific use case.

Cost optimization

OpenSearch Ingestion allocates some OCUs from the configured pipeline capacity for persistent buffering, which provides data durability. While monitoring, AppZen observed higher OCU utilization for this persistent buffer when processing high-throughput workloads. To optimize this capacity configuration, AppZen decided to classify its workloads into push-based and pull-based categories depending on their throughput and latency requirements. Achieving this meant creating new pipelines to operate these flows in parallel, as shown in the architecture diagram earlier in the post. Fluent Bit agent collector configurations were modified accordingly based on the workload classification.

Depending on the cost and performance requirements of the workload, AppZen adopted the appropriate ingestion flow. For low-latency, low-throughput workload requirements, AppZen chose the push-based approach. For high-throughput workload requirements, AppZen adopted the pull-based approach, which helped lower the persistent buffer OCU utilization by delegating durability to the data source. In the pull-based approach, AppZen further optimized storage cost by configuring the pipeline to automatically delete processed files from the S3 bucket after successful ingestion.

Monitoring and dashboards

One of the key design principles for operational excellence in the cloud is to implement observability for actionable insights. This helps you gain a comprehensive understanding of your workloads so you can improve performance, reliability, and cost. Both OpenSearch Serverless and OpenSearch Ingestion publish all metrics and log data to Amazon CloudWatch. After identifying key operational OpenSearch Serverless metrics and OpenSearch Ingestion pipeline metrics, AppZen set up CloudWatch alarms to send a notification when certain defined thresholds are met. The following screenshot shows the number of OCUs used to index and search collection data.
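As one hedged example, an alarm on OCU consumption might be defined as follows. The metric namespace and name reflect the account-level OCU metrics that OpenSearch Serverless publishes to CloudWatch; the threshold, periods, and topic ARN are illustrative assumptions, not AppZen's actual settings.

```yaml
# Hypothetical CloudWatch alarm: notify when indexing OCU consumption stays
# high, indicating the workload is approaching its configured capacity limit.
Resources:
  IndexingOcuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: aoss-indexing-ocu-high
      Namespace: AWS/AOSS
      MetricName: IndexingOCU
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 3
      Threshold: 40          # illustrative threshold
      ComparisonOperator: GreaterThanOrEqualToThreshold
      AlarmActions:
        - <snsTopicArn>
```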

OpenSearch Serverless capacity management dashboard showing OCU usage graphs

The following screenshot shows the number of Ingestion OCUs in use by the pipeline.

The following screenshot shows the percentage of available CPU utilization for the OCUs.

The following screenshot shows the percent utilization of the buffer based on the number of records in the buffer.

Conclusion

AppZen successfully modernized its logging infrastructure by migrating to a serverless architecture using Amazon OpenSearch Serverless and OpenSearch Ingestion. By adopting this new serverless solution, AppZen eliminated the operational overhead of 7 days of data migration effort during each quarterly upgrade and patching cycle of the Kubernetes cluster hosting Elasticsearch nodes. Also, with the serverless approach, AppZen was able to avoid index mapping conflicts by using index templates and a new indexing strategy. This helped the team save an average of 5.2 hours per week of operational effort and instead use the time to focus on other priority business challenges. AppZen achieved a better security posture through centralized access controls with OpenSearch Serverless, eliminating the overhead of managing a duplicate set of user permissions at the proxy layer. The new solution helped AppZen handle growing data volume and build real-time operational analytics while optimizing cost and enhancing scalability and resiliency. AppZen optimized costs and performance by classifying workloads into push-based and pull-based flows, so it could choose the appropriate ingestion approach based on latency and throughput requirements.

With this modernized logging solution, AppZen is well positioned to efficiently monitor its business operations, perform in-depth data analysis, and effectively monitor and troubleshoot its applications as it continues to grow. Looking ahead, AppZen plans to use OpenSearch Serverless as a vector database, incorporating Amazon S3 Vectors, generative AI, and foundation models (FMs) to enhance operational tasks using natural language processing.

To implement a similar logging solution for your organization, begin by exploring the AWS documentation on migrating to Amazon OpenSearch Serverless and setting up OpenSearch Serverless. For guidance on creating ingestion pipelines, refer to the AWS documentation on OpenSearch Ingestion to begin modernizing your logging infrastructure.


About the authors

Prashanth Dudipala is a DevOps Architect at AppZen, where he helps build scalable, secure, and automated cloud platforms on AWS. He is passionate about simplifying complex systems, enabling teams to move faster, and sharing practical insights with the cloud community.

Madhuri Andhale is a DevOps Engineer at AppZen, focused on building and optimizing cloud-native infrastructure. She is passionate about managing efficient CI/CD pipelines, streamlining infrastructure and deployments, modernizing systems, and enabling development teams to deliver faster and more reliably. Outside of work, Madhuri enjoys exploring emerging technologies, traveling to new places, experimenting with new recipes, and finding creative ways to solve everyday challenges.

Manoj Gupta is a Senior Solutions Architect at AWS, based in San Francisco. With over 4 years of experience at AWS, he works closely with customers like AppZen to build optimized cloud architectures. His primary focus areas are data, AI/ML, and security, helping organizations modernize their technology stacks. Outside of work, he enjoys outdoor activities and traveling with family.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.
