Amazon OpenSearch Service securely unlocks real-time search, monitoring, and analysis of business and operational data for use cases like application monitoring, log analytics, observability, and website search.
In this post, we examine the OR1 instance type, an OpenSearch-optimized instance introduced on November 29, 2023.
OR1 is an instance type for Amazon OpenSearch Service that provides a cost-effective way to store large amounts of data. A domain with OR1 instances uses Amazon Elastic Block Store (Amazon EBS) volumes for primary storage, with data copied synchronously to Amazon Simple Storage Service (Amazon S3) as it arrives. OR1 instances provide increased indexing throughput with high durability.
To learn more about OR1, see the introductory blog post.
While actively writing to an index, we recommend that you keep one replica. However, you can switch to zero replicas after a rollover, once the index is no longer being actively written.
This can be done safely because the data is persisted in Amazon S3 for durability.
Note that in case of a node failure and replacement, your data will be automatically restored from Amazon S3, but may be partially unavailable during the restore operation, so you shouldn't use this approach for cases where searches on non-actively-written indexes require high availability.
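For example, dropping the replica count on a rolled-over backing index is a single settings update (the index name below is a hypothetical placeholder, not one from our test):

```
PUT /.ds-logs-000001/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}
```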
Purpose
In this blog post, we'll explore how OR1 affects the performance of OpenSearch workloads.
By providing segment replication, OR1 instances save CPU cycles by indexing only on the primary shards. As a result, the nodes can index more data with the same amount of compute, or use fewer resources for indexing and thus have more available for search and other operations.
For this post, we're going to consider an indexing-heavy workload and do some performance testing.
Traditionally, Amazon Elastic Compute Cloud (Amazon EC2) R6g instances are a high-performance choice for indexing-heavy workloads, relying on Amazon EBS storage. Im4gn instances provide local NVMe SSDs for high-throughput, low-latency disk writes.
We will compare OR1 indexing performance against these two instance types, focusing on indexing performance only for the scope of this blog.
Setup
For our performance testing, we set up several components, as shown in the following figure:
For the testing process:
The index mapping, which is part of our initialization step, is as follows:
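The exact mapping isn't reproduced here; as a purely illustrative sketch (the template name, index patterns, and fields are placeholders, not the ones from our test), a data stream index template using `flat_object` could look like this:

```json
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "data_stream": {},
  "template": {
    "mappings": {
      "dynamic": false,
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" },
        "attributes": { "type": "flat_object" }
      }
    }
  }
}
```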
As you possibly can see, we’re utilizing a information stream to simplify the rollover configuration and maintain the utmost major shard measurement below 50 GiB, as per greatest practices.
We optimized the mapping to keep away from any pointless indexing exercise and use the flat_object subject kind to keep away from subject mapping explosion.
For reference, the Index State Management (ISM) policy we used is as follows:
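The original policy isn't shown here; a minimal sketch that rolls over on primary shard size (the policy ID and index patterns are placeholders) might be:

```json
PUT _plugins/_ism/policies/rollover-policy
{
  "policy": {
    "description": "Roll over when a primary shard reaches 50 GiB",
    "default_state": "rollover",
    "states": [
      {
        "name": "rollover",
        "actions": [
          { "rollover": { "min_primary_shard_size": "50gb" } }
        ],
        "transitions": []
      }
    ],
    "ism_template": [
      { "index_patterns": ["logs-*"], "priority": 100 }
    ]
  }
}
```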
Our average document size is 1.6 KiB and the bulk size is 4,000 documents per bulk request, which makes roughly 6.26 MiB per bulk (uncompressed).
Testing protocol
The protocol parameters are as follows:
- Number of data nodes: 6 or 12
- Job parallelism: 75, 40
- Primary shard count: 12, 48, 96 (for 12 nodes)
- Number of replicas: 1 (total of 2 copies)
- Instance types (each with 16 vCPUs):
  - or1.4xlarge.search
  - r6g.4xlarge.search
  - im4gn.4xlarge.search
| Cluster | Instance type | vCPU | RAM (GiB) | JVM heap (GiB) |
|---|---|---|---|---|
| or1-target | or1.4xlarge.search | 16 | 128 | 32 |
| im4gn-target | im4gn.4xlarge.search | 16 | 64 | 32 |
| r6g-target | r6g.4xlarge.search | 16 | 128 | 32 |
Note that the im4gn cluster has half the memory of the other two, but each environment still has the same JVM heap size of approximately 32 GiB.
Performance testing results
For the performance testing, we started with 75 parallel jobs and 750 batches of 4,000 documents per client (a total of 225 million documents). We then adjusted the number of shards, data nodes, replicas, and jobs.
Configuration 1: 6 data nodes, 12 primary shards, 1 replica
For this configuration, with 6 data nodes, 12 primary shards, and 1 replica, we observed the following performance:
| Cluster | CPU utilization | Time taken | Indexing rate | Throughput |
|---|---|---|---|---|
| or1-target | 65-80% | 24 min | 156 kdoc/s | 243 MiB/s |
| im4gn-target | 89-97% | 34 min | 110 kdoc/s | 172 MiB/s |
| r6g-target | 88-95% | 34 min | 110 kdoc/s | 172 MiB/s |
As highlighted in this table, the im4gn and r6g clusters show very high CPU utilization, triggering admission control, which rejects documents.
The OR1 cluster shows sustained CPU utilization below 80 percent, which is a good target.
Things to keep in mind:
- In production, don't forget to retry indexing with exponential backoff to avoid losing unindexed documents because of intermittent rejections.
- The bulk indexing operation returns 200 OK but can have partial failures. The body of the response must be checked to validate that all the documents were indexed successfully.
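To illustrate both points, here is a minimal Python sketch (not the harness used in this test; `send` stands in for whatever client call performs the HTTP `_bulk` request and returns the parsed JSON body):

```python
import time

def failed_docs(bulk_response):
    """Return the per-document failures from an OpenSearch _bulk response.

    Even when _bulk returns HTTP 200, individual documents may be
    rejected; the body flags them via "errors" and the "items" list.
    """
    if not bulk_response.get("errors"):
        return []
    failures = []
    for item in bulk_response["items"]:
        # Each item is keyed by its action type, e.g. {"index": {...}}.
        result = next(iter(item.values()))
        if result.get("status", 200) >= 300:
            failures.append(result)
    return failures

def bulk_with_backoff(send, docs, max_retries=5, base_delay=1.0):
    """Retry a bulk request with exponential backoff on partial failures.

    Returns the failures of the last attempt ([] on success).
    """
    for attempt in range(max_retries):
        failures = failed_docs(send(docs))
        if not failures:
            return []
        time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        # A production client would resend only the failed documents.
    return failures
```

In production you would also cap the total retry time and add jitter to the delay so that many clients don't retry in lockstep.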
By reducing the number of parallel jobs from 75 to 40, while maintaining 750 batches of 4,000 documents per client (a total of 120 million documents), we get the following:
| Cluster | CPU utilization | Time taken | Indexing rate | Throughput |
|---|---|---|---|---|
| or1-target | 25-60% | 20 min | 100 kdoc/s | 156 MiB/s |
| im4gn-target | 75-93% | 19 min | 105 kdoc/s | 164 MiB/s |
| r6g-target | 77-90% | 20 min | 100 kdoc/s | 156 MiB/s |
The throughput and CPU utilization decreased, but the CPU remains high on Im4gn and R6g, while OR1 shows more CPU capacity to spare.
Configuration 2: 6 data nodes, 48 primary shards, 1 replica
For this configuration, we increased the number of primary shards from 12 to 48, which provides more parallelism for indexing:
| Cluster | CPU utilization | Time taken | Indexing rate | Throughput |
|---|---|---|---|---|
| or1-target | 60-80% | 21 min | 178 kdoc/s | 278 MiB/s |
| im4gn-target | 67-95% | 34 min | 110 kdoc/s | 172 MiB/s |
| r6g-target | 70-88% | 37 min | 101 kdoc/s | 158 MiB/s |
The indexing throughput increased for OR1, but Im4gn and R6g didn't see an improvement because their CPU utilization is still very high.
Reducing the parallel jobs to 40 while keeping 48 primary shards, we can see that OR1 comes under a little more pressure, as its minimum CPU utilization increases compared to the 12-primary-shard run, and the CPU for R6g looks much better. For Im4gn, however, the CPU is still high.
| Cluster | CPU utilization | Time taken | Indexing rate | Throughput |
|---|---|---|---|---|
| or1-target | 40-60% | 16 min | 125 kdoc/s | 195 MiB/s |
| im4gn-target | 80-94% | 18 min | 111 kdoc/s | 173 MiB/s |
| r6g-target | 70-80% | 21 min | 95 kdoc/s | 148 MiB/s |
Configuration 3: 12 data nodes, 96 primary shards, 1 replica
For this configuration, we started from the original configuration and added more compute capacity, moving from 6 nodes to 12 and increasing the number of primary shards to 96.
| Cluster | CPU utilization | Time taken | Indexing rate | Throughput |
|---|---|---|---|---|
| or1-target | 40-60% | 18 min | 208 kdoc/s | 325 MiB/s |
| im4gn-target | 74-90% | 20 min | 187 kdoc/s | 293 MiB/s |
| r6g-target | 60-78% | 24 min | 156 kdoc/s | 244 MiB/s |
OR1 and R6g perform well with CPU utilization below 80 percent, with OR1 delivering 33 percent better performance at 30 percent lower CPU utilization compared to R6g.
Im4gn is still at 90 percent CPU, but its performance is also very good.
Reducing the number of parallel jobs from 75 to 40, we get:
| Cluster | CPU utilization | Time taken | Indexing rate | Throughput |
|---|---|---|---|---|
| or1-target | 40-60% | 11 min | 182 kdoc/s | 284 MiB/s |
| im4gn-target | 70-90% | 11 min | 182 kdoc/s | 284 MiB/s |
| r6g-target | 60-77% | 12 min | 167 kdoc/s | 260 MiB/s |
Reducing the number of parallel jobs from 75 to 40 brought the OR1 and Im4gn instances on par, with R6g very close.
Interpretation
OR1 instances speed up indexing because only the primary shards need to be written, while replicas are produced by copying segments. While being more performant than Im4gn and R6g instances, OR1 also shows lower CPU utilization, which leaves room for additional load (such as search) or for reducing the cluster size.
We can compare a 6-node OR1 cluster with 48 primary shards, indexing at 178 thousand documents per second, to a 12-node Im4gn cluster with 96 primary shards, indexing at 187 thousand documents per second, or to a 12-node R6g cluster with 96 primary shards, indexing at 156 thousand documents per second.
The OR1 cluster performs almost as well as the larger Im4gn cluster, and better than the larger R6g cluster.
How to size when using OR1 instances
As you can see in the results, OR1 instances can process more data at higher throughput rates. However, when increasing the number of primary shards, they don't perform as well because of the remote-backed storage.
To get the best throughput from the OR1 instance type, you can use larger batch sizes than usual and use an Index State Management (ISM) policy to roll over your index based on size, in order to effectively limit the number of primary shards per index. You can also increase the number of connections, because the OR1 instance type can handle more parallelism.
For search, OR1 doesn't directly affect search performance. However, as you can see, CPU utilization is lower on OR1 instances than on Im4gn and R6g instances. That enables either more activity (search and ingest), or the opportunity to reduce the instance size or count, which can result in a cost reduction.
Conclusion and recommendations for OR1
The new OR1 instance type gives you more indexing power than the other instance types. This is important for indexing-heavy workloads, where you index in daily batches or have a high sustained throughput.
The OR1 instance type also enables cost reduction, because its price for performance is 30 percent better than existing instance types. When adding more than one replica, the price for performance improves further, because the CPU is barely impacted on an OR1 instance, whereas other instance types would see their indexing throughput decrease.
Check out the complete instructions for optimizing your workload for indexing in this re:Post article.
About the author
Cédric Pelvet is a Principal AWS Specialist Solutions Architect. He helps customers design scalable solutions for real-time data and search workloads. In his free time, he enjoys learning new languages and practicing the violin.