OpenSearch is an open source, distributed search engine suitable for a wide array of use cases such as ecommerce search, enterprise search (content management search, document search, knowledge management search, and so on), site search, application search, and semantic search. It’s also an analytics suite that you can use to perform interactive log analytics, real-time application monitoring, security analytics, and more. Like Apache Solr, OpenSearch provides search across document sets. OpenSearch also includes capabilities to ingest and analyze data. Amazon OpenSearch Service is a fully managed service that you can use to deploy, scale, and monitor OpenSearch in the AWS Cloud.
Many organizations are migrating their Apache Solr based search solutions to OpenSearch. The main driving factors include lower total cost of ownership, scalability, stability, improved ingestion connectors (such as Data Prepper, Fluent Bit, and OpenSearch Ingestion), elimination of external cluster managers like Zookeeper, enhanced reporting, and rich visualizations with OpenSearch Dashboards.
We recommend approaching a Solr to OpenSearch migration with a full refactor of your search solution to optimize it for OpenSearch. While both Solr and OpenSearch use Apache Lucene for core indexing and query processing, the systems exhibit different characteristics. By planning and running a proof of concept, you can ensure the best results from OpenSearch. This blog post dives into the strategic considerations and steps involved in migrating from Solr to OpenSearch.
Key differences
Solr and OpenSearch Service share fundamental capabilities delivered by Apache Lucene. However, there are some key differences in terminology and functionality between the two:
- Collection and index: In OpenSearch, a collection is called an index.
- Shard and replica: Both Solr and OpenSearch use the terms shard and replica.
- API-driven interactions: All interactions in OpenSearch are API-driven, eliminating the need for manual file changes or Zookeeper configurations. When creating an OpenSearch index, you define the mapping (equivalent to the schema) and the settings (equivalent to solrconfig) as part of the index creation API call.
Having set the stage with the basics, let’s dive into the four key components and how each of them can be migrated from Solr to OpenSearch.
Collection to index
A collection in Solr is called an index in OpenSearch. Like a Solr collection, an index in OpenSearch also has shards and replicas.
Although the shard and replica concept is similar in both search engines, you can use this migration as a window to adopt a better sharding strategy. Size your OpenSearch shards, replicas, and index by following the shard strategy best practices.
As part of the migration, reconsider your data model. In examining your data model, you can find efficiencies that dramatically improve your search latencies and throughput. Poor data modeling doesn’t only result in search performance problems but extends to other areas. For example, you might find it challenging to construct an effective query to implement a particular feature. In such cases, the solution often involves modifying the data model.
Differences: Solr allows primary shard and replica shard collocation on the same node. OpenSearch doesn’t place the primary and replica on the same node. OpenSearch Service zone awareness can automatically ensure that shards are distributed to different Availability Zones (data centers) to further increase resiliency.
The OpenSearch and Solr notions of replica are different. In OpenSearch, you define a primary shard count using number_of_shards, which determines the partitioning of your data. You then set a replica count using number_of_replicas. Each replica is a copy of all the primary shards. So, if you set number_of_shards to 5 and number_of_replicas to 1, you will have 10 shards (5 primary shards and 5 replica shards). Setting replicationFactor=1 in Solr yields one copy of the data (the primary).
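The arithmetic behind the two replica notions can be sketched in a few lines. This is only an illustration of the counting rule described above; the function names are hypothetical, and the parameter names mirror the OpenSearch index settings number_of_shards and number_of_replicas and the Solr replicationFactor parameter.

```python
def opensearch_total_shards(number_of_shards: int, number_of_replicas: int) -> int:
    """Total shards in an OpenSearch index.

    Each replica is a full copy of every primary shard, so the total is
    the primaries plus one extra copy of them per replica.
    """
    if number_of_shards < 1 or number_of_replicas < 0:
        raise ValueError("need at least 1 primary shard and 0 or more replicas")
    return number_of_shards * (1 + number_of_replicas)


def solr_total_copies(num_shards: int, replication_factor: int) -> int:
    """Total shard copies in a Solr collection.

    Solr counts the first copy inside replicationFactor, so
    replicationFactor=1 means a single copy of each shard.
    """
    if num_shards < 1 or replication_factor < 1:
        raise ValueError("need at least 1 shard and a replicationFactor of 1 or more")
    return num_shards * replication_factor


print(opensearch_total_shards(5, 1))  # 5 primaries + 5 replicas
print(solr_total_copies(1, 1))        # one copy of the data
```

In other words, a Solr replicationFactor of n corresponds roughly to number_of_replicas = n - 1 in OpenSearch.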
For example, in Solr you might create a collection called test with one shard and no replicas.
In OpenSearch, you might create an index called test with five shards and one replica.
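A rough sketch of what those two requests could look like follows. It only builds and prints the requests rather than sending them. The Solr host and port are assumptions (a local node on the default port), and the helper names are hypothetical. The Solr call uses the Collections API CREATE action, and the OpenSearch call is a create index request with a settings body.

```python
import json
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumption: local Solr node, default port


def solr_create_collection_url(name: str, num_shards: int, replication_factor: int) -> str:
    """Build a Solr Collections API CREATE URL (replicationFactor=1 means no extra copies)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "wt": "json",
    }
    return f"{SOLR_BASE}/admin/collections?{urlencode(params)}"


def opensearch_create_index_request(name: str, shards: int, replicas: int):
    """Return (method, path, body) for an OpenSearch create index call."""
    body = {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
    return "PUT", f"/{name}", body


print(solr_create_collection_url("test", 1, 1))

method, path, body = opensearch_create_index_request("test", 5, 1)
print(method, path)
print(json.dumps(body, indent=2))
```

You could paste the printed method, path, and body into Dev Tools in OpenSearch Dashboards, or send them with any HTTP client.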
Schema to mapping
In Solr, schema.xml or managed-schema has all the field definitions, dynamic fields, and copy fields along with field types (text analyzers, tokenizers, or filters). You use the schema API to manage the schema, or you can run in schema-less mode.
OpenSearch has dynamic mapping, which behaves like Solr in schema-less mode. It’s not necessary to create an index beforehand to ingest data. By indexing data with a new index name, you create the index with OpenSearch managed service default settings (for example: "number_of_shards": 5, "number_of_replicas": 1) and the mapping based on the data that’s indexed (dynamic mapping).
We strongly recommend you opt for a pre-defined strict mapping. OpenSearch sets the schema based on the first value it sees in a field. If a stray numeric value is the first value for what is really a string field, OpenSearch will incorrectly map the field as numeric (integer, for example). Subsequent indexing requests with string values for that field will fail with an incorrect mapping exception. You know your data and you know your field types, so you will benefit from setting the mapping directly.
Tip: Consider performing a sample indexing to generate the initial mapping and then refine and tidy up the mapping to accurately define the actual index. This approach helps you avoid manually constructing the mapping from scratch.
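As a minimal sketch of a pre-defined strict mapping, the request body below sets "dynamic": "strict" so that documents containing unmapped fields are rejected instead of silently extending the mapping. The field names (sku, title, price) are hypothetical, and the helper is only a local illustration of what strict means, not how OpenSearch implements the check.

```python
import json

# Hypothetical product fields; replace them with your own schema.
STRICT_INDEX_BODY = {
    "mappings": {
        "dynamic": "strict",  # reject documents that contain unmapped fields
        "properties": {
            "sku": {"type": "keyword"},
            "title": {"type": "text"},
            "price": {"type": "float"},
        },
    }
}


def unmapped_fields(doc: dict, index_body: dict) -> list:
    """Local illustration: top-level fields a strict mapping would not accept."""
    known = index_body["mappings"]["properties"]
    return sorted(field for field in doc if field not in known)


print(json.dumps(STRICT_INDEX_BODY, indent=2))
print(unmapped_fields({"sku": "A-1", "title": "Blue shirt", "color": "blue"}, STRICT_INDEX_BODY))
```

The body would be sent with the create index call (for example, PUT /products, where products is an assumed index name).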
For Observability workloads, you should consider using Simple Schema for Observability. Simple Schema for Observability (also known as ss4o) is a standard for conforming to a common and unified observability schema. With the schema in place, Observability tools can ingest, automatically extract, and aggregate data and create custom dashboards, making it easier to understand the system at a higher level.
Many of the field types (data types), tokenizers, and filters are the same in both Solr and OpenSearch. After all, both use Lucene’s Java search library at their core.
Let’s look at an example:
Notable things in OpenSearch compared to Solr:
- _id is always the uniqueKey and cannot be defined explicitly, because it’s always present.
- Explicitly enabling multivalued isn’t necessary because any OpenSearch field can contain zero or more values.
- The mapping and the analyzers are defined during index creation. New fields can be added and certain mapping parameters can be updated later. However, deleting a field isn’t possible. The Reindex API can overcome this problem: you can use it to index data from one index to another.
- By default, analyzers are for both index and query time. For some less-common scenarios, you can change the query analyzer at search time (in the query itself), which will override the analyzer defined in the index mapping and settings.
- Index templates are also a great way to initialize new indexes with predefined mappings and settings. For example, if you continuously index log data (or any time-series data), you can define an index template so that all the indices have the same number of shards and replicas. Index templates can also be used for dynamic mapping control and component templates.
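Two of the points above, index templates and the Reindex API, can be sketched as request bodies. The template name, index pattern, field names, and index names below are hypothetical; the endpoints used are the _index_template and _reindex APIs. The code only builds and prints the requests.

```python
import json


def index_template_request(template_name: str, pattern: str, shards: int, replicas: int):
    """Return (method, path, body) for an index template applied to matching new indexes."""
    body = {
        "index_patterns": [pattern],
        "template": {
            "settings": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            },
            "mappings": {
                "dynamic": "strict",  # dynamic mapping control lives in the template too
                "properties": {
                    "@timestamp": {"type": "date"},
                    "message": {"type": "text"},
                },
            },
        },
    }
    return "PUT", f"/_index_template/{template_name}", body


def reindex_request(source_index: str, dest_index: str):
    """Return (method, path, body) that copies documents into an index with a new mapping."""
    body = {"source": {"index": source_index}, "dest": {"index": dest_index}}
    return "POST", "/_reindex", body


for method, path, body in (
    index_template_request("logs-template", "logs-*", 3, 1),
    reindex_request("products-v1", "products-v2"),
):
    print(method, path)
    print(json.dumps(body, indent=2))
```

A common pattern when a field has to be dropped or retyped is to create the destination index with the corrected mapping first, then reindex into it.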
Look for opportunities to optimize the search solution. For instance, if the analysis reveals that the city field is solely used for filtering rather than searching, consider changing its field type to keyword instead of text to eliminate unnecessary text processing. Another optimization could involve disabling doc_values for the user_token field if it’s only intended for display purposes. doc_values are disabled by default for the text datatype.
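The two optimizations can be sketched as a mapping fragment plus a filter query. The city and user_token field names come from the text; mapping user_token as keyword, the example city value, and the variable names are assumptions for illustration.

```python
import json

OPTIMIZED_INDEX_BODY = {
    "mappings": {
        "properties": {
            # Filter-only field: keyword skips analysis and supports exact matching.
            "city": {"type": "keyword"},
            # Display-only field (assumed keyword): no sorting or aggregations
            # are needed, so doc_values can be turned off.
            "user_token": {"type": "keyword", "doc_values": False},
        }
    }
}

# An exact-match filter on the keyword field; filter clauses do not affect scoring.
CITY_FILTER_QUERY = {
    "query": {
        "bool": {
            "filter": [{"term": {"city": "Munich"}}]
        }
    }
}

print(json.dumps(OPTIMIZED_INDEX_BODY, indent=2))
print(json.dumps(CITY_FILTER_QUERY, indent=2))
```

Keep in mind that a field with doc_values disabled can no longer be used for sorting or aggregations, so apply this only to fields that are truly display-only.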
SolrConfig to settings
In Solr, solrconfig.xml carries the collection configuration. It holds all sorts of configuration, covering everything from index location and formatting, caching, codec factory, circuit breakers, commits, and tlogs all the way up to slow query config, request handlers, the update processing chain, and so on.
Let’s look at an example:
Notable things in OpenSearch compared to Solr:
- Both OpenSearch and Solr have the BEST_SPEED codec as default (LZ4 compression algorithm). Both offer BEST_COMPRESSION as an alternative. Additionally, OpenSearch offers zstd and zstd_no_dict. Benchmarking for different compression codecs is also available.
- For near real-time search, refresh_interval needs to be set. The default is 1 second, which is good enough for most use cases. We recommend increasing refresh_interval to 30 or 60 seconds to improve indexing speed and throughput, especially for batch indexing.
- Max boolean clause is a static setting, set at node level using the indices.query.bool.max_clause_count setting.
- You don’t need an explicit requestHandler. All searches use the _search or _msearch endpoint. If you’re used to using the requestHandler with default values, then you can use search templates.
- If you’re used to using the /sql requestHandler, OpenSearch also lets you use SQL syntax for querying and has a Piped Processing Language.
- Spellcheck (also known as Did-you-mean), QueryElevation (known as pinned_query in OpenSearch), and highlighting are all supported during query time. You don’t need to explicitly define search components.
- Most API responses are limited to JSON format, with CAT APIs as the only exception. In cases where Velocity or XSLT is used in Solr, it must be managed on the application layer. CAT APIs respond in JSON, YAML, or CBOR formats.
- For the updateRequestProcessorChain, OpenSearch provides the ingest pipeline, allowing the enrichment or transformation of data before indexing. Multiple processor stages can be chained to form a pipeline for data transformation. Processors include GrokProcessor, CSVParser, JSONProcessor, KeyValue, Rename, Split, HTMLStrip, Drop, ScriptProcessor, and more. However, it’s strongly recommended to do the data transformation outside OpenSearch. The ideal place to do that would be at OpenSearch Ingestion, which provides a proper framework and various out-of-the-box filters for data transformation. OpenSearch Ingestion is built on Data Prepper, which is a server-side data collector capable of filtering, enriching, transforming, normalizing, and aggregating data for downstream analytics and visualization.
- OpenSearch also introduced search pipelines, similar to ingest pipelines but tailored for search time operations. Search pipelines make it easier for you to process search queries and search results within OpenSearch. Currently available search processors include filter query, neural query enricher, normalization, rename field, scriptProcessor, and personalize search ranking, with more to come.
- The following image shows how to set refresh_interval and slowlog. It also shows you the other possible settings.
- Slow logs can be set like the following image but with much more precision, with separate thresholds for the query and fetch phases.
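The codec, refresh_interval, and slow log points above can be sketched as index settings. The threshold values, the default arguments, and the index name are assumptions chosen for illustration, not recommendations. Because index.codec is a static index setting, it appears in the create index body, while refresh_interval can also be changed later through the _settings endpoint.

```python
import json


def create_index_settings_body(refresh_interval: str = "30s",
                               codec: str = "best_compression",
                               query_warn: str = "1s",
                               fetch_warn: str = "500ms") -> dict:
    """Settings supplied at index creation (flat dotted keys are accepted)."""
    return {
        "settings": {
            "index.codec": codec,
            "index.refresh_interval": refresh_interval,
            # Slow logs take separate thresholds for the query and fetch phases.
            "index.search.slowlog.threshold.query.warn": query_warn,
            "index.search.slowlog.threshold.fetch.warn": fetch_warn,
        }
    }


def update_refresh_interval_request(index: str, refresh_interval: str):
    """Return (method, path, body) to change refresh_interval on an existing index."""
    return "PUT", f"/{index}/_settings", {"index": {"refresh_interval": refresh_interval}}


print(json.dumps(create_index_settings_body(), indent=2))

# Example: relax the refresh interval during a batch load on a hypothetical index.
method, path, body = update_refresh_interval_request("products", "60s")
print(method, path, json.dumps(body))
```

A typical batch-indexing flow would raise refresh_interval before the load and set it back to a lower value afterward.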
Before migrating every configuration setting, assess if the setting can be adjusted based on your current search system experience and best practices. For instance, in the preceding example, the slow logs threshold of 1 second might be intensive for logging, so that can be revisited. In the same example, max.booleanClauses might be another thing to look at and reduce.
Differences: Some settings are applied at the cluster level or node level and not at the index level, including settings such as max boolean clause, circuit breaker settings, cache settings, and so on.
Rewriting queries
Rewriting queries deserves its own blog post; however, we want to at least showcase the autocomplete feature available in OpenSearch Dashboards, which helps ease query writing.
Similar to the Solr Admin UI, OpenSearch also features a UI called OpenSearch Dashboards. You can use OpenSearch Dashboards to manage and scale your OpenSearch clusters. Additionally, it provides capabilities for visualizing your OpenSearch data, exploring data, monitoring observability, running queries, and so on. The equivalent of the query tab on the Solr UI in OpenSearch Dashboards is Dev Tools. Dev Tools is a development environment that lets you set up your OpenSearch Dashboards environment, run queries, explore data, and debug problems.
Now, let’s construct a query to accomplish the following:
- Search for shirt OR shoe in an index.
- Create a facet query to find the number of unique customers. Facet queries are called aggregation queries (also known as aggs queries) in OpenSearch.
The Solr query would look like this:
The image below demonstrates how to rewrite the above Solr query into an OpenSearch query DSL:
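One way the two queries might look is sketched here under stated assumptions: the Solr base URL, the collection and index name (ecommerce), and the customer_id field are hypothetical stand-ins. The Solr side uses the JSON Facet API unique() function, and the OpenSearch side uses a query_string query with a cardinality aggregation. The code builds and prints the requests without sending them.

```python
import json
from urllib.parse import urlencode

SEARCH_TEXT = "shirt OR shoe"


def solr_select_url(base: str, collection: str, customer_field: str) -> str:
    """Build a Solr /select URL with a JSON facet counting unique customers."""
    params = {
        "q": SEARCH_TEXT,
        "json.facet": json.dumps({"unique_customers": f"unique({customer_field})"}),
    }
    return f"{base}/{collection}/select?{urlencode(params)}"


def opensearch_search_request(index: str, customer_field: str):
    """Return (method, path, body) for the equivalent OpenSearch query DSL."""
    body = {
        "query": {"query_string": {"query": SEARCH_TEXT}},
        "aggs": {
            # cardinality returns an approximate distinct count
            "unique_customers": {"cardinality": {"field": customer_field}}
        },
    }
    return "GET", f"/{index}/_search", body


print(solr_select_url("http://localhost:8983/solr", "ecommerce", "customer_id"))

method, path, body = opensearch_search_request("ecommerce", "customer_id")
print(method, path)
print(json.dumps(body, indent=2))
```

Pasting the printed method, path, and body into Dev Tools is a convenient way to try the query, and the autocomplete there helps when adjusting it.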
Conclusion
OpenSearch covers a wide variety of use cases, including enterprise search, site search, application search, ecommerce search, semantic search, observability (log observability, security analytics (SIEM), anomaly detection, trace analytics), and analytics. Migration from Solr to OpenSearch is becoming a common pattern. This blog post is designed to be a starting point for teams seeking guidance on such migrations.
You can try out OpenSearch with the OpenSearch Playground. You can get started with Amazon OpenSearch Service, a managed implementation of OpenSearch in the AWS Cloud.
About the Authors
Aswath Srinivasan is a Senior Search Engine Architect at Amazon Web Services currently based in Munich, Germany. With over 17 years of experience in various search technologies, Aswath currently focuses on OpenSearch. He is a search and open-source enthusiast and helps customers and the search community with their search problems.
Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.