Exploring concurrent segment search performance
In October 2023, we introduced concurrent segment search in OpenSearch as an experimental feature. Searching segments concurrently improves search latency across a wide variety of workloads. The feature became generally available in OpenSearch 2.12, and we highly recommend that you try it! In this post, we share performance results from simulations of different real-world scenarios. In particular, we look at performance trends as available system resources decrease and concurrency increases.
Concurrent segment search divides each shard-level search request on a node into multiple execution tasks called slices. Slices can be executed concurrently on separate threads in the `index_searcher` thread pool, separately from the `search` thread pool. Each slice searches within its associated segments. Once all slice executions are complete, the collected results from all slices are combined (reduced) and returned to the coordinator node. The `index_searcher` thread pool is used to execute the slices of each shard search request and is shared across all shard search requests on a node. By default, the `index_searcher` thread pool has twice as many threads as the number of available processors.
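If you want to reproduce these configurations, the following is a minimal sketch of how you might enable concurrent segment search and pin a fixed slice count through the cluster settings API, driven from Python with the `requests` library. It assumes a cluster reachable at `localhost:9200` and the dynamic settings `search.concurrent_segment_search.enabled` and `search.concurrent.max_slice_count`; check the documentation for the exact settings available in your version.

```python
import requests

# Minimal sketch: enable concurrent segment search cluster-wide and fix the slice count.
# Assumptions: local cluster at localhost:9200, OpenSearch 2.12+ setting names.
OPENSEARCH_URL = "http://localhost:9200"

settings = {
    "persistent": {
        # Turn concurrent segment search on for all indexes on the cluster.
        "search.concurrent_segment_search.enabled": True,
        # 0 (the default) keeps the Lucene default slicing mechanism;
        # a positive value fixes the number of slices per shard-level request.
        "search.concurrent.max_slice_count": 4,
    }
}

response = requests.put(f"{OPENSEARCH_URL}/_cluster/settings", json=settings)
response.raise_for_status()
print(response.json())
```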
Performance results
For our performance testing, we used a standard r5.8xlarge instance type, which has 32 vCPUs and an `index_searcher` thread pool size of 64. This allowed us to explore various concurrency and cluster load scenarios on a realistic instance as well as capture a more accurate picture of performance as we increased the number of slices and search clients.
In our previous blog post, we showed that the strongest performance improvements were demonstrated in long-running, CPU-intensive operations, while fast queries like `match_all` saw little improvement or even a slight performance regression from the overhead of concurrent segment search. Because this post focuses on how performance changes under various cluster load conditions, we look at a smaller subset of longer-running operations from the `nyc_taxis` and `big5` workloads using OpenSearch Benchmark. Additionally, we report the p90 system metrics that OpenSearch Benchmark provides for the entire workload, which are most relevant to the longer-running operations.
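As a reference point, the following is a rough sketch of how such a run might be invoked, here driven from Python. It assumes OpenSearch Benchmark is installed, a test cluster is reachable at `localhost:9200`, and the run is limited to two of the operations discussed below; adjust the host and task list for your environment.

```python
import subprocess

# Sketch of an OpenSearch Benchmark run against an existing cluster
# (benchmark-only pipeline). Host and task selection are illustrative.
cmd = [
    "opensearch-benchmark", "execute-test",
    "--workload", "big5",
    "--pipeline", "benchmark-only",
    "--target-hosts", "localhost:9200",
    # Limit the run to a few of the longer-running operations from this post.
    "--include-tasks", "range-auto-date-histo-with-metrics,range-auto-date-histo",
]
subprocess.run(cmd, check=True)
```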
First, we established a performance baseline using a single shard and a single search client. This configuration is the theoretical best-case scenario for concurrent segment search because at most 1 shard-level search request is processed at a time; thus, each request has the maximum amount of system resources available to it.
The following sections present the performance results. We abbreviate concurrent segment search as “CS”.
Cluster setup 1
- Instance type: r5.8xlarge (32 vCPUs, 256 GB RAM)
- Node count: 1
- Shard count: 1
- Search client count: 1
- Concurrent search thread pool size: 64
p90 query latency comparison
Operation | CS disabled (in ms) | CS enabled (Lucene default slices) (in ms) | % Improvement | CS enabled (fixed slice count=2) (in ms) | % Improvement | CS enabled (fixed slice count=4) (in ms) | % Improvement |
---|---|---|---|---|---|---|---|
`range-auto-date-histo-with-metrics` (`big5`) | 21757 | 4178 | 81% | 11710 | 46% | 6915 | 68% |
`range-auto-date-histo` (`big5`) | 7950 | 1486 | 81% | 4341 | 45% | 2555 | 67% |
`query-string-on-message` (`big5`) | 143 | 49 | 66% | 81 | 43% | 59 | 58% |
`keyword-in-range` (`big5`) | 127 | 58 | 54% | 87 | 31% | 70 | 44% |
`distance_amount_agg` (`nyc_taxis`) | 12403 | 2921 | 76% | 6642 | 46% | 3633 | 71% |
System metrics (`big5`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 3% | 36% | 6% | 12% |
p90 JVM | 54% | 56% | 54% | 54% |
Max `index_searcher` active threads | – | 17 | 2 | 4 |
System metrics (`nyc_taxis`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 3% | 23% | 7% | 13% |
p90 JVM | 23% | 23% | 21% | 22% |
Max `index_searcher` active threads | – | 22 | 2 | 4 |
Similar to the initial performance results shared in our first blog post, with the larger `r5.8xlarge` instance type we see strong performance improvements in long-running, CPU-intensive operations. As for system resource utilization, CPU usage increases as expected when the number of active concurrent search threads increases. However, p90 JVM heap utilization appears to be mostly uncorrelated with increased concurrency.
Taking a step back, the theoretical maximum performance gained from using concurrent segment search is approximately a 50% improvement in shard-level search request latency for every twofold increase in concurrency. For example, when going from no concurrent search to concurrent search with 2 slices, the theoretical maximum performance improvement is 50%. Increasing to 4 slices, the maximum performance improvement is 75%, and to 8 slices, 87.5%. Additionally, for every doubling of concurrency, we expect the CPU utilization to roughly double, assuming the same work distribution across slices. This is because twice the number of CPU threads is used, although for a shorter duration.
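As a quick illustration, the following sketch computes these upper bounds, assuming work splits perfectly evenly across slices and ignoring reduce overhead:

```python
def max_theoretical_improvement(slice_count: int) -> float:
    """Upper bound on latency reduction if a shard request splits evenly across slices."""
    return 1 - 1 / slice_count

for slices in (2, 4, 8):
    print(f"{slices} slices -> up to {max_theoretical_improvement(slices):.1%} lower latency")
# 2 slices -> up to 50.0% lower latency
# 4 slices -> up to 75.0% lower latency
# 8 slices -> up to 87.5% lower latency
```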
With that in mind, even in Cluster setup 1, we begin to observe diminishing returns in performance improvement as we increase the slice-level concurrency. This can largely be attributed to duplicated work across slices and to the additional effort required to reduce the slice-level search results, both of which grow as the number of slices increases.
Performance improvements vs. CPU utilization of `range-auto-date-histo-with-metrics`
The following table provides example performance improvement data for the `range-auto-date-histo-with-metrics` operation.
Comparison | % Performance improvement | % Additional CPU utilization (p90) |
---|---|---|
Concurrent search disabled to 2 slices | 46% | 3% |
2 slices to 4 slices | 22% | 6% |
4 slices to Lucene default slice count | 13% | 24% |
When we compare concurrent search with 2 slices to not using concurrent search, we can see that we get a 46% performance improvement by utilizing just 3% more CPU. However, performance improvements diminish as we increase the slice count in order to utilize more CPU. We can see that going from 4 slices to the Lucene default slice count results in only a 13% performance improvement at the cost of a 24% higher CPU utilization.
Of course, in the real world, a cluster rarely serves only a single search request at a time. To understand how the performance of concurrent segment search changes as the load on a cluster increases, we ran performance tests on a few additional cluster setups. The results are presented in the following sections.
Cluster setup 2
- Instance type: r5.8xlarge (32 vCPUs, 256 GB RAM)
- Node count: 1
- Shard count: 1
- Search client count: 2
- Concurrent search thread pool size: 64
p90 query latency comparison
Operation | CS disabled (in ms) | CS enabled (Lucene default slices) (in ms) | % Improvement | CS enabled (fixed slice count=2) (in ms) | % Improvement | CS enabled (fixed slice count=4) (in ms) | % Improvement |
---|---|---|---|---|---|---|---|
`range-auto-date-histo-with-metrics` (`big5`) | 21888 | 4544 | 79% | 11930 | 45% | 6920 | 68% |
`range-auto-date-histo` (`big5`) | 8235 | 1634 | 80% | 4532 | 45% | 2633 | 68% |
`query-string-on-message` (`big5`) | 142 | 49 | 65% | 101 | 29% | 63 | 55% |
`keyword-in-range` (`big5`) | 127 | 61 | 52% | 105 | 18% | 73 | 43% |
`distance_amount_agg` (`nyc_taxis`) | 12335 | 2941 | 76% | 6969 | 44% | 3689 | 70% |
System metrics (`big5`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 6% | 60% | 12% | 25% |
p90 JVM | 54% | 53% | 56% | 54% |
Max `index_searcher` active threads | – | 34 | 4 | 8 |
System metrics (`nyc_taxis`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 6% | 47% | 13% | 24% |
p90 JVM | 59% | 59% | 50% | 55% |
Max `index_searcher` active threads | – | 44 | 4 | 8 |
In Cluster setup 2, we increase the search client count to 2, so there are now 2 search clients sending search requests to the cluster at the same time. Based on the system metrics, we can confirm that the max `index_searcher` active threads metric shows twice as many active threads in the 2-slice, 4-slice, and Lucene default cases. Moreover, the OpenSearch Benchmark workloads run in benchmarking mode, which means that there is no delay between requests: as soon as a search client receives a response from the server, it sends a subsequent request.
Based on the system utilization metrics, even with the additional search client, CPU utilization remains below 60%. In fact, even in the worst-case scenario, CPU is not being fully utilized. Therefore, we don’t observe any decline in performance improvement when comparing various slice count scenarios between Cluster setup 1 and Cluster setup 2. As we increase the number of slices, we continue to observe diminishing returns in performance improvement relative to the rise in CPU usage.
Performance improvements vs. CPU utilization of `range-auto-date-histo-with-metrics`
The following table provides example performance improvement data for the `range-auto-date-histo-with-metrics` operation.
Comparison | % Performance improvement | % Additional CPU utilization (p90) |
---|---|---|
Concurrent search disabled to 2 slices | 45% | 6% |
2 slices to 4 slices | 23% | 13% |
4 slices to Lucene default slice count | 11% | 35% |
Cluster setup 3
- Instance type: r5.8xlarge (32 vCPUs, 256 GB RAM)
- Node count: 1
- Shard count: 1
- Search client count: 4
- Concurrent search thread pool size: 64
p90 query latency comparison
Operation | CS disabled (in ms) | CS enabled (Lucene default slices) (in ms) | % Improvement | CS enabled (fixed slice count=2) (in ms) | % Improvement | CS enabled (fixed slice count=4) (in ms) | % Improvement |
---|---|---|---|---|---|---|---|
`range-auto-date-histo-with-metrics` (`big5`) | 21307 | 6398 | 70% | 11692 | 45% | 6921 | 68% |
`range-auto-date-histo` (`big5`) | 8088 | 2444 | 70% | 4504 | 44% | 2727 | 66% |
`query-string-on-message` (`big5`) | 142 | 51 | 64% | 103 | 27% | 69 | 52% |
`keyword-in-range` (`big5`) | 132 | 68 | 48% | 110 | 17% | 81 | 39% |
`distance_amount_agg` (`nyc_taxis`) | 12022 | 3512 | 71% | 6362 | 47% | 3649 | 70% |
System metrics (`big5`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 13% | 93% | 25% | 49% |
p90 JVM | 54% | 54% | 54% | 54% |
Max `index_searcher` active threads | – | 64 | 8 | 16 |
System metrics (`nyc_taxis`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 12% | 77% | 25% | 50% |
p90 JVM | 59% | 59% | 59% | 59% |
Max `index_searcher` active threads | – | 64 | 8 | 16 |
The next setup serves 4 search clients concurrently. For the Lucene default slice count, this scenario creates enough segment slices to fill the `index_searcher` thread pool: the maximum number of active concurrent search threads is 64, which is equal to the thread pool size. Because of this, the majority of the available CPU resources are consumed in the Lucene default slice count case, leading to diminishing performance improvements.
Cluster setup 1 and Cluster setup 2 showed a roughly 80% performance improvement in the `range-auto-date-histo-with-metrics` operation for the Lucene default case. However, in Cluster setup 3, when we reach the CPU availability bottleneck, this same performance improvement decreases to 70%.
Performance improvements vs. CPU utilization of `range-auto-date-histo-with-metrics`
The following table provides example performance improvement data for the `range-auto-date-histo-with-metrics` operation.
Comparison | % Performance improvement | % Additional CPU utilization (p90) |
---|---|---|
Concurrent search disabled to 2 slices | 45% | 12% |
2 slices to 4 slices | 22% | 24% |
4 slices to Lucene default slice count | 2% | 44% |
Reviewing the `range-auto-date-histo-with-metrics` operation again shows that as we approach 100% CPU utilization, the performance benefit provided by additional slices when using concurrent segment search mostly disappears. In Cluster setup 3, when comparing 4 slices to the Lucene default slice count, we see only a 2% performance improvement at the cost of a prohibitive 44% additional CPU usage.
Cluster setup 4
- Instance type: r5.8xlarge (32 vCPUs, 256 GB RAM)
- Node count: 1
- Shard count: 1
- Search client count: 8
- Concurrent search thread pool size: 64
p90 query latency comparison
Operation | CS disabled (in ms) | CS enabled (Lucene default slices) (in ms) | % Improvement | CS enabled (fixed slice count=2) (in ms) | % Improvement | CS enabled (fixed slice count=4) (in ms) | % Improvement |
---|---|---|---|---|---|---|---|
`range-auto-date-histo-with-metrics` (`big5`) | 21641 | 11937 | 45% | 11596 | 46% | 11884 | 45% |
`range-auto-date-histo` (`big5`) | 8382 | 4457 | 47% | 4536 | 46% | 4437 | 47% |
`query-string-on-message` (`big5`) | 166 | 83 | 50% | 118 | 29% | 75 | 55% |
`keyword-in-range` (`big5`) | 162 | 99 | 39% | 126 | 22% | 90 | 44% |
`distance_amount_agg` (`nyc_taxis`) | 11727 | 5420 | 54% | 6586 | 44% | 5326 | 55% |
System metrics (`big5`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 25% | 100% | 49% | 99% |
p90 JVM | 56% | 55% | 54% | 54% |
Max `index_searcher` active threads | – | 64 | 16 | 32 |
System metrics (`nyc_taxis`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 26% | 100% | 50% | 100% |
p90 JVM | 60% | 58% | 59% | 58% |
Max `index_searcher` active threads | – | 64 | 16 | 32 |
For this setup, we again double the number of search clients, to 8. Based on the system resource utilization metrics, we see that in both the 4-slice and the Lucene default slice cases, we reach 100% CPU utilization. As expected, increasing the slice count in this scenario results in even more pronounced diminishing returns and, in some cases, even slight performance regressions.
Performance improvements vs. CPU utilization of `range-auto-date-histo-with-metrics`
The following table provides example performance improvement data for the `range-auto-date-histo-with-metrics` operation.
Comparison | % Performance improvement | % Additional CPU utilization (p90) |
---|---|---|
Concurrent search disabled to 2 slices | 46% | 24% |
2 slices to 4 slices | -1% | 50% |
4 slices to Lucene default slice count | 0% | 1% |
In Cluster setup 3, we saw little to no benefit in going from 4 slices to the Lucene default slice count once CPU utilization reached 100%. Similarly, in this scenario, moving from 2 slices to 4 slices yields minimal benefit because CPU utilization reaches 100%. Clearly, in scenarios with a high number of search clients sending requests concurrently, increasing slice-level concurrency on the cluster is unlikely to yield performance gains because CPU utilization is already at its maximum and caps the number of queries per second the cluster can serve.
Comparing setups
The following sections present a comparison of the preceding setups.
Setup comparison for `range-auto-date-histo-with-metrics`
Cluster configuration | % Performance improvement from CS disabled to 2 slices | % Additional CPU utilization | % Performance improvement from 2 slices to 4 slices | % Additional CPU utilization | % Performance improvement from 4 slices to Lucene default | % Additional CPU utilization |
---|---|---|---|---|---|---|
1 shard / 1 client | 46% | 3% | 22% | 6% | 12% | 24% |
1 shard / 2 clients | 45% | 6% | 22% | 13% | 10% | 35% |
1 shard / 4 clients | 45% | 12% | 22% | 24% | 2% | 44% |
1 shard / 8 clients | 46% | 24% | -1% | 50% | 0% | 1% |
Setup comparison for `distance_amount_agg`
Cluster configuration | % Performance improvement from CS disabled to 2 slices | % Additional CPU utilization | % Performance improvement from 2 slices to 4 slices | % Additional CPU utilization | % Performance improvement from 4 slices to Lucene default | % Additional CPU utilization |
---|---|---|---|---|---|---|
1 shard / 1 client | 45% | 4% | 26% | 6% | 6% | 10% |
1 shard / 2 clients | 49% | 7% | 21% | 11% | 7% | 23% |
1 shard / 4 clients | 46% | 13% | 22% | 25% | 3% | 27% |
1 shard / 8 clients | 44% | 24% | 10% | 50% | 0% | 0% |
The main takeaway here is that whenever there are available CPU resources, you can improve performance by further increasing concurrency. However, once CPU resources are fully utilized, you will no longer see performance gains by increasing concurrency, and you may even see slight regressions. Moreover, there are diminishing returns on increasing the concurrency of a single request even when there are CPU resources available. This is because there is additional overhead introduced by the combination of duplicated work across slices in the concurrent portion and sequential work during the reduce phase. As the availability of these CPU resources decreases, the effect of the diminishing returns on concurrency is further amplified.
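One practical way to apply this takeaway is to check CPU headroom before raising the slice count. The following rough sketch polls the node stats API for per-node CPU utilization; the host and the 80% threshold are illustrative assumptions, not recommendations.

```python
import requests

# Rough sketch: inspect per-node CPU utilization before increasing slice-level
# concurrency. Host and threshold are illustrative assumptions.
resp = requests.get("http://localhost:9200/_nodes/stats/os")
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    cpu_percent = node["os"]["cpu"]["percent"]
    print(f"{node.get('name', node_id)}: CPU {cpu_percent}%")
    if cpu_percent > 80:
        print("  -> little CPU headroom; a higher slice count is unlikely to help")
```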
`noaa` workload
In addition to the `nyc_taxis` and `big5` workloads, we also benchmarked performance for the `noaa` OpenSearch Benchmark workload, which is focused on aggregations. Because we saw the greatest improvements for the aggregation-related queries in `nyc_taxis` and `big5`, we wanted to see how these performance gains held up across datasets and queries.
Cluster setup 5
- Instance type: r5.2xlarge (8 vCPUs, 64 GB RAM)
- Node count: 1
- Shard count: 1
- Search client count: 1
- Concurrent search thread pool size: 16
p90 query latency comparison
Workload operation | CS disabled (in ms) | CS enabled (Lucene default slices) (in ms) | % Improvement | CS enabled (fixed slice count=4) (in ms) | % Improvement |
---|---|---|---|---|---|
`date-histo-entire-range` | 3 | 4 | -31% | 3 | 2% |
`date-histo-string-significant-terms-via-map` | 13700 | 7650 | 44% | 7538 | 45% |
`keyword-terms` | 147 | 83 | 44% | 78 | 47% |
`range-auto-date-histo-with-metrics` | 3957 | 2182 | 45% | 2226 | 44% |
`range-date-histo` | 1803 | 948 | 47% | 1029 | 43% |
`range-numeric-significant-terms` | 2597 | 2951 | -14% | 1627 | 37% |
System metrics
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=4) |
---|---|---|---|
p90 CPU | 25% | 99% | 50% |
p90 JVM | 60% | 60% | 60% |
Max `index_searcher` active threads | – | 16 | 4 |
The performance results for these aggregation types confirm the results we saw with the previous cluster configurations. We see strong performance improvements in most aggregation types and, again, these improvements diminish, and sometimes even regress, as we increase concurrency and approach full CPU utilization.
Observations
Increases or decreases in performance related to concurrent segment search can usually be attributed to one of the following four reasons:
- First, whenever the number of segment slices is large, the `index_searcher` thread pool fills up. Whenever there are no threads available to execute the shard search task for a slice, the slice waits in the queue until other slices finish processing. For example, in Cluster setup 3 there are `4 * 17 = 68` total segment slices when using the Lucene default slice count but only 64 threads available in the concurrent search thread pool, so 4 segment slices spend some time waiting in the thread pool queue (see the sketch after this list).
- Second, whenever the number of active threads is higher than the number of CPU cores, each individual thread may spend more time processing because the CPU cores are multiplexing tasks. By default, the `r5.2xlarge` instance with 8 CPU cores has 16 threads in the `index_searcher` thread pool and 13 threads in the `search` thread pool. If all 29 threads are concurrently processing search tasks, then each individual thread takes longer to finish because only 8 CPU cores serve these 29 threads.
- Third, the specific query implementation can greatly impact performance when increasing concurrency because some queries may perform more duplicate work as the number of slices increases. For example, significant terms aggregations run count queries for each bucket key to determine the term background frequencies, so duplicated bucket keys across segment slices result in duplicated count queries across slices as well.
- Fourth, the reduce phase is performed sequentially over the results of all segment slices. If the reduce overhead is large, it can offset the gains realized from searching documents concurrently. For example, for aggregations, a new `Aggregator` instance is created for each segment slice. Each `Aggregator` creates an `InternalAggregation` object, which represents the buckets created during document collection. These `InternalAggregation` instances are then processed sequentially during the reduce phase. As a result, a simple `terms` aggregation can create up to `slice_count * shard_size` buckets per shard, all of which must be reduced sequentially.
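To make the first observation concrete, here is a back-of-the-envelope sketch of the thread-pool oversubscription in Cluster setup 3; the numbers come directly from the discussion above.

```python
# Thread-pool oversubscription in Cluster setup 3 (numbers from the observations above).
vcpus = 32
index_searcher_threads = 2 * vcpus        # default pool size: 2x available processors -> 64
clients = 4                               # concurrent search clients
slices_per_request = 17                   # Lucene default slice count for this index

total_slices = clients * slices_per_request                    # 68 slice tasks in flight
queued_slices = max(0, total_slices - index_searcher_threads)  # 4 slices wait in the queue
print(f"{total_slices} slices compete for {index_searcher_threads} threads; "
      f"{queued_slices} wait in the thread pool queue")
```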
Wrapping up
In summary, when choosing a segment slice count to use, it’s important to run your own benchmarking to determine whether the additional parallelization produced by adding more segment slices outweighs the additional processing overhead. Concurrent segment search is ready for use in production environments, and you can continue to track its ongoing improvements on this project board.
Additionally, to provide visibility into performance over time, we will publish nightly performance runs for concurrent segment search in OpenSearch Performance Benchmarks, covering all the test workloads mentioned in this post.
For guidelines on getting started with concurrent segment search, see General guidelines.