Exploring concurrent segment search performance
In October 2023, we introduced concurrent segment search in OpenSearch as an experimental feature. Searching segments concurrently improves search latency across a wide variety of workloads. The feature became generally available in OpenSearch 2.12, and we highly recommend that you try it! In this post, we share performance results from simulations of different real-world scenarios. In particular, we look at performance trends as available system resources decrease and concurrency increases.
Concurrent segment search divides each shard-level search request on a node into multiple execution tasks called slices. Slices can be executed concurrently on separate threads in the `index_searcher` thread pool, separately from the `search` thread pool. Each slice searches within its associated segments. Once all slice executions are complete, the collected results from all slices are combined (reduced) and returned to the coordinator node. The `index_searcher` thread pool is used to execute the slices of each shard search request and is shared across all shard search requests on a node. By default, the `index_searcher` thread pool has twice as many threads as the number of available processors.
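If you want to reproduce these configurations, the following is a minimal sketch of how you might enable concurrent segment search and pin a fixed slice count through the cluster settings API, driven from Python with the `requests` library. It assumes a cluster reachable at `localhost:9200` and the dynamic settings `search.concurrent_segment_search.enabled` and `search.concurrent.max_slice_count`; check the documentation for the exact settings available in your version.

```python
import requests

# Minimal sketch: enable concurrent segment search cluster-wide and fix the slice count.
# Assumptions: local cluster at localhost:9200, OpenSearch 2.12+ setting names.
OPENSEARCH_URL = "http://localhost:9200"

settings = {
    "persistent": {
        # Turn concurrent segment search on for all indexes on the cluster.
        "search.concurrent_segment_search.enabled": True,
        # 0 (the default) keeps the Lucene default slicing mechanism;
        # a positive value fixes the number of slices per shard-level request.
        "search.concurrent.max_slice_count": 4,
    }
}

response = requests.put(f"{OPENSEARCH_URL}/_cluster/settings", json=settings)
response.raise_for_status()
print(response.json())
```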
Performance results
For our performance testing, we used a standard r5.8xlarge instance type, which has 32 vCPUs and an `index_searcher` thread pool size of 64. This allowed us to explore various concurrency and cluster load scenarios on a realistic instance as well as capture a more accurate picture of performance as we increased the number of slices and search clients.
In our previous blog post, we showed that the strongest performance improvements were demonstrated in long-running, CPU-intensive operations, while fast queries like `match_all` saw little improvement or even a slight performance regression from the overhead of concurrent segment search. Because this post focuses on how performance changes under various cluster load conditions, we look at a smaller subset of longer-running operations from the `nyc_taxis` and `big5` workloads using OpenSearch Benchmark. Additionally, we report the p90 system metrics that OpenSearch Benchmark provides for the entire workload, which are most relevant to the longer-running operations.
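As a reference point, the following is a rough sketch of how such a run might be invoked, here driven from Python. It assumes OpenSearch Benchmark is installed, a test cluster is reachable at `localhost:9200`, and the run is limited to two of the operations discussed below; adjust the host and task list for your environment.

```python
import subprocess

# Sketch of an OpenSearch Benchmark run against an existing cluster
# (benchmark-only pipeline). Host and task selection are illustrative.
cmd = [
    "opensearch-benchmark", "execute-test",
    "--workload", "big5",
    "--pipeline", "benchmark-only",
    "--target-hosts", "localhost:9200",
    # Limit the run to a few of the longer-running operations from this post.
    "--include-tasks", "range-auto-date-histo-with-metrics,range-auto-date-histo",
]
subprocess.run(cmd, check=True)
```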
First, we established a performance baseline using a single shard and a single search client. This configuration is the theoretical best-case scenario for concurrent segment search because at most 1 shard-level search request is processed at a time; thus, each request has the maximum amount of system resources available to it.
The following sections present the performance results. We abbreviate concurrent segment search as “CS”.
Cluster setup 1
- Instance type: r5.8xlarge (32 vCPUs, 256 GB RAM)
- Node count: 1
- Shard count: 1
- Search client count: 1
- Concurrent search thread pool size: 64
p90 query latency comparison
Operation | CS disabled (in ms) | CS enabled (Lucene default slices) (in ms) | % Improvement | CS enabled (fixed slice count=2) (in ms) | % Improvement | CS enabled (fixed slice count=4) (in ms) | % Improvement |
---|---|---|---|---|---|---|---|
`range-auto-date-histo-with-metrics` (`big5`) | 21757 | 4178 | 81% | 11710 | 46% | 6915 | 68% |
`range-auto-date-histo` (`big5`) | 7950 | 1486 | 81% | 4341 | 45% | 2555 | 67% |
`query-string-on-message` (`big5`) | 143 | 49 | 66% | 81 | 43% | 59 | 58% |
`keyword-in-range` (`big5`) | 127 | 58 | 54% | 87 | 31% | 70 | 44% |
`distance_amount_agg` (`nyc_taxis`) | 12403 | 2921 | 76% | 6642 | 46% | 3633 | 71% |
System metrics (`big5`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 3% | 36% | 6% | 12% |
p90 JVM | 54% | 56% | 54% | 54% |
Max `index_searcher` active threads | – | 17 | 2 | 4 |
System metrics (`nyc_taxis`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 3% | 23% | 7% | 13% |
p90 JVM | 23% | 23% | 21% | 22% |
Max `index_searcher` active threads | – | 22 | 2 | 4 |
Similar to the initial performance results shared in our first blog post, with the larger `r5.8xlarge` instance type we see strong performance improvements in long-running, CPU-intensive operations. As for system resource utilization, CPU usage increases as expected when the number of active concurrent search threads increases. However, p90 JVM heap utilization appears to be mostly uncorrelated with increased concurrency.
Taking a step back, the theoretical maximum performance gained from using concurrent segment search is approximately a 50% improvement in shard-level search request latency for every twofold increase in concurrency. For example, when going from no concurrent search to concurrent search with 2 slices, the theoretical maximum performance improvement is 50%. Increasing to 4 slices, the maximum performance improvement is 75%, and to 8 slices, 87.5%. Additionally, for every doubling of concurrency, we expect the CPU utilization to roughly double, assuming the same work distribution across slices. This is because twice the number of CPU threads is used, although for a shorter duration.
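As a quick illustration, the following sketch computes these upper bounds, assuming work splits perfectly evenly across slices and ignoring reduce overhead:

```python
def max_theoretical_improvement(slice_count: int) -> float:
    """Upper bound on latency reduction if a shard request splits evenly across slices."""
    return 1 - 1 / slice_count

for slices in (2, 4, 8):
    print(f"{slices} slices -> up to {max_theoretical_improvement(slices):.1%} lower latency")
# 2 slices -> up to 50.0% lower latency
# 4 slices -> up to 75.0% lower latency
# 8 slices -> up to 87.5% lower latency
```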
With that in mind, even in Cluster setup 1, we begin to observe diminishing returns in performance improvement as we increase the slice-level concurrency. This can largely be attributed to duplicated work across slices and to the additional effort required to reduce the slice-level search results, both of which grow as the number of slices increases.
Performance improvements vs. CPU utilization of `range-auto-date-histo-with-metrics`
The following table provides example performance improvement data for the `range-auto-date-histo-with-metrics` operation.
Comparison | % Performance improvement | % Additional CPU utilization (p90) |
---|---|---|
Concurrent search disabled to 2 slices | 46% | 3% |
2 slices to 4 slices | 22% | 6% |
4 slices to Lucene default slice count | 13% | 24% |
When we compare concurrent search with 2 slices to not using concurrent search, we can see that we get a 46% performance improvement by utilizing just 3% more CPU. However, performance improvements diminish as we increase the slice count in order to utilize more CPU. We can see that going from 4 slices to the Lucene default slice count results in only a 13% performance improvement at the cost of a 24% higher CPU utilization.
Of course, in the real world, a cluster rarely serves only a single search request at a time. To understand how the performance of concurrent segment search changes as the load on a cluster increases, we ran performance tests on a few additional cluster setups. The results are presented in the following sections.
Cluster setup 2
- Instance type: r5.8xlarge (32 vCPUs, 256 GB RAM)
- Node count: 1
- Shard count: 1
- Search client count: 2
- Concurrent search thread pool size: 64
p90 query latency comparison
Operation | CS disabled (in ms) | CS enabled (Lucene default slices) (in ms) | % Improvement | CS enabled (fixed slice count=2) (in ms) | % Improvement | CS enabled (fixed slice count=4) (in ms) | % Improvement |
---|---|---|---|---|---|---|---|
`range-auto-date-histo-with-metrics` (`big5`) | 21888 | 4544 | 79% | 11930 | 45% | 6920 | 68% |
`range-auto-date-histo` (`big5`) | 8235 | 1634 | 80% | 4532 | 45% | 2633 | 68% |
`query-string-on-message` (`big5`) | 142 | 49 | 65% | 101 | 29% | 63 | 55% |
`keyword-in-range` (`big5`) | 127 | 61 | 52% | 105 | 18% | 73 | 43% |
`distance_amount_agg` (`nyc_taxis`) | 12335 | 2941 | 76% | 6969 | 44% | 3689 | 70% |
System metrics (`big5`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 6% | 60% | 12% | 25% |
p90 JVM | 54% | 53% | 56% | 54% |
Max `index_searcher` active threads | – | 34 | 4 | 8 |
System metrics (`nyc_taxis`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 6% | 47% | 13% | 24% |
p90 JVM | 59% | 59% | 50% | 55% |
Max `index_searcher` active threads | – | 44 | 4 | 8 |
In Cluster setup 2, we increase the search client count to 2, so there are now 2 search clients sending search requests to the cluster at the same time. Based on the system metrics, we can confirm that the max `index_searcher` active threads metric shows twice as many active threads in the 2-slice, 4-slice, and Lucene default cases. Moreover, the OpenSearch Benchmark workloads run in benchmarking mode, which means that there is no delay between requests: as soon as a search client receives a response from the server, it sends a subsequent request.
Based on the system utilization metrics, even with the additional search client, CPU utilization remains below 60%. In fact, even in the worst-case scenario, CPU is not being fully utilized. Therefore, we don’t observe any decline in performance improvement when comparing various slice count scenarios between Cluster setup 1 and Cluster setup 2. As we increase the number of slices, we continue to observe diminishing returns in performance improvement relative to the rise in CPU usage.
Performance improvements vs. CPU utilization of `range-auto-date-histo-with-metrics`
The following table provides example performance improvement data for the `range-auto-date-histo-with-metrics` operation.
Comparison | % Performance improvement | % Additional CPU utilization (p90) |
---|---|---|
Concurrent search disabled to 2 slices | 45% | 6% |
2 slices to 4 slices | 23% | 13% |
4 slices to Lucene default slice count | 11% | 35% |
Cluster setup 3
- Instance type: r5.8xlarge (32 vCPUs, 256 GB RAM)
- Node count: 1
- Shard count: 1
- Search client count: 4
- Concurrent search thread pool size: 64
p90 query latency comparison
Operation | CS disabled (in ms) | CS enabled (Lucene default slices) (in ms) | % Improvement | CS enabled (fixed slice count=2) (in ms) | % Improvement | CS enabled (fixed slice count=4) (in ms) | % Improvement |
---|---|---|---|---|---|---|---|
`range-auto-date-histo-with-metrics` (`big5`) | 21307 | 6398 | 70% | 11692 | 45% | 6921 | 68% |
`range-auto-date-histo` (`big5`) | 8088 | 2444 | 70% | 4504 | 44% | 2727 | 66% |
`query-string-on-message` (`big5`) | 142 | 51 | 64% | 103 | 27% | 69 | 52% |
`keyword-in-range` (`big5`) | 132 | 68 | 48% | 110 | 17% | 81 | 39% |
`distance_amount_agg` (`nyc_taxis`) | 12022 | 3512 | 71% | 6362 | 47% | 3649 | 70% |
System metrics (`big5`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 13% | 93% | 25% | 49% |
p90 JVM | 54% | 54% | 54% | 54% |
Max `index_searcher` active threads | – | 64 | 8 | 16 |
System metrics (`nyc_taxis`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 12% | 77% | 25% | 50% |
p90 JVM | 59% | 59% | 59% | 59% |
Max `index_searcher` active threads | – | 64 | 8 | 16 |
The next setup serves 4 search clients concurrently. For the Lucene default slice count, this scenario creates enough segment slices to fill the `index_searcher` thread pool: the maximum number of active concurrent search threads is 64, which is equal to the thread pool size. Because of this, the majority of the available CPU resources are consumed in the Lucene default slice count case, leading to diminishing performance improvements.
Cluster setup 1 and Cluster setup 2 showed a roughly 80% performance improvement in the `range-auto-date-histo-with-metrics` operation for the Lucene default case. However, in Cluster setup 3, when we reach the CPU availability bottleneck, this same performance improvement decreases to 70%.
Performance improvements vs. CPU utilization of `range-auto-date-histo-with-metrics`
The following table provides example performance improvement data for the `range-auto-date-histo-with-metrics` operation.
Comparison | % Performance improvement | % Additional CPU utilization (p90) |
---|---|---|
Concurrent search disabled to 2 slices | 45% | 12% |
2 slices to 4 slices | 22% | 24% |
4 slices to Lucene default slice count | 2% | 44% |
Reviewing the `range-auto-date-histo-with-metrics` operation again shows that as we approach 100% CPU utilization, the performance benefit provided by additional slices when using concurrent segment search mostly disappears. In Cluster setup 3, when comparing 4 slices to the Lucene default slice count, we see only a 2% performance improvement at the cost of a prohibitive 44% additional CPU usage.
Cluster setup 4
- Instance type: r5.8xlarge (32 vCPUs, 256 GB RAM)
- Node count: 1
- Shard count: 1
- Search client count: 8
- Concurrent search thread pool size: 64
p90 query latency comparison
Operation | CS disabled (in ms) | CS enabled (Lucene default slices) (in ms) | % Improvement | CS enabled (fixed slice count=2) (in ms) | % Improvement | CS enabled (fixed slice count=4) (in ms) | % Improvement |
---|---|---|---|---|---|---|---|
`range-auto-date-histo-with-metrics` (`big5`) | 21641 | 11937 | 45% | 11596 | 46% | 11884 | 45% |
`range-auto-date-histo` (`big5`) | 8382 | 4457 | 47% | 4536 | 46% | 4437 | 47% |
`query-string-on-message` (`big5`) | 166 | 83 | 50% | 118 | 29% | 75 | 55% |
`keyword-in-range` (`big5`) | 162 | 99 | 39% | 126 | 22% | 90 | 44% |
`distance_amount_agg` (`nyc_taxis`) | 11727 | 5420 | 54% | 6586 | 44% | 5326 | 55% |
System metrics (`big5`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 25% | 100% | 49% | 99% |
p90 JVM | 56% | 55% | 54% | 54% |
Max `index_searcher` active threads | – | 64 | 16 | 32 |
System metrics (`nyc_taxis`)
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=2) | CS enabled (fixed slice count=4) |
---|---|---|---|---|
p90 CPU | 26% | 100% | 50% | 100% |
p90 JVM | 60% | 58% | 59% | 58% |
Max `index_searcher` active threads | – | 64 | 16 | 32 |
For this setup, we again double the number of search clients, to 8. Based on the system resource utilization metrics, we see that in both the 4-slice and the Lucene default slice cases, we reach 100% CPU utilization. As expected, increasing the slice count in this scenario results in even more pronounced diminishing returns and, in some cases, even slight performance regressions.
Performance improvements vs. CPU utilization of `range-auto-date-histo-with-metrics`
The following table provides example performance improvement data for the `range-auto-date-histo-with-metrics` operation.
Comparison | % Performance improvement | % Additional CPU utilization (p90) |
---|---|---|
Concurrent search disabled to 2 slices | 46% | 24% |
2 slices to 4 slices | -1% | 50% |
4 slices to Lucene default slice count | 0% | 1% |
In Cluster setup 3, we saw little to no benefit in going from 4 slices to the Lucene default slice count once CPU utilization reached 100%. Similarly, in this scenario, moving from 2 slices to 4 slices yields minimal benefit because CPU utilization reaches 100%. Clearly, in scenarios with a high number of search clients sending requests concurrently, increasing slice-level concurrency on the cluster is unlikely to yield performance gains because CPU utilization is already at its maximum and caps the number of queries per second the cluster can serve.
Comparing setups
The following sections present a comparison of the preceding setups.
Setup comparison for `range-auto-date-histo-with-metrics`
Cluster configuration | % Performance improvement from CS disabled to 2 slices | % Additional CPU utilization | % Performance improvement from 2 slices to 4 slices | % Additional CPU utilization | % Performance improvement from 4 slices to Lucene default | % Additional CPU utilization |
---|---|---|---|---|---|---|
1 shard / 1 client | 46% | 3% | 22% | 6% | 12% | 24% |
1 shard / 2 clients | 45% | 6% | 22% | 13% | 10% | 35% |
1 shard / 4 clients | 45% | 12% | 22% | 24% | 2% | 44% |
1 shard / 8 clients | 46% | 24% | -1% | 50% | 0% | 1% |
Setup comparison for `distance_amount_agg`
Cluster configuration | % Performance improvement from CS disabled to 2 slices | % Additional CPU utilization | % Performance improvement from 2 slices to 4 slices | % Additional CPU utilization | % Performance improvement from 4 slices to Lucene default | % Additional CPU utilization |
---|---|---|---|---|---|---|
1 shard / 1 client | 45% | 4% | 26% | 6% | 6% | 10% |
1 shard / 2 clients | 49% | 7% | 21% | 11% | 7% | 23% |
1 shard / 4 clients | 46% | 13% | 22% | 25% | 3% | 27% |
1 shard / 8 clients | 44% | 24% | 10% | 50% | 0% | 0% |
The main takeaway here is that whenever there are available CPU resources, you can improve performance by further increasing concurrency. However, once CPU resources are fully utilized, you will no longer see performance gains by increasing concurrency, and you may even see slight regressions. Moreover, there are diminishing returns on increasing the concurrency of a single request even when there are CPU resources available. This is because there is additional overhead introduced by the combination of duplicated work across slices in the concurrent portion and sequential work during the reduce phase. As the availability of these CPU resources decreases, the effect of the diminishing returns on concurrency is further amplified.
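One practical way to apply this takeaway is to check CPU headroom before raising the slice count. The following rough sketch polls the node stats API for per-node CPU utilization; the host and the 80% threshold are illustrative assumptions, not recommendations.

```python
import requests

# Rough sketch: inspect per-node CPU utilization before increasing slice-level
# concurrency. Host and threshold are illustrative assumptions.
resp = requests.get("http://localhost:9200/_nodes/stats/os")
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    cpu_percent = node["os"]["cpu"]["percent"]
    print(f"{node.get('name', node_id)}: CPU {cpu_percent}%")
    if cpu_percent > 80:
        print("  -> little CPU headroom; a higher slice count is unlikely to help")
```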
`noaa` workload
In addition to the `nyc_taxis` and `big5` workloads, we also benchmarked performance for the `noaa` OpenSearch Benchmark workload, which is focused on aggregations. Because we saw the greatest improvements for the aggregation-related queries in `nyc_taxis` and `big5`, we wanted to see how these performance gains held up across datasets and queries.
Cluster setup 5
- Instance type: r5.2xlarge (8 vCPUs, 64 GB RAM)
- Node count: 1
- Shard count: 1
- Search client count: 1
- Concurrent search thread pool size: 16
p90 query latency comparison
Workload operation | CS disabled (in ms) | CS enabled (Lucene default slices) (in ms) | % Improvement | CS enabled (fixed slice count=4) (in ms) | % Improvement |
---|---|---|---|---|---|
`date-histo-entire-range` | 3 | 4 | -31% | 3 | 2% |
`date-histo-string-significant-terms-via-map` | 13700 | 7650 | 44% | 7538 | 45% |
`keyword-terms` | 147 | 83 | 44% | 78 | 47% |
`range-auto-date-histo-with-metrics` | 3957 | 2182 | 45% | 2226 | 44% |
`range-date-histo` | 1803 | 948 | 47% | 1029 | 43% |
`range-numeric-significant-terms` | 2597 | 2951 | -14% | 1627 | 37% |
System metrics
Metric | CS disabled | CS enabled (Lucene default slices) | CS enabled (fixed slice count=4) |
---|---|---|---|
p90 CPU | 25% | 99% | 50% |
p90 JVM | 60% | 60% | 60% |
Max `index_searcher` active threads | – | 16 | 4 |
The performance results for these aggregation types confirm the results we saw with the previous cluster configurations. We see strong performance improvements in most aggregation types and, again, these improvements diminish, and sometimes even regress, as we increase concurrency and approach full CPU utilization.
Observations
Increases or decreases in performance related to concurrent segment search can usually be attributed to one of the following four reasons:
- First, whenever the number of segment slices is large, the `index_searcher` thread pool fills up. Whenever there are no threads available to execute the shard search task for a slice, the slice waits in the queue until other slices finish processing. For example, in Cluster setup 3 there are `4 * 17 = 68` total segment slices when using the Lucene default slice count but only 64 threads available in the concurrent search thread pool, so 4 segment slices spend some time waiting in the thread pool queue (see the sketch after this list).
- Second, whenever the number of active threads is higher than the number of CPU cores, each individual thread may spend more time processing because the CPU cores are multiplexing tasks. By default, the `r5.2xlarge` instance with 8 CPU cores has 16 threads in the `index_searcher` thread pool and 13 threads in the `search` thread pool. If all 29 threads are concurrently processing search tasks, then each individual thread takes longer to finish because only 8 CPU cores serve these 29 threads.
- Third, the specific query implementation can greatly impact performance when increasing concurrency because some queries may perform more duplicate work as the number of slices increases. For example, significant terms aggregations run count queries for each bucket key to determine the term background frequencies, so duplicated bucket keys across segment slices result in duplicated count queries across slices as well.
- Fourth, the reduce phase is performed sequentially over the results of all segment slices. If the reduce overhead is large, it can offset the gains realized from searching documents concurrently. For example, for aggregations, a new `Aggregator` instance is created for each segment slice. Each `Aggregator` creates an `InternalAggregation` object, which represents the buckets created during document collection. These `InternalAggregation` instances are then processed sequentially during the reduce phase. As a result, a simple `terms` aggregation can create up to `slice_count * shard_size` buckets per shard, all of which must be reduced sequentially.
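To make the first observation concrete, here is a back-of-the-envelope sketch of the thread-pool oversubscription in Cluster setup 3; the numbers come directly from the discussion above.

```python
# Thread-pool oversubscription in Cluster setup 3 (numbers from the observations above).
vcpus = 32
index_searcher_threads = 2 * vcpus        # default pool size: 2x available processors -> 64
clients = 4                               # concurrent search clients
slices_per_request = 17                   # Lucene default slice count for this index

total_slices = clients * slices_per_request                    # 68 slice tasks in flight
queued_slices = max(0, total_slices - index_searcher_threads)  # 4 slices wait in the queue
print(f"{total_slices} slices compete for {index_searcher_threads} threads; "
      f"{queued_slices} wait in the thread pool queue")
```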
Wrapping up
In summary, when choosing a segment slice count to use, it’s important to run your own benchmarking to determine whether the additional parallelization produced by adding more segment slices outweighs the additional processing overhead. Concurrent segment search is ready for use in production environments, and you can continue to track its ongoing improvements on this project board.
Additionally, to provide visibility into performance over time, we will publish nightly performance runs for concurrent segment search in OpenSearch Performance Benchmarks, covering all the test workloads mentioned in this post.
For guidelines on getting started with concurrent segment search, see General guidelines.