Improving Anomaly Detection: One million entities in one minute
“When you can measure what you are speaking about, and express it in numbers, you know something about it.” William Thomson (Lord Kelvin), co-formulator of the laws of thermodynamics
We continually strive to improve existing OpenSearch features by harnessing the capabilities of OpenSearch itself. One such feature is the Anomaly Detection (AD) plugin, which automatically detects anomalies in your OpenSearch data.
Because OpenSearch is used to index high volumes of data in a distributed fashion, we knew it was essential to design the AD feature to have minimal impact on application workloads. OpenSearch 1.0.1 did not scale beyond 360K entities. Since OpenSearch 1.2.4, it has been possible to track one million entities at a 10-minute data arrival interval using 36 data nodes.
While the increase to one million entities was great, most monitoring solutions generate data at a far higher rate. If you want to react quickly to emergent scenarios within your cluster, that 10-minute interval is insufficient. In order for AD to be truly useful, our goal was simple: Shorten the interval to one minute for one million entities, without changing the model output or increasing the number of nodes.
The task of improving AD since its release has mirrored OpenSearch itself: distributed, dynamic, and not easy to categorize in a linear manner. This post describes the non-linear task of improving AD for all supported OpenSearch versions using the built-in features of OpenSearch. These features include streamlined access to documents (asynchronous mode, index sorting), changes to system resource management (distributing CPU load, increasing memory), and innovations in the AD model (lower memory use, faster throughput).
First steps
At first, it was unclear how to model the data characteristics that would expose bottlenecks in a 39-node system (36 data nodes, r5.2xlarge instances). Inspired by Thomson, we began testing synthetic periodic functions with injected noise and a few anomalies. A rough estimate of model inference time (including time for multiple invocations for scoring, supporting explainability features, and model updates) was about two minutes of compute in a 10-minute interval. Clearly, reducing the interval to one minute would require newer ideas, and we would have to focus on the model.
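As a quick sanity check on that budget, the following sketch simply restates the figures above in code; no new measurements are assumed:

```python
# Rough budget: ~2 minutes of model compute observed per 10-minute interval per node.
compute_s = 2 * 60
old_interval_s = 10 * 60
new_interval_s = 60

print(f"current utilization: {compute_s / old_interval_s:.0%} of the interval")

# The same work must now fit into a 1-minute interval on unchanged hardware.
print(f"minimum speedup required: {compute_s / new_interval_s:.0f}x, before leaving "
      "any headroom for indexing and search")
```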
RCF 2.0 compared with 3.0
One idea was to introduce a Random Cut Forest (RCF) 3.0 model that makes scoring, along with conditional computation of all the explanatory information, available through a single API call based on a predictor-corrector architecture reminiscent of a streaming Kalman filter.
RCF was originally designed to flag observations that were too difficult to explain given past data. With RCF 3.0, the algorithm now extends seamlessly to support forecasting and many other capabilities of streaming random forests.
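As a rough mental model only, the following sketch shows the shape of a single-call, predictor-corrector style interface: score, explain, and update state in one pass. This is not the RCF library's API or algorithm (the actual implementation is the Java RCF library used by the AD plugin); it is a toy z-score detector used purely to illustrate the calling pattern:

```python
from dataclasses import dataclass

@dataclass
class StreamingDetector:
    """Toy single-call streaming detector; NOT RCF, just the interface shape."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations (Welford's method)

    def process(self, value: float) -> dict:
        # Predict: score the point against the current state (a simple z-score).
        variance = self.m2 / self.count if self.count > 1 else 1.0
        std = variance ** 0.5 or 1.0
        score = abs(value - self.mean) / std

        # Explain: in one dimension the attribution is just the deviation.
        explanation = {"expected": self.mean, "deviation": value - self.mean}

        # Correct: fold the observation into the model state in the same call.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

        return {"score": score, "explanation": explanation}

detector = StreamingDetector()
for v in [10.0, 10.2, 9.9, 10.1, 25.0]:
    print(detector.process(v))
```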
Each model had a dynamic cache internal to it. Memory consumption corresponded to storing an unused model, whereas throughput corresponded to a model in use. That difference is germane when each node is handling 1,000,000/36 ≈ 27,778 models; the 3x overhead of fully expanded models in the running threads would only rival the memory of the stored models if we were running 27,778/3 ≈ 9,260 threads.
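The threshold behind that claim can be made explicit with a quick calculation (again, just restating the paragraph's numbers):

```python
# When does the 3x expansion of in-use models start to matter?
entities, data_nodes = 1_000_000, 36
models_per_node = entities / data_nodes   # ~27,778 resident models per node
expansion = 3                             # an in-use model is roughly 3x its stored size

# Each running thread holds one expanded model, so the expanded copies rival the
# resident models only once the thread count approaches models_per_node / expansion.
threshold_threads = models_per_node / expansion
print(f"~{models_per_node:,.0f} models per node; expansion dominates only "
      f"beyond ~{threshold_threads:,.0f} concurrent threads")
```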
We did not have anywhere near 9,260 threads. Given that, we used a dynamic trade-off, switching efficiently between the static (compact) and dynamic (expanded) memory representations, instead of the static memory-throughput trade-off of RCF 2.0 used in OpenSearch 1.2.4. With this approach, the cache is turned on at invocation, all the explanatory computation is performed, and the expanded state is discarded afterward.
Since the memory-throughput trade-off was now dynamic, we also explored alternative tree representations. In aggregate, we were handling 30 million decision trees, so any per-tree reduction in size mattered. The newer RCF 3.0 model was at least 2x faster and 30 percent smaller in memory than the corresponding use of RCF 2.0 in OpenSearch 1.2.4.
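To see what those per-tree savings are up against, here is a rough per-model budget. The heap size is an assumption for illustration (roughly half the RAM of an r5.2xlarge), the 50 percent AD fraction anticipates the model_max_size_percent setting shown later in this post, and 30 trees per model follows from the 30 million trees mentioned above:

```python
# Rough per-model memory budget under stated assumptions (not measured values).
heap_bytes = 32 * 1024**3          # assumed ~32 GB JVM heap per data node
ad_fraction = 0.5                  # plugins.anomaly_detection.model_max_size_percent
models_per_node = 1_000_000 / 36   # ~27,778 models per node
trees_per_model = 30               # 30 million trees / 1 million entities

budget_per_model = heap_bytes * ad_fraction / models_per_node
budget_per_tree = budget_per_model / trees_per_model
print(f"~{budget_per_model / 1024:.0f} KB per model, ~{budget_per_tree / 1024:.1f} KB per tree")
```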
But this still fell short of our one-minute goal. We would need to change more than the algorithm’s memory and throughput; we would also have to tweak CPU usage, especially for OpenSearch 2.0 and above.
CPU usage
In OpenSearch 2.0.1, RCF 3.0 took 30 percent more time than the 2-minute per-interval execution time of OpenSearch 1.2.4. Compared with OpenSearch 1.2.4, JVM usage was 30 percent smaller, just as we had expected from the smaller models, and CPU spikes decreased by 30 percent even though total execution time was larger. To make better use of the hardware, we investigated comparable Graviton (r6g.2xlarge) instances and found that CPU spikes matched our baseline (1.2.4) and that execution time was 20 percent faster. We also looked at using 9 nodes of r6g.8xlarge or r5.8xlarge (both of which have the same vCPU and RAM counts in their initial configurations) and found that, in both cases, the CPU spikes were 4x smaller.
Still, execution time in OpenSearch 2.0.1 was 1.5x slower than our baseline. To dig deeper, we looked at the CPU spikes on the original r5.2xlarge nodes running OpenSearch 2.0.1 and found the underlying issue.
CPU usage stayed around 1 percent, with hourly spikes of up to 65 percent. The spikes were likely caused by the usual culprits: internal hourly maintenance jobs on the cluster, saving hundreds of thousands of model checkpoints and then clearing unused models, and performing bookkeeping for internal states.
For OpenSearch 2.2, we chose to even out the resource usage across a large maintenance window. We also changed the pagination on the coordinating nodes from synchronous to asynchronous mode so that the coordinating nodes do not need to wait for responses before fetching the next page.
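Conceptually, the pagination change looks like the following sketch. This is not the plugin's code; fetch_page and process_page are hypothetical stand-ins for querying one page of entities and fanning it out to model nodes:

```python
import asyncio, time

async def fetch_page(page: int) -> list:
    # Hypothetical stand-in for one page query on the coordinating node.
    await asyncio.sleep(0.1)
    return [f"entity-{page}-{i}" for i in range(3)]

async def process_page(entities: list) -> None:
    # Hypothetical stand-in for sending a page of entities to model nodes.
    await asyncio.sleep(0.3)

async def sync_pagination(pages: int) -> None:
    # Synchronous mode: wait for each page to be fully processed
    # before fetching the next page.
    for page in range(pages):
        await process_page(await fetch_page(page))

async def async_pagination(pages: int) -> None:
    # Asynchronous mode: start processing a page and immediately fetch
    # the next one; processing overlaps with fetching.
    tasks = []
    for page in range(pages):
        tasks.append(asyncio.create_task(process_page(await fetch_page(page))))
    await asyncio.gather(*tasks)

for mode in (sync_pagination, async_pagination):
    start = time.perf_counter()
    asyncio.run(mode(5))
    print(f"{mode.__name__}: {time.perf_counter() - start:.2f}s")
```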
Setting up the detector
To set up the detector itself, we used the same category order in our configuration as in our OpenSearch index fields. For example, if the category fields in which we wish to search for anomalies are `host` and `process`, and we know that the index is sorted on the `host` field before the `process` field, we reflect that order in `category_field`:
{
    "name": "detect_gc_time",
    "description": "detect gc processing time anomaly",
    "time_field": "@timestamp",
    "indices": [
        "host-cloudwatch"
    ],
    "category_field": ["host", "process"],
    ...
}
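If you prefer to create the detector programmatically, a configuration like the one above can be posted to the AD plugin's detector API, for example with the opensearch-py client. The connection details below are placeholders, and the elided fields (feature definitions, detection interval, and so on) still need to be filled in:

```python
from opensearchpy import OpenSearch

# Placeholder connection details; replace with your cluster's endpoint and credentials.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}], http_auth=("admin", "admin"))

detector = {
    "name": "detect_gc_time",
    "description": "detect gc processing time anomaly",
    "time_field": "@timestamp",
    "indices": ["host-cloudwatch"],
    "category_field": ["host", "process"],
    # ... feature definitions, detection_interval, and so on go here ...
}

# The Anomaly Detection plugin exposes detector creation at this endpoint.
response = client.transport.perform_request(
    "POST", "/_plugins/_anomaly_detection/detectors", body=detector
)
print(response)
```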
The detector becomes more efficient when we set it up using the same field order in our documents as in the index `sort.field`. Therefore, the following example request reflects the same category order: `host` > `process` > `@timestamp`. The body also limits the index to two shards per node (`routing.allocation.total_shards_per_node`) in order to keep shard placement balanced.
request_body = {
    "settings": {
        "index": {
            "sort.field": ["host", "process", "@timestamp"],
            "sort.order": ["asc", "asc", "desc"],
            "routing.allocation.total_shards_per_node": "2"
        }
    },
    ...
}
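The index itself can then be created with those settings, for example with the opensearch-py client from the earlier sketch (mappings are omitted here):

```python
# Create the source index with the sort fields in the same order as category_field.
client.indices.create(index="host-cloudwatch", body=request_body)
```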
When set up with the same category and sort order, the CPU spikes when using OpenSearch 2.0 were below 25 percent.
Finally, we achieved continuous anomaly detection of one million entities at a one-minute interval.
Testing for anomalies in the AD plugin
It was serendipitous that the measurements taken while improving the plugin helped us uncover more about the anomalies within the plugin itself. The reading shown in the following image was taken from the plugin’s JVM.
Interestingly, our exploration of the 9 r6g.8xlarge Graviton instances, each with a 128 GB heap (50 percent of the instance memory), resulted in the measurements seen in the following readings. Notice the lower CPU spikes when compared with our measurements of the same instances on OpenSearch 2.0.
See it for yourself
The improvements outlined in this post are available in OpenSearch 2.2 or greater.
If you want to run this experiment for yourself, remember the following:
- Match the order of the detector’s category fields to the sort.field order of the index for which you are detecting anomalies.
- Use the following cluster settings to ensure that 50 percent of available memory can be used by AD models, that page_size is 10,000, and that max_entities_per_query is 1 million (a programmatic sketch follows the settings block).
PUT /_cluster/settings
{
    "persistent": {
        "plugins.anomaly_detection.model_max_size_percent": "0.5",
        "plugins.anomaly_detection.page_size": "10000",
        "plugins.anomaly_detection.max_entities_per_query": "1000000"
    }
}
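The same settings can also be applied programmatically, for example with opensearch-py (a sketch, assuming a client configured as in the earlier examples):

```python
# Apply the AD-related cluster settings shown above.
client.cluster.put_settings(body={
    "persistent": {
        "plugins.anomaly_detection.model_max_size_percent": "0.5",
        "plugins.anomaly_detection.page_size": "10000",
        "plugins.anomaly_detection.max_entities_per_query": "1000000",
    }
})
```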
But what if I don’t use one million entities?
From OpenSearch 1.2.4 to OpenSearch 2.2 or greater, many incremental improvements were made to AD, in particular to historical analysis and other downstream log analytics tasks. However, “cold start,” the gap between loading data and seeing results, has been a known challenge in AD since the beginning. Despite this challenge, the cold start gap has decreased from release to release as the AD model has improved.
Because of these improvements over time, the performance of AD is not hampered in more modest cluster configurations. To prove this, we ran the recommended benchmarking framework for index and query workloads on a 5-node cluster (2 data nodes, all c5.4xlarge), both without AD and with AD.
When adding AD, we set AD memory usage to the default 10 percent. We performed each experiment twice for verification, with two warmup iterations and five test iterations each time. The final result is the average of the two experiments.
Before each experiment, we restarted the whole cluster, cleared the OpenSearch cache, and removed all existing AD detectors, when applicable.
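A reset between runs might look like the following sketch (not our exact procedure): it clears the OpenSearch caches and removes detectors through the AD plugin API, assuming a client configured as in the earlier examples. Running detectors generally need to be stopped before deletion, so the sketch stops each one first:

```python
# Between benchmark runs: clear the OpenSearch caches and remove existing detectors.
client.indices.clear_cache()

# Find existing detectors through the AD plugin's detector search API.
found = client.transport.perform_request(
    "POST", "/_plugins/_anomaly_detection/detectors/_search",
    body={"query": {"match_all": {}}},
)

for hit in found["hits"]["hits"]:
    detector_id = hit["_id"]
    # Stop the detector if it is running, then delete it.
    client.transport.perform_request(
        "POST", f"/_plugins/_anomaly_detection/detectors/{detector_id}/_stop"
    )
    client.transport.perform_request(
        "DELETE", f"/_plugins/_anomaly_detection/detectors/{detector_id}"
    )
```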
The following table compares the query latency, CPU and memory usage, GC time, and average cold start time with and without AD under the query workload.
| Version | query latency P50 (ms) | query latency P90 (ms) | CPU usage P90 (%) | Memory usage P90 (%) | GC time young (ms) | cold start avg (ms) |
|---|---|---|---|---|---|---|
| No AD | | | | | | |
| 2.3 | 60.958 | 67.33 | 73 | 52.2 | 56,992 | |
| 2.2 | 62.649 | 69.636 | 68 | 52 | 60,831 | |
| 2.1 | 61.693 | 67.782 | 63.333 | 51 | 59,736 | |
| 2.0.1 | 67.682 | 71.785 | 69 | 48 | 59,151 | |
| 1.3.3 | 84.618 | 90.82 | 52 | 44 | 65,240 | |
| 1.2.4 | 77.54 | 85.376 | 61 | 45 | 40,103 | |
| 1.1.0 | 87.33 | 93.147 | 56.625 | 45.5 | 36,206 | |
| 1.0.1 | 77.224 | 82.556 | 62 | 47 | 37,899 | |
| With AD | | | | | | |
| 2.3 | 97.319 | 106.2 | 68.367 | 53 | 63,722 | 7.93 |
| 2.2 | 106.1 | 114.4 | 69.5 | 52 | 63,293 | 7.85 |
| 2.1 | 108.6 | 117.7 | 66.5 | 47 | 64,242 | 8.4 |
| 2.0.1 | 111.8 | 120.2 | 67 | 53 | 67,764 | 107.81 |
| 1.3.3 | 92.933 | 98.349 | 72 | 42 | 104,510 | 78.87 |
| 1.2.4 | 92.954 | 98.853 | 65.6 | 56 | 67,202 | 132.11 |
| 1.1.0 | 83.545 | 90.319 | 67 | 55 | 61,147 | 985.08 |
| 1.0.1 | 103.6 | 112 | 67 | 48.2 | 39,707 | 600,000 |
The following table compares the index latency, CPU and memory usage, GC time, and average cold start time with and without AD under the indexing workload.
| Version | index latency P50 (ms) | index latency P90 (ms) | CPU usage P90 (%) | Memory usage P90 (%) | GC time young (ms) | cold start avg (ms) |
|---|---|---|---|---|---|---|
| No AD | | | | | | |
| 2.3 | 510.6 | 717.8 | 64 | 50 | 13,881 | |
| 2.2 | 510.9 | 709 | 61 | 50 | 13,994 | |
| 2.1 | 515.5 | 723.9 | 77 | 51.333 | 14,227 | |
| 2.0.1 | 534.6 | 755 | 77 | 51.333 | 14,386 | |
| 1.3.3 | 548.6 | 776.4 | 72 | 49 | 30,752 | |
| 1.2.4 | 604.3 | 840 | 64 | 56 | 16,767 | |
| 1.1.0 | 579.3 | 821.6 | 69 | 51 | 16,286 | |
| 1.0.1 | 563.7 | 795.6 | 68.375 | 55 | 16,730 | |
| With AD | | | | | | |
| 2.3 | 565.5 | 808.5 | 65 | 56 | 17,487 | 9.4361 |
| 2.2 | 530.4 | 731.4 | 63 | 52 | 16,126 | 59.67 |
| 2.1 | 487.2 | 679.7 | 71.5 | 47 | 15,341 | 6.9428 |
| 2.0.1 | 481.3 | 680.5 | 72 | 53 | 21,098 | 71.2371 |
| 1.3.3 | 636.9 | 894 | 85.333 | 45 | 44,675 | 48.7008 |
| 1.2.4 | 618.7 | 870 | 79 | 56 | 30,081 | 47.8312 |
| 1.1.0 | 622 | 860 | 66.5 | 54.5 | 40,208 | 577.8122 |
| 1.0.1 | 583.4 | 824.7 | 57 | 54.667 | 17,642 | 600,000 |
Conclusion
While the benchmark data produced in our experiments could be viewed as less than perfect, or perhaps more illustrative of a thought experiment in physics, the improvements were real.
We do expect that explainability, the ability to understand the impact of a model on your nodes, will outstrip the demands of inference within analytics. After all, the multiple invocations needed to explain a result often take more computational power than scoring the data itself. It is simpler to provide explainability based directly on the algorithm being used than to retrofit interpretations. Could streaming algorithms provide a way to consolidate those multiple API invocations and thereby improve explainability inside the AD model?
If that question or other questions related to machine learning interest you, we would love to hear from you about your experience.
- To discuss topics with other OpenSearch users, start a conversation at forum.opensearch.org.
- To request improvements to the AD plugin, create an issue in the Anomaly Detection plugin repository.