Most Recent Articles
Reducing hybrid query latency in OpenSearch 3.1 with efficient score collection	Jun 27
Announcing OpenSearch Data Prepper 2.12: Additional source and sinks for your...	Jun 26
Redline testing now available in OpenSearch Benchmark	Jun 23
Neural sparse models are now available in Hugging Face Sentence Transformers	Jun 11
Unlocking agentic AI experiences with OpenSearch	Jun 09
Enhanced log analysis with OpenSearch PPL: Introducing lookup, join, and subs...	May 30
Optimizing inference processors for cost efficiency and performance	May 29
Improve OpenSearch cluster performance by separating search and indexing work...	May 27
Introducing common filter support for hybrid search queries	May 23
Do more with less: Save up to 3x on storage with derived vector source	May 20

Aggregate by multiple terms in OpenSearch

Thu, Jul 21, 2022 · Peng Huo, David Tippett

Multi-terms aggregation allows you to group and sort results from a query. While terms aggregation has existed in OpenSearch for some time, multi-terms aggregation allows for sorting by deeper levels. This is particularly useful in the observability space. Say, for example, you need to identify servers that have the most contention for CPU so you can either redistribute the load or scale up the server.

You can now do this with multi-terms aggregation, as shown by the demo below. This can be run from the Dev Tools section in OpenSearch Dashboards. If you need a test environment, you can spin up a single-node development environment using Docker Compose (make sure you have at least 4 GB of memory available in Preferences → Resources).

Environment setup

First, we will set up a test index for storing our sample data. We can do this by issuing a PUT request with the index name we want to create—in this case, test_0001. The body in the example below contains the settings and mappings that we wish to use for each of the fields.

PUT test_0001
{
  "settings": {
    "index": {
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "properties": {
      "region": {
        "type": "keyword"
      },
      "host": {
        "type": "keyword"
      }, 
      "container": {
        "type": "keyword"
      },       
      "cpu": {
        "type": "integer"
      }       
    }
  }
}

Next, we can use the below GET request on the _mapping endpoint to validate the mappings have been created properly for our test index.

GET /test_0001/_mapping

Finally, we will add some test data to practice with. Our data here has four fields: region, host, container, and CPU.

PUT /test_0001/_bulk
{ "index" : { "_index" : "test_0001", "_id" : "1" } }
{ "region" : "iad", "host" : "h1", "container": "c1", "cpu": 10}
{ "index" : { "_index" : "test_0001", "_id" : "2" } }
{ "region" : "iad", "host" : "h1", "container": "c2", "cpu": 15}
{ "index" : { "_index" : "test_0001", "_id" : "3" } }
{ "region" : "iad", "host" : "h2", "container": "c1", "cpu": 20}
{ "index" : { "_index" : "test_0001", "_id" : "4" } }
{ "region" : "iad", "host" : "h2", "container": "c2", "cpu": 50}
{ "index" : { "_index" : "test_0001", "_id" : "5" } }
{ "region" : "dub", "host" : "h1", "container": "c1", "cpu": 50}
{ "index" : { "_index" : "test_0001", "_id" : "6" } }
{ "region" : "dub", "host" : "h1", "container": "c2", "cpu": 90}
{ "index" : { "_index" : "test_0001", "_id" : "7" } }
{ "region" : "dub", "host" : "h2", "container": "c1", "cpu": 50}
{ "index" : { "_index" : "test_0001", "_id" : "8" } }
{ "region" : "dub", "host" : "h2", "container": "c2", "cpu": 70}

Using multi-terms aggregation

Now for the real fun! For those who may not have worked with aggregations, “hot” is an arbitrary name for this aggregation. Additionally, "size": 0 specifies that we do not want to return the documents that contributed to the query. The multi_terms field accepts keyword fields that are then used as buckets for aggregation. This is similar to how the SQL GROUP BY statement works, where the order in which the terms are specified determines the order in which they are grouped in the results.

The order section specifies an aggregation that should be used to order the buckets. The max-cpu aggregation finds the maximum value of the CPU metric for each bucket. This layering of sorting is what makes multi-terms aggregation so powerful.

GET /test_0001/_search
{
  "size": 0, 
  "aggs": {
    "hot": {
      "multi_terms": {
        "terms": [{
          "field": "region" 
        },{
          "field": "host" 
        }],
        "order": {"max-cpu": "desc"}
      },
      "aggs": {
        "max-cpu": { "max": { "field": "cpu" } }
      }      
    }
  }
}

Below we can see the results of our query. As a reminder, “hits” is empty here because we specified “size”: 0 earlier, meaning we don’t want to return the documents that contributed to the aggregations. Now we can see clearly that the region dub has the servers with the highest CPU contention—in particular, host h1.

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 8,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "hot" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : [
            "dub",
            "h1"
          ],
          "key_as_string" : "dub|h1",
          "doc_count" : 2,
          "max-cpu" : {
            "value" : 90.0
          }
        },
        {
          "key" : [
            "dub",
            "h2"
          ],
          "key_as_string" : "dub|h2",
          "doc_count" : 2,
          "max-cpu" : {
            "value" : 70.0
          }
        },
        {
          "key" : [
            "iad",
            "h2"
          ],
          "key_as_string" : "iad|h2",
          "doc_count" : 2,
          "max-cpu" : {
            "value" : 50.0
          }
        },
        {
          "key" : [
            "iad",
            "h1"
          ],
          "key_as_string" : "iad|h1",
          "doc_count" : 2,
          "max-cpu" : {
            "value" : 15.0
          }
        }
      ]
    }
  }
}

Wrapping it up

With this information we can now see which regions and hosts have the most contention for CPU resources. This is just one example of how multi-terms aggregation can be useful. To learn more be sure to check out the documentation page or read the original pull request. Additionally, if you find any issues or have questions, feel free to either create an issue on GitHub or ask a question on the forum!

« Which Way is the Compass Pointing? Using Fluent Bit and OpenSearch with Bottlerocket and Kubelet logs »

Blog

Aggregate by multiple terms in OpenSearch

Environment setup

Using multi-terms aggregation

Wrapping it up

Participate

Providers

Resources

Platform

Capabilities

Community

Documentation

Blog

Aggregate by multiple terms in OpenSearch

Environment setup

Using multi-terms aggregation

Wrapping it up

Peng Huo

David Tippett