Aggregate by multiple terms in OpenSearch

Thu, Jul 21, 2022 · Peng Huo, David Tippett

Multi-terms aggregation allows you to group and sort results from a query. While terms aggregation has existed in OpenSearch for some time, multi-terms aggregation allows for sorting by deeper levels. This is particularly useful in the observability space. Say, for example, you need to identify servers that have the most contention for CPU so you can either redistribute the load or scale up the server.

You can now do this with multi-terms aggregation, as shown by the demo below. This can be run from the Dev Tools section in OpenSearch Dashboards. If you need a test environment, you can spin up a single-node development environment using Docker Compose (make sure you have at least 4 GB of memory available in PreferencesResources).

Environment setup

First, we will set up a test index for storing our sample data. We can do this by issuing a PUT request with the index name we want to create—in this case, test_0001. The body in the example below contains the settings and mappings that we wish to use for each of the fields.

PUT test_0001
{
  "settings": {
    "index": {
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "properties": {
      "region": {
        "type": "keyword"
      },
      "host": {
        "type": "keyword"
      }, 
      "container": {
        "type": "keyword"
      },       
      "cpu": {
        "type": "integer"
      }       
    }
  }
}

Next, we can use the below GET request on the _mapping endpoint to validate the mappings have been created properly for our test index.

GET /test_0001/_mapping

Finally, we will add some test data to practice with. Our data here has four fields: region, host, container, and CPU.

PUT /test_0001/_bulk
{ "index" : { "_index" : "test_0001", "_id" : "1" } }
{ "region" : "iad", "host" : "h1", "container": "c1", "cpu": 10}
{ "index" : { "_index" : "test_0001", "_id" : "2" } }
{ "region" : "iad", "host" : "h1", "container": "c2", "cpu": 15}
{ "index" : { "_index" : "test_0001", "_id" : "3" } }
{ "region" : "iad", "host" : "h2", "container": "c1", "cpu": 20}
{ "index" : { "_index" : "test_0001", "_id" : "4" } }
{ "region" : "iad", "host" : "h2", "container": "c2", "cpu": 50}
{ "index" : { "_index" : "test_0001", "_id" : "5" } }
{ "region" : "dub", "host" : "h1", "container": "c1", "cpu": 50}
{ "index" : { "_index" : "test_0001", "_id" : "6" } }
{ "region" : "dub", "host" : "h1", "container": "c2", "cpu": 90}
{ "index" : { "_index" : "test_0001", "_id" : "7" } }
{ "region" : "dub", "host" : "h2", "container": "c1", "cpu": 50}
{ "index" : { "_index" : "test_0001", "_id" : "8" } }
{ "region" : "dub", "host" : "h2", "container": "c2", "cpu": 70}

Using multi-terms aggregation

Now for the real fun! For those who may not have worked with aggregations, “hot” is an arbitrary name for this aggregation. Additionally, "size": 0 specifies that we do not want to return the documents that contributed to the query. The multi_terms field accepts keyword fields that are then used as buckets for aggregation. This is similar to how the SQL GROUP BY statement works, where the order in which the terms are specified determines the order in which they are grouped in the results.

The order section specifies an aggregation that should be used to order the buckets. The max-cpu aggregation finds the maximum value of the CPU metric for each bucket. This layering of sorting is what makes multi-terms aggregation so powerful.

GET /test_0001/_search
{
  "size": 0, 
  "aggs": {
    "hot": {
      "multi_terms": {
        "terms": [{
          "field": "region" 
        },{
          "field": "host" 
        }],
        "order": {"max-cpu": "desc"}
      },
      "aggs": {
        "max-cpu": { "max": { "field": "cpu" } }
      }      
    }
  }
}

Below we can see the results of our query. As a reminder, “hits” is empty here because we specified “size”: 0 earlier, meaning we don’t want to return the documents that contributed to the aggregations. Now we can see clearly that the region dub has the servers with the highest CPU contention—in particular, host h1.

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 8,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "hot" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : [
            "dub",
            "h1"
          ],
          "key_as_string" : "dub|h1",
          "doc_count" : 2,
          "max-cpu" : {
            "value" : 90.0
          }
        },
        {
          "key" : [
            "dub",
            "h2"
          ],
          "key_as_string" : "dub|h2",
          "doc_count" : 2,
          "max-cpu" : {
            "value" : 70.0
          }
        },
        {
          "key" : [
            "iad",
            "h2"
          ],
          "key_as_string" : "iad|h2",
          "doc_count" : 2,
          "max-cpu" : {
            "value" : 50.0
          }
        },
        {
          "key" : [
            "iad",
            "h1"
          ],
          "key_as_string" : "iad|h1",
          "doc_count" : 2,
          "max-cpu" : {
            "value" : 15.0
          }
        }
      ]
    }
  }
}

Wrapping it up

With this information we can now see which regions and hosts have the most contention for CPU resources. This is just one example of how multi-terms aggregation can be useful. To learn more be sure to check out the documentation page or read the original pull request. Additionally, if you find any issues or have questions, feel free to either create an issue on GitHub or ask a question on the forum!