Aggregate by multiple terms in OpenSearch
Multi-terms aggregation allows you to group and sort results from a query. While terms aggregation has existed in OpenSearch for some time, multi-terms aggregation allows for sorting by deeper levels. This is particularly useful in the observability space. Say, for example, you need to identify servers that have the most contention for CPU so you can either redistribute the load or scale up the server.
You can now do this with multi-terms aggregation, as shown by the demo below. This can be run from the Dev Tools section in OpenSearch Dashboards. If you need a test environment, you can spin up a single-node development environment using Docker Compose (make sure you have at least 4 GB of memory available in Preferences → Resources).
Environment setup
First, we will set up a test index for storing our sample data. We can do this by issuing a PUT
request with the index name we want to create—in this case, test_0001
. The body in the example below contains the settings
and mappings
that we wish to use for each of the fields.
PUT test_0001
{
"settings": {
"index": {
"number_of_replicas": 0
}
},
"mappings": {
"properties": {
"region": {
"type": "keyword"
},
"host": {
"type": "keyword"
},
"container": {
"type": "keyword"
},
"cpu": {
"type": "integer"
}
}
}
}
Next, we can use the below GET
request on the _mapping
endpoint to validate the mappings have been created properly for our test index.
GET /test_0001/_mapping
Finally, we will add some test data to practice with. Our data here has four fields: region, host, container, and CPU.
PUT /test_0001/_bulk
{ "index" : { "_index" : "test_0001", "_id" : "1" } }
{ "region" : "iad", "host" : "h1", "container": "c1", "cpu": 10}
{ "index" : { "_index" : "test_0001", "_id" : "2" } }
{ "region" : "iad", "host" : "h1", "container": "c2", "cpu": 15}
{ "index" : { "_index" : "test_0001", "_id" : "3" } }
{ "region" : "iad", "host" : "h2", "container": "c1", "cpu": 20}
{ "index" : { "_index" : "test_0001", "_id" : "4" } }
{ "region" : "iad", "host" : "h2", "container": "c2", "cpu": 50}
{ "index" : { "_index" : "test_0001", "_id" : "5" } }
{ "region" : "dub", "host" : "h1", "container": "c1", "cpu": 50}
{ "index" : { "_index" : "test_0001", "_id" : "6" } }
{ "region" : "dub", "host" : "h1", "container": "c2", "cpu": 90}
{ "index" : { "_index" : "test_0001", "_id" : "7" } }
{ "region" : "dub", "host" : "h2", "container": "c1", "cpu": 50}
{ "index" : { "_index" : "test_0001", "_id" : "8" } }
{ "region" : "dub", "host" : "h2", "container": "c2", "cpu": 70}
Using multi-terms aggregation
Now for the real fun! For those who may not have worked with aggregations, “hot” is an arbitrary name for this aggregation. Additionally, "size": 0
specifies that we do not want to return the documents that contributed to the query. The multi_terms
field accepts keyword fields that are then used as buckets for aggregation. This is similar to how the SQL GROUP BY
statement works, where the order in which the terms are specified determines the order in which they are grouped in the results.
The order
section specifies an aggregation that should be used to order the buckets. The max-cpu
aggregation finds the maximum value of the CPU metric for each bucket. This layering of sorting is what makes multi-terms aggregation so powerful.
GET /test_0001/_search
{
"size": 0,
"aggs": {
"hot": {
"multi_terms": {
"terms": [{
"field": "region"
},{
"field": "host"
}],
"order": {"max-cpu": "desc"}
},
"aggs": {
"max-cpu": { "max": { "field": "cpu" } }
}
}
}
}
Below we can see the results of our query. As a reminder, “hits” is empty here because we specified “size”: 0 earlier, meaning we don’t want to return the documents that contributed to the aggregations. Now we can see clearly that the region dub has the servers with the highest CPU contention—in particular, host h1.
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"hot" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : [
"dub",
"h1"
],
"key_as_string" : "dub|h1",
"doc_count" : 2,
"max-cpu" : {
"value" : 90.0
}
},
{
"key" : [
"dub",
"h2"
],
"key_as_string" : "dub|h2",
"doc_count" : 2,
"max-cpu" : {
"value" : 70.0
}
},
{
"key" : [
"iad",
"h2"
],
"key_as_string" : "iad|h2",
"doc_count" : 2,
"max-cpu" : {
"value" : 50.0
}
},
{
"key" : [
"iad",
"h1"
],
"key_as_string" : "iad|h1",
"doc_count" : 2,
"max-cpu" : {
"value" : 15.0
}
}
]
}
}
}
Wrapping it up
With this information we can now see which regions and hosts have the most contention for CPU resources. This is just one example of how multi-terms aggregation can be useful. To learn more be sure to check out the documentation page or read the original pull request. Additionally, if you find any issues or have questions, feel free to either create an issue on GitHub or ask a question on the forum!