Platform

Capabilities

Community

Documentation

Most Recent Articles
Meet OSCAR: The AI chat assistant that makes OpenSearch releases easier for e...	Sep 03
Introducing OpenSearch Benchmark 2.0	Aug 21
OpenSearch MCP server: Empowering the Amazon EC2 team's droplet analysis	Aug 18
Diving deep into distributed microservices with OpenSearch and OpenTelemetry	Aug 12
Transforming bucket aggregations: Our journey to 100x performance improvement	Aug 05
Performance optimizations for the OpenSearch security layer	Jul 30
Scaling vector generation: Batch ML inference with OpenSearch Ingestion and M...	Jul 30
OpenSearch User Groups: Building a Global Community	Jul 18
OpenSearch Approximation Framework	Jul 14
The new semantic field: Simplifying semantic search in OpenSearch	Jul 11

SQL search relevance, selective aggregation, and window functions in OpenSearch

Mon, Sep 12, 2022 · Chen Dai, Joshua Bright

According to a review by IEEE Spectrum in 2022, SQL is the sixth most popular programming language. IEEE came to this conclusion by pulling and weighting data across GitHub, Google, Stack Overflow, Twitter, and IEEE Xplore. Did you know that OpenSearch offers a way to query OpenSearch using SQL? In recent releases, the SQL plugin included support for search relevance, selective aggregation, and window functions. You can use the REST API or use the OpenSearch Query Workbench to query OpenSearch using the SQL plugin. Today, we will use the Query Workbench inside the OpenSearch Playground environment to walk through search relevance, selective aggregation, and window functions.

We’ll use sample data that everyone can access through the playground environment or the demo Docker install. First, add “Sample web logs” if they aren’t already added. Next, open the Query Workbench. Now let’s dive in. Let’s say that you were a security researcher and wanted to keep track of your system activity. You can use search relevance, selective aggregation, and window functions to monitor this activity.

Search relevance

Search relevance is a powerful tool that allows you to search through data, such as logs, to find specific documents. In the example below you can use count to determine how many logs fit your selected criteria.

Search through logs by client IP addresses and error counts for GET requests resulting in a 503 server error:

SELECT clientip, COUNT(*) AS cnt
FROM opensearch_dashboards_sample_data_logs
WHERE MATCH(message, "GET 503", operator="AND")
GROUP BY clientip
ORDER BY cnt DESC

Additionally, the highlight function returns text with matched terms highlighted:

SELECT clientip, HIGHLIGHT(message)
FROM opensearch_dashboards_sample_data_logs
WHERE MATCH_PHRASE(message, "Linux x86_64")

Selective aggregation

Selective aggregation is useful when you want to understand a total of something across an entire dataset. The SQL plugin will also allow you to filter with certain criteria. Say you want to keep track of large file transfers in your system in order to react to someone moving sensitive data outside of your place of work. In the example below, selective aggregation is used to determine how many HTTP requests were submitted where the bytes count was higher than 10,000:

SELECT
    COUNT(*) AS totalReq,
    COUNT(*) FILTER(WHERE bytes > 10000) AS totalLargeReq
FROM opensearch_dashboards_sample_data_logs

Window functions

Aggregate window functions provide users an easy way to calculate aggregate results over a custom window of time. For example, if you wanted to understand how many web calls occurred each day in order to identify abnormal traffic patterns, you could use the following statement:

SELECT
    date,
    SUM(dailyCnt) OVER(ORDER BY date) AS cumulative
FROM (
    SELECT
     DATE_FORMAT(timestamp, '%Y/%m/%d') AS date,
     COUNT(*) AS dailyCnt
    FROM opensearch_dashboards_sample_data_logs
    GROUP BY date
) AS tmp
ORDER BY date

But what if you wanted to monitor the top 10 websites visited by employees of your company? You could use a classic top-K elements per group problem. The following statement queries the top 10 most visited URLs for each day by using the ranking window function:

SELECT date, url, cnt
FROM (
    SELECT
     date, url, cnt,
     RANK() OVER(PARTITION BY date ORDER BY cnt DESC) AS rnk
    FROM (
      SELECT
        DATE_FORMAT(timestamp, '%Y/%m/%d') AS date,
        url AS url,
        COUNT(*) AS cnt
      FROM opensearch_dashboards_sample_data_logs
      GROUP BY date, url
    ) AS a
) AS r
WHERE rnk <= 10
ORDER BY date, rnk

How are you using OpenSearch’s SQL in your environment? Do you have use cases that have worked for you? We would love to hear about it! As always, if you don’t see functionality you would like or want to get involved, please visit the SQL Github repository to file an issue.

« OpenSearch 2.3.0 is ready for download! Snapshot Operations in OpenSearch »

Blog

SQL search relevance, selective aggregation, and window functions in OpenSearch

Search relevance

Selective aggregation

Window functions

Participate

Providers

Resources