Reduce costs with disk-based vector search

Wed, Feb 19, 2025 · Jack Mazanec, Vikash Tiwari, Vamshi Vijay Nakkirtha, Fanit Kolchina

Vector search has gained significant attention in the area of information retrieval thanks to advances in natural language embedding models. These models map domain-specific data into a vector space, where similar pieces of data are positioned close to each other. When you run a search, the query is embedded into this space, and a nearest neighbor search identifies the most similar results based on distance calculations.

Vectors are fixed-dimension arrays of numeric values (float, byte, or bit). The dimension of the vector and the function used to compare vectors are determined by the embedding model. For example, when you pass a chunk of text to the Cohere v3 multilingual embedding model, the model returns a 1,024-dimensional vector of 32-bit floating-point numbers representing the text in the vector space, as shown in the following image.

Vector layout

To determine similarity, the “distance” between vectors is computed using functions like the dot product or the Euclidean metric. The choice of distance function is typically dictated by the model that generates the embeddings. Vector search excels at capturing the implicit meaning of data, often leading to more accurate results. Additionally, combining vector search with traditional text-based search can further improve search quality.
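
As a quick illustration of these distance calculations, the following Python sketch (the vectors here are made-up examples, not model output) computes the dot product and the Euclidean distance between two small vectors using NumPy:

import numpy as np

# Two illustrative 4-dimensional embeddings (values are made up)
query = np.array([0.12, -0.45, 0.78, 0.33], dtype=np.float32)
doc   = np.array([0.10, -0.40, 0.80, 0.30], dtype=np.float32)

dot_product = np.dot(query, doc)           # higher value = more similar
euclidean   = np.linalg.norm(query - doc)  # lower value = more similar

print(f"dot product: {dot_product:.4f}, Euclidean distance: {euclidean:.4f}")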

Vector-based indexes can be very large. For example, storing 1 billion 768-dimensional floating-point embeddings requires approximately 10^9 × 768 × 4 bytes ≈ 2,861 GiB (about 3 TB) of storage. This cost increases further when replicas are enabled and additional metadata is added. Traditionally, many efficient nearest neighbor algorithms require vectors to be fully resident in memory to enable fast search because search involves random access patterns. This makes it difficult to efficiently retrieve vectors from disk and leads to high costs as the number of vectors increases, as shown in the following figure.

Number of vectors compared to memory requirement

To reduce the memory footprint of vector indexes, vector quantization techniques are typically employed. Quantization compresses vectors into smaller representations while trying to minimize information loss. For example, FP16 quantization converts 32-bit floats to 16-bit floats, reducing memory usage by half. Because many datasets do not require the full 32-bit precision, we rarely see a reduction in accuracy with this approach.
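
As a minimal illustration of FP16 quantization (a NumPy sketch, not the OpenSearch implementation), casting a vector from 32-bit to 16-bit floats halves its size in bytes:

import numpy as np

# A 1,024-dimensional float32 vector, matching the Cohere v3 example above
vec32 = np.random.rand(1024).astype(np.float32)
vec16 = vec32.astype(np.float16)

print(vec32.nbytes, vec16.nbytes)  # 4096 bytes vs. 2048 bytes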

In addition to FP16, quantization can be used to compress vectors even further, with compression factors of 4x, 8x, 16x, 32x, and beyond. Reducing memory usage further translates to significant cost savings.
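
To see what these compression levels mean in practice, the following back-of-the-envelope sketch applies them to the 1-billion-vector example above (raw vector storage only, ignoring replicas and metadata):

# Raw storage for 1 billion 768-dimensional float32 vectors
num_vectors = 1_000_000_000
dimension = 768
bytes_per_float = 4

full_precision = num_vectors * dimension * bytes_per_float
print(f"full precision: {full_precision / 1024**3:,.0f} GiB")  # ~2,861 GiB

# Approximate in-memory footprint of the quantized vectors at each compression level
for compression in (4, 8, 16, 32):
    print(f"{compression}x compression: {full_precision / compression / 1024**3:,.0f} GiB")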

However, quantization comes with a trade-off: The more compression that is applied, the greater the potential reduction in search accuracy. Fortunately, with improvements in secondary storage, it’s possible to trade some search latency for accurate nearest neighbor search in a low-memory environment.

Disk-based vector search in OpenSearch

In 2.17, we introduced an on_disk mode for knn_vector fields that enables disk-based vector search in OpenSearch, allowing searches to run in lower-memory environments. This is achieved using a two-phase query approach, illustrated in the following diagram.

High-level two-phase query approach

First, during ingestion, we build indexes using configurable quantization mechanisms and store the full-precision vectors on disk. The quantized indexes, which reside fully in memory during search, require significantly less memory due to compression.

During search, we first query the quantized indexes to retrieve more than k candidate results. Then we lazily load the full-precision vectors of these candidates into memory and recompute the distance to the query. This reordering step ensures accuracy, and we return only the top k results.
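
The following Python sketch illustrates this flow. It is not the actual OpenSearch implementation; the quantized_index and full_precision_store objects and their methods are hypothetical placeholders:

import numpy as np

def two_phase_search(query, k, oversample_factor, quantized_index, full_precision_store):
    """Illustrative two-phase search: quantized first pass, then full-precision rescore."""
    # Phase 1: search the in-memory quantized index for more than k candidates.
    num_candidates = int(k * oversample_factor)
    candidate_ids = quantized_index.search(query, num_candidates)  # placeholder API

    # Phase 2: lazily load full-precision vectors from disk and recompute exact distances.
    rescored = []
    for doc_id in candidate_ids:
        vector = full_precision_store.load(doc_id)  # placeholder API
        distance = np.linalg.norm(query - vector)   # e.g., Euclidean (l2) distance
        rescored.append((distance, doc_id))

    # Reorder by exact distance and return only the top k results.
    rescored.sort()
    return [doc_id for _, doc_id in rescored[:k]]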

Empirically, this two-phase approach has demonstrated the ability to produce high-recall results across a variety of quantization techniques.

Introducing online binary quantization

As part of this feature, we also expanded our quantization support. Previously, achieving ≥8x compression in OpenSearch required using product quantization. While effective, product quantization requires a training step before ingestion can begin, and users must manage the resulting models.

In 2.17, we introduced several new online quantization techniques for 8x, 16x, and 32x compression—no pretraining needed. You can begin ingestion immediately without additional model preparation. This simplifies the process and reduces overhead. Take a look at this technical deep dive to learn how we developed these capabilities.

Consider the following 8-dimensional vectors of 32-bit floating-point numbers:

v1 = [0.56, 0.85, 0.53, 0.25, 0.46, 0.01, 0.63, 0.73]
v2 = [-0.99, -0.79, 0.23, -0.62, 0.87, -0.06, -0.24, -0.75]
v3 = [-0.15, 0.17, 0.10, 0.46, -0.79, -0.31, 0.36, -1.00]

To quantize them, we perform the following steps.

1. Calculate the mean per dimension

The mean for each dimension j is calculated using the following formula:

Mean Formula

The preceding set of vectors produces the following calculated mean:

Mean = [-0.19, 0.08, 0.29, 0.03, 0.18, -0.12, 0.25, -0.34]
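
To verify this, here is a minimal NumPy sketch (illustrative code, not part of OpenSearch) that averages the three vectors along each dimension:

import numpy as np

vectors = np.array([
    [0.56, 0.85, 0.53, 0.25, 0.46, 0.01, 0.63, 0.73],       # v1
    [-0.99, -0.79, 0.23, -0.62, 0.87, -0.06, -0.24, -0.75],  # v2
    [-0.15, 0.17, 0.10, 0.46, -0.79, -0.31, 0.36, -1.00],    # v3
])

mean = vectors.mean(axis=0)  # mean per dimension
print(np.round(mean, 2))     # [-0.19  0.08  0.29  0.03  0.18 -0.12  0.25 -0.34]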

2. Quantize a new vector

The quantization rule for each dimension j of a vector is given by the following logic:

Quantization Formula

Consider a new vector:

v_new = [0.45, -0.30, 0.67, 0.12, 0.25, -0.50, 0.80, 0.55]

To quantize this vector, we compare each dimension to the mean value. When a dimension exceeds the mean, we convert it to 1; otherwise, we convert it to 0, as shown in the following diagram.

Quantization Example
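
Continuing the sketch from step 1 (again illustrative NumPy code rather than OpenSearch internals), the new vector is quantized by comparing each dimension to the corresponding mean:

import numpy as np

# Per-dimension mean computed in step 1
mean = np.array([-0.19, 0.08, 0.29, 0.03, 0.18, -0.12, 0.25, -0.34])

v_new = np.array([0.45, -0.30, 0.67, 0.12, 0.25, -0.50, 0.80, 0.55])

# 1 where the dimension exceeds the mean, 0 otherwise
quantized = (v_new > mean).astype(np.uint8)
print(quantized)  # [1 0 1 1 1 0 1 1]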

To set up disk-based vector search, simply set the mode to on_disk in the index mappings. This parameter automatically configures a low-memory setting with rescoring enabled:

PUT my-vector-index
{
  "settings" : {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 8,
        "space_type": "innerproduct",
        "mode": "on_disk"
      }
    }
  }
}

To tune the quantization, you can specify the compression_level in the mapping. For on_disk mode, the default compression level is 32x:

PUT my-vector-index
{
  "settings" : {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 8,
        "space_type": "innerproduct",
        "mode": "on_disk",
        "compression_level": "16x"
      }
    }
  }
}

At search time, if mode is set to on_disk, OpenSearch automatically applies a two-phase search process:

GET my-vector-index/_search
{
  "query": {
    "knn": {
      "my_vector_field": {
        "vector": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5],
        "k": 5
      }
    }
  }
}

When you use on_disk mode, OpenSearch configures an oversample factor to return more than k results from the quantized index search. You can override this default setting by specifying an oversample_factor in the query body. For example, with k set to 5 and an oversample_factor of 10.0, as in the following query, the first phase retrieves 50 candidates, which are then rescored to return the top 5:

GET my-vector-index/_search
{
  "query": {
    "knn": {
      "my_vector_field": {
        "vector": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5],
        "k": 5,
        "rescore": {
            "oversample_factor": 10.0
        }
      }
    }
  }
}

For more information, see Disk-based vector search.

Experiments

To evaluate the performance of disk-based vector search, we conducted several experiments using various dataset sizes and configurations.

One million to ten million vector tests

We ran several different tests on a single-node OpenSearch cluster for datasets containing between 1 and 10 million vectors.

| Name | Space type | Normalized | Dimension | Index vector count | Notes |
|---|---|---|---|---|---|
| sift | l2 | No | 128 | 1,000,000 | Classic scale-invariant feature transform (SIFT) image descriptor dataset. |
| minillm-msmarco | l2 | Yes | 384 | 1,000,000 | Used minillm to encode a sample of ms-marco. |
| mpnet-msmarco | l2 | Yes | 768 | 1,000,000 | Used mpnet to encode a sample of ms-marco. |
| mxbai-msmarco | cosine | Yes | 1,024 | 1,000,000 | Used mxbai to encode a sample of ms-marco. |
| tasb-msmarco | ip | No | 768 | 1,000,000 | Used tasb to encode a sample of ms-marco. |
| e5small-msmarco | l2 | Yes | 384 | 8,841,823 | Used e5small to encode a larger sample of ms-marco. |
| snowflake-msmarco | l2 | Yes | 768 | 8,841,823 | Used snowflake-arctic-embed-m to encode a larger sample of ms-marco. |
| clip-flickr | l2 | No | 512 | 6,637,685 | Used the clip model to encode a random sample of images from Flickr. |

We conducted experiments comparing in_memory and on_disk modes using 1x, 2x, 4x, 8x, 16x, and 32x compression levels. Key testing parameters included the following:

  • on_disk mode: Includes a rescoring stage
  • in_memory mode: Does not include a rescoring stage
  • Hardware configuration:
    • in_memory tests: Amazon Elastic Compute Cloud (Amazon EC2) r6g.4xlarge instances
    • on_disk tests: Amazon EC2 r6gd.4xlarge instances with an attached solid-state drive (SSD)—as opposed to Amazon Elastic Block Store (Amazon EBS) storage
  • Resource controls: Used Docker to manage OpenSearch’s memory and CPU allocation

We obtained the following results, comparing recall to compression level.

Million Vector Results

The results show that combining quantization with full-precision rescoring can significantly improve recall while maintaining reasonable latency in low-memory environments. However, performance varies by dataset. For example, although this approach improves recall for the sift dataset, recall remains relatively low. We recommend testing with your specific dataset to determine the optimal configuration.

Large-scale tests

In addition to the smaller-scale tests, we also ran a test on a larger dataset, MS MARCO 2.1 encoded with the Cohere v3 model. We set up several different tests to showcase how this disk-based vector search compares to in_memory vector search at 8x, 16x, and 32x compression levels. We ran these tests using the following configuration.

| Parameter | Value |
|---|---|
| Dataset | MS MARCO 2.1 encoded with the Cohere v3 model |
| Dimension | 1,024 |
| Normalized | Yes |
| Space type | Cosine (inner product over normalized data) |
| Index vectors | 113M |

We tested four different cluster configurations using the opensearch-cluster-cdk with OpenSearch 2.18. We selected the cluster and index configuration to follow production recommendations. For example, we configured replica shards and dedicated cluster manager nodes. In addition, we targeted a shard count that provides 2 to 3 vCPUs per shard. The following table presents the configurations for these tests.

| Name | Data node count | Data node type | Data node disk size (GB) | Data node disk type | JVM size (GB) | Primary shard count | Replica shards | Compression level |
|---|---|---|---|---|---|---|---|---|
| in_memory | 8 | r6g.8xlarge | 300 | EBS | 32 | 40 | 1 | 1x |
| on_disk_8x | 10 | r6gd.2xlarge | 474 | Instance | 32 | 15 | 1 | 8x |
| on_disk_16x | 6 | r6gd.2xlarge | 474 | Instance | 32 | 9 | 1 | 16x |
| on_disk_32x | 4 | r6gd.2xlarge | 474 | Instance | 32 | 6 | 1 | 32x |

In the table, note that clusters with higher compression levels use significantly fewer resources than those without compression.
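
As a rough sanity check on this sizing (a sketch that assumes 32 vCPUs per r6g.8xlarge node and 8 vCPUs per r6gd.2xlarge node, and counts both primary and replica shards), the compute available per shard lands near the 2 to 3 vCPU target in every configuration:

# (data node count, vCPUs per node, primary shard count) for each configuration
configs = {
    "in_memory":   (8, 32, 40),
    "on_disk_8x":  (10, 8, 15),
    "on_disk_16x": (6, 8, 9),
    "on_disk_32x": (4, 8, 6),
}

for name, (nodes, vcpus_per_node, primaries) in configs.items():
    total_shards = primaries * 2  # each primary has one replica
    vcpus_per_shard = nodes * vcpus_per_node / total_shards
    print(f"{name}: {vcpus_per_shard:.1f} vCPUs per shard")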

To optimize performance in our tests, we made these additional adjustments:

  • Retrieved IDs from doc_values to reduce fetch time
  • Disabled _source storage

We ran all tests using OpenSearch Benchmark vector search workloads, following this procedure:

  1. Ingest the dataset.
  2. Force merge the index to 5 segments per shard.
  3. Run warm-up queries to load the index into memory.
  4. Run the single-client latency test.
  5. Run the multi-client throughput test.
  6. Repeat steps 4 and 5 with disk-based rescoring disabled to measure performance on a compressed index without rescoring.

We obtained the following large-scale test results.

| Metric/configuration | in_memory | on_disk_8x | in_memory_8x | on_disk_16x | in_memory_16x | on_disk_32x | in_memory_32x |
|---|---|---|---|---|---|---|---|
| recall@100 (ratio) | 0.95 | 0.98 | 0.98 | 0.97 | 0.96 | 0.94 | 0.95 |
| 1-client p90 search latency (ms) | 24.02 | 96.31 | 28.90 | 108.05 | 29.79 | 104.42 | 47.29 |
| 1-client mean throughput (QPS) | 40.64 | 11.03 | 42.33 | 10.06 | 43.95 | 10.65 | 31.83 |
| 4-client p90 search latency (ms) | 25.82 | 97.19 | 20.40 | 220.19 | 18.09 | 244.52 | 46.13 |
| 4-client mean throughput (QPS) | 162.80 | 45.86 | 204.62 | 25.80 | 193.27 | 25.28 | 146.77 |
| 8-client p90 search latency (ms) | 27.69 | 95.05 | 30.88 | 414.20 | 27.94 | 429.79 | 25.07 |
| 8-client mean throughput (QPS) | 306.70 | 95.22 | 305.86 | 26.34 | 343.95 | 25.75 | 376.60 |

Interestingly, for this dataset, the on-disk approach with rescoring produces similar recall to the in-memory approach without rescoring, but the in-memory approach is substantially faster. This is most likely because the Cohere v3 model has been optimized to work very well with binary quantized data (see this blog post).

Learnings

Our testing shows that the two-phase approximate nearest neighbor approach performs effectively in low-memory environments, though results vary significantly by dataset. When running your own experiments, we recommend testing with the index.knn.disk.vector.shard_level_rescoring_disabled setting both enabled and disabled to measure the performance benefit for your use case. Additionally, with disk-based search, ensure that your secondary storage is optimized for high read traffic—we found that SSDs generally provide the best results.

What’s next?

We have many new and exciting features coming for vector search in OpenSearch. In future versions, we’ll focus on improving quantization performance for all datasets, eliminating the need for fine-tuning. Follow our GitHub repo for continued improvements in both performance and functionality. As always, we welcome and appreciate your contributions and feature requests!