
CONCEPT Cited by 1 source

Vector Quantization

Definition

Vector quantization (in the context of vector search) is the compression of embedding vectors from their full-precision representation (typically float32) to a smaller encoding (int8, 4-bit, 2-bit, 1-bit / binary, or product-quantization codebook indices), trading a small reduction in nearest-neighbor search recall for a substantial reduction in the memory and storage cost of the index.

(Source: sources/2026-04-21-figma-the-infrastructure-behind-ai-search-in-figma)

The cost driver it targets

A dense vector index's hot working set is dominated by the raw vectors themselves. For a 1024-dimension float32 embedding, each vector is 4 KB, so a billion-vector index carries 4 TB of vector payload alone, before the ANN structure's auxiliary memory. OpenSearch k-NN (HNSW by default) keeps vectors in memory for low-latency search, so the cluster RAM budget scales linearly with corpus size and dimensionality.
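
The arithmetic above can be sketched in a few lines (decimal TB; the 1,000,000,000 × 1024-dim sizing mirrors the example, and ANN graph overhead is ignored):

```python
# Back-of-envelope memory for the raw vector payload of an index
# at different bit widths. Ignores the ANN structure's overhead.
def index_bytes(n_vectors: int, dims: int, bits_per_dim: float) -> int:
    """Bytes needed to store the raw vector payload."""
    return int(n_vectors * dims * bits_per_dim / 8)

N, D = 1_000_000_000, 1024
for label, bits in [("float32", 32), ("int8", 8), ("4-bit", 4), ("1-bit", 1)]:
    tb = index_bytes(N, D, bits) / 1e12
    print(f"{label:>7}: {tb:.2f} TB")
```

At 32 bits per dimension this reproduces the ~4 TB payload figure; each halving of bits halves the payload.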

The quantization trade:

  • float32 → int8 (8-bit scalar quantization): ~4× memory reduction.
  • float32 → 4-bit: ~8× memory reduction.
  • float32 → 1-bit / binary: up to 32× memory reduction; recall degrades more noticeably.
  • Product Quantization (PQ): codebook-based, cuts memory even further; controlling the recall cost is more involved.

Each compression step costs some recall. The art is tuning to the recall floor the product tolerates.
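
The int8 row of the trade can be sketched in numpy (a minimal symmetric quantizer; production engines fit scales per segment and often quantize asymmetrically, so treat this as the shape of the trade, not an implementation):

```python
import numpy as np

# Symmetric int8 scalar quantization of one embedding.
rng = np.random.default_rng(0)
v = rng.standard_normal(1024).astype(np.float32)   # 4096 bytes

scale = np.abs(v).max() / 127.0                    # map [-max, max] -> [-127, 127]
q = np.clip(np.round(v / scale), -127, 127).astype(np.int8)  # 1024 bytes
v_hat = q.astype(np.float32) * scale               # decode for comparison

mem_ratio = v.nbytes / q.nbytes                    # 4x smaller
err = np.abs(v - v_hat).max()                      # rounding error, bounded by scale/2
print(mem_ratio, err)
```

The per-coordinate error is bounded by half the scale step, which is what keeps the recall hit small when the embedding's dynamic range is well behaved.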

OpenSearch k-NN framing

OpenSearch's k-NN plugin exposes quantization via its knn-vector-quantization feature. By default, k-NN stores each dimension as a float32; enabling quantization swaps that storage layout for a smaller representation at indexing time, and the search path either decodes at query time or operates directly in quantized space, depending on the mode.

"By default, OpenSearch's kNN plugin represents each element of the embedding as a four byte float, but vector quantization is a technique to compress the size of embeddings to reduce the memory required to store and search them, at the cost of a small reduction in nearest neighbor search accuracy." (Figma Engineering, 2026-04-21)
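
As a concrete sketch, an index body enabling scalar quantization might look like the following (faiss engine with the fp16 "sq" encoder, one of the available modes; field names follow OpenSearch's k-NN mapping docs, but verify the exact parameters against your plugin version):

```python
import json

# Hypothetical OpenSearch index body: knn_vector field with faiss
# scalar quantization (fp16 encoder, ~2x memory reduction).
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "l2",
                    "parameters": {
                        "encoder": {"name": "sq", "parameters": {"type": "fp16"}}
                    },
                },
            }
        }
    },
}
print(json.dumps(index_body, indent=2))
```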

Figma AI Search instance

OpenSearch was the second-biggest cost driver in Figma's AI Search infrastructure, after the frame-enumeration-and-thumbnailing step. Vector quantization was one of two index-shrinking optimizations Figma deployed:

  1. Pre-quantization scope reduction (see patterns/selective-indexing-heuristics): remove drafts, within-file duplicates, and unmodified file copies — cuts the index in half.
  2. Vector quantization for the embeddings that survive (1): shrinks each remaining vector in memory.

Figma does not disclose which quantization mode (scalar int8, 4-bit, binary, PQ) or the resulting recall impact.

Quantization vs dimension reduction

Worth distinguishing from an adjacent optimization:

                   Quantization                                        Dimension reduction (PCA / learned projection)
  What changes     Bits per dimension                                  Number of dimensions
  Who decides      Index-time policy                                   Model / training-time choice
  Recall impact    Small-to-moderate, roughly predictable per bit      Heavily dependent on the original model's redundancy
  Where applied    In the vector DB                                    Before insertion into the DB

Both reduce per-vector bytes. Quantization leaves the geometry alone and lossy-compresses each coordinate; dim reduction projects the geometry into a lower-rank space.
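
A toy contrast of the two levers (the projection matrix here is random, standing in for PCA purely to show the shape; both paths land on the same byte count by different routes):

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.standard_normal(1024).astype(np.float32)          # 4096 bytes

# Quantization: same 1024 coordinates, fewer bits each.
scale = np.abs(v).max() / 127.0
q = np.round(v / scale).astype(np.int8)                   # 1024 bytes, 1024 dims

# Dimension reduction: fewer coordinates, full precision each.
P = rng.standard_normal((256, 1024)).astype(np.float32) / np.sqrt(256)
r = P @ v                                                  # 1024 bytes, 256 dims

print(q.nbytes, q.shape, r.nbytes, r.shape)
```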

Relationship to product quantization in ANN literature

Classic product quantization (PQ) — splitting each vector into sub-vectors and quantizing each via a learned codebook — is itself a form of vector quantization, and is the technique behind many disk-based ANN systems (DiskANN, FAISS with PQ). OpenSearch k-NN supports PQ among its quantization modes. The term "vector quantization" in product-engineering writing often leaves the specific mode implicit; if the post doesn't name it, don't assume.
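
The encode/decode shape of PQ, in miniature (real PQ learns the codebooks with k-means on training vectors; the random codebooks here only keep the sketch self-contained):

```python
import numpy as np

# Toy PQ: 1024 dims -> M=8 sub-vectors, each mapped to the nearest of
# K=256 codewords, so a vector stores as 8 uint8 codes (512x smaller
# than float32).
rng = np.random.default_rng(2)
M, K, D = 8, 256, 1024
sub = D // M
codebooks = rng.standard_normal((M, K, sub)).astype(np.float32)

def pq_encode(v):
    parts = v.reshape(M, sub)
    dists = ((codebooks - parts[:, None, :]) ** 2).sum(axis=2)  # (M, K)
    return dists.argmin(axis=1).astype(np.uint8)                # nearest codeword per part

def pq_decode(codes):
    return np.concatenate([codebooks[m, codes[m]] for m in range(M)])

v = rng.standard_normal(D).astype(np.float32)
codes = pq_encode(v)
print(codes.nbytes, pq_decode(codes).shape)                     # 8 bytes vs 4096
```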

When to apply

  • Memory-bound vector indexes at billion-scale. The canonical fit — RAM is expensive, per-vector bytes dominate.
  • Products that can tolerate small recall drops. A 2% recall regression is often invisible to users when ranking is layered on top (lexical fusion, post-retrieval reranker, learned ranker).
  • When ANN structure itself is already in use. You're already in the "ANN trades recall for latency" regime; quantization is the storage lever in the same trade space.

Don't apply when:

  • Recall is an absolute product requirement (safety, legal discovery, certain scientific retrieval).
  • Corpus is small enough that exact k-NN fits in RAM at float32. Don't pay a recall cost you don't need.
  • Embedding model training/distribution is sensitive to quantization noise. Some embeddings (particularly short or dense ones) degrade more than others; always bench against the eval set.
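
The benchmarking the last point calls for can be sketched as recall@k of the quantized index against exact float32 search (synthetic data here; substitute the product's own corpus, queries, and quantizer):

```python
import numpy as np

rng = np.random.default_rng(3)
corpus = rng.standard_normal((2000, 64)).astype(np.float32)
queries = rng.standard_normal((50, 64)).astype(np.float32)
k = 10

def topk(qs, docs, k):
    scores = qs @ docs.T                      # inner-product similarity
    return np.argsort(-scores, axis=1)[:, :k]

# int8-quantize the corpus, then compare top-k overlap with float32.
scale = np.abs(corpus).max() / 127.0
corpus_q = np.round(corpus / scale).astype(np.int8)

exact = topk(queries, corpus, k)
approx = topk(queries, corpus_q.astype(np.float32) * scale, k)
recall = np.mean([len(set(a) & set(b)) / k for a, b in zip(exact, approx)])
print(f"recall@{k}: {recall:.3f}")
```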

Caveats

  • Recall measurement is mandatory. "Small recall hit" in vendor docs is model- and corpus-specific. Offline-eval the quantization mode against the same labelled queries the product is actually judged on.
  • Index re-quantization is not free. Changing quantization mode requires reindexing the affected portion of the corpus, if not all of it. Plan rollouts accordingly.
  • Quantization interacts with distance metric. Asymmetric modes (query in float32, corpus in int8) recover some recall at the cost of query-side compute. Read the docs for the specific plugin / library.
  • Naming overlap. Vector quantization in ML / CV literature can also refer to unrelated concepts (VQ-VAE in generative modeling). Context disambiguates.
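
The asymmetric mode mentioned above, sketched in numpy (real engines fold the scale into the scoring kernel rather than materializing a float copy; this only shows the scoring math):

```python
import numpy as np

# Asymmetric scoring: corpus stored as int8, query kept in float32.
rng = np.random.default_rng(4)
corpus = rng.standard_normal((1000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

scale = np.abs(corpus).max() / 127.0
corpus_q = np.round(corpus / scale).astype(np.int8)

# Symmetric mode would quantize the query too; asymmetric keeps its
# full precision and rescales once per score.
scores = (corpus_q.astype(np.float32) @ query) * scale
exact = corpus @ query
print(np.corrcoef(scores, exact)[0, 1])
```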
