CONCEPT
Vector Quantization¶
Definition¶
Vector quantization (in the context of vector search) is
compressing embedding vectors from their full-precision
representation (typically float32) to a smaller encoding
(int8, 4-bit, 2-bit, 1-bit / binary, or product-quantization
codebook indices), trading a small reduction in nearest-neighbour
search recall for a substantial reduction in the memory and
storage cost of the index.
(Source: sources/2026-04-21-figma-the-infrastructure-behind-ai-search-in-figma)
The cost driver it targets¶
A dense vector index's hot working set is dominated by the raw
vectors themselves. For a 1024-dim float32 embedding, each vector
is 4 KB. A billion-vector index = 4 TB of vector payload
alone, before the ANN structure's auxiliary memory. OpenSearch k-NN
(HNSW by default) keeps vectors in memory for low-latency search, so
the cluster RAM budget scales linearly with corpus size and
dimensionality.
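A back-of-envelope check of that arithmetic (a toy sketch; the 1024-dim, billion-vector figures are illustrative, not Figma's disclosed numbers):

```python
# Raw vector payload for a dense float32 index (illustrative sizes, not Figma's).
dims = 1024
bytes_per_dim = 4                      # float32
num_vectors = 1_000_000_000

bytes_per_vector = dims * bytes_per_dim               # 4096 B = 4 KB
payload_tb = bytes_per_vector * num_vectors / 1e12    # ~4.1 TB before ANN overhead
print(f"per-vector: {bytes_per_vector} B, payload: {payload_tb:.1f} TB")

# The same payload under common quantization levels.
for name, bits in [("int8", 8), ("4-bit", 4), ("binary", 1)]:
    tb = num_vectors * dims * bits / 8 / 1e12
    print(f"{name}: {tb:.2f} TB ({32 // bits}x smaller)")
```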
The quantization trade:
- float32 → int8 (8-bit scalar quantization): ~4× memory reduction.
- float32 → 4-bit: ~8× memory reduction.
- float32 → 1-bit / binary: up to 32× memory reduction; recall degrades more noticeably.
- Product quantization (PQ): codebook-based, cuts memory even further; recall-accuracy control is more involved.
Each compression step costs some recall. The art is tuning to the recall floor the product tolerates.
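A minimal sketch of the first row of that trade, assuming a naive per-vector min/max 8-bit scalar quantizer (real engines typically use per-dimension ranges, learned clipping, or other schemes; this only shows the shape of the encode/decode and the ~4× byte saving):

```python
import numpy as np

def quantize_8bit(v: np.ndarray) -> tuple[np.ndarray, float, float]:
    """Map a float32 vector onto 256 levels using its own min/max range."""
    lo, hi = float(v.min()), float(v.max())
    scale = (hi - lo) / 255.0 or 1.0        # guard against constant vectors
    codes = np.round((v - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize_8bit(codes: np.ndarray, lo: float, scale: float) -> np.ndarray:
    """Approximate reconstruction, as used when the engine decodes before scoring."""
    return codes.astype(np.float32) * scale + lo

v = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
codes, lo, scale = quantize_8bit(v)
v_hat = dequantize_8bit(codes, lo, scale)

print("bytes:", v.nbytes, "->", codes.nbytes)            # 4096 -> 1024 (~4x)
print("max abs error:", float(np.abs(v - v_hat).max()))  # bounded by scale / 2
```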
OpenSearch k-NN framing¶
OpenSearch's k-NN plugin exposes quantization via its
knn-vector-quantization
feature. By default, k-NN stores each dim as float32; enabling
quantization swaps that storage layout for a smaller representation
during indexing, and the search path decodes (or operates directly in
quantized space, depending on mode) at query time.
"By default, OpenSearch's kNN plugin represents each element of the embedding as a four byte float, but vector quantization is a technique to compress the size of embeddings to reduce the memory required to store and search them, at the cost of a small reduction in nearest neighbor search accuracy." (Figma Engineering, 2026-04-21)
Figma AI Search instance¶
OpenSearch was the second-biggest cost driver in Figma's AI Search infrastructure, after the frame-enumeration-and-thumbnailing step. Vector quantization was one of two index-footprint optimizations Figma deployed:
- Pre-quantization scope reduction (see patterns/selective-indexing-heuristics): remove drafts, within-file duplicates, and unmodified file copies, which cuts the index in half.
- Vector quantization for the embeddings that survive the scope reduction: shrinks each remaining vector in memory.
Figma does not disclose which quantization mode (scalar int8, 4-bit, binary, PQ) or the resulting recall impact.
Quantization vs dimension reduction¶
Worth distinguishing from an adjacent optimization:
|   | Quantization | Dimension reduction (PCA / learned projection) |
|---|---|---|
| What changes | Bits per dim | Number of dims |
| Who decides | Index-time policy | Model / training-time choice |
| Recall impact | Small-to-moderate, roughly predictable per bit-level | Heavily dependent on original model's redundancy |
| Where applied | In the vector DB | Before insertion into the DB |
Both reduce per-vector bytes. Quantization leaves the geometry alone and lossy-compresses each coordinate; dim reduction projects the geometry into a lower-rank space.
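A toy numpy contrast between the two levers, using a random projection as a stand-in for PCA or a learned projection (both variants below land at roughly 4× fewer bytes, but via different axes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 1024)).astype(np.float32)   # toy corpus embeddings

# Quantization: keep all 1024 dims, spend 8 bits per dim instead of 32.
lo, scale = X.min(), (X.max() - X.min()) / 255.0
X_quant = np.round((X - lo) / scale).astype(np.uint8)

# Dimension reduction: keep float32, keep fewer dims (random projection here,
# standing in for PCA or a projection trained alongside the embedding model).
P = rng.standard_normal((1024, 256)).astype(np.float32) / np.sqrt(256)
X_reduced = X @ P

print("float32 x 1024 dims:", X.nbytes // 2**20, "MiB")
print("8-bit   x 1024 dims:", X_quant.nbytes // 2**20, "MiB")
print("float32 x  256 dims:", X_reduced.nbytes // 2**20, "MiB")
```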
Relationship to product quantization in ANN literature¶
Classic product quantization (PQ), which splits each vector into sub-vectors and quantizes each via a learned codebook, is also vector quantization, and is the technique behind many disk-based ANN systems (DiskANN, FAISS with PQ). OpenSearch k-NN supports PQ among its quantization modes. The term "vector quantization" in product-engineering writing often leaves the specific mode implicit; if the post doesn't name it, don't assume.
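A minimal PQ sketch in numpy, with toy sizes (8 sub-vectors, 16 centroids each) and plain Lloyd iterations; real libraries such as FAISS implement this far more efficiently and typically use 256 centroids per sub-vector so each code fills one byte:

```python
import numpy as np

def train_pq(X, m=8, k=16, iters=10, seed=0):
    """Learn one k-means codebook per sub-vector block. X: (n, d) with d divisible by m."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    sub = d // m
    codebooks = []
    for j in range(m):
        block = X[:, j * sub:(j + 1) * sub]
        centers = block[rng.choice(n, size=k, replace=False)].copy()
        for _ in range(iters):                                   # plain Lloyd iterations
            dists = ((block[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(axis=1)
            for c in range(k):
                members = block[assign == c]
                if len(members):
                    centers[c] = members.mean(axis=0)
        codebooks.append(centers)
    return codebooks

def encode_pq(X, codebooks):
    """Replace each sub-vector with the index of its nearest centroid (one small code each)."""
    m, sub = len(codebooks), X.shape[1] // len(codebooks)
    codes = np.empty((X.shape[0], m), dtype=np.uint8)
    for j, centers in enumerate(codebooks):
        block = X[:, j * sub:(j + 1) * sub]
        dists = ((block[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        codes[:, j] = dists.argmin(axis=1)
    return codes

X = np.random.default_rng(1).standard_normal((2_000, 64)).astype(np.float32)
codebooks = train_pq(X)
codes = encode_pq(X, codebooks)
print(X.nbytes, "B ->", codes.nbytes, "B")   # 512000 -> 16000, a 32x reduction
```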
When to apply¶
- Memory-bound vector indexes at billion-scale. The canonical fit — RAM is expensive, per-vector bytes dominate.
- Products that can tolerate small recall drops. A 2% recall regression is often invisible to users when ranking is layered on top (lexical fusion, post-retrieval reranker, learned ranker).
- When ANN structure itself is already in use. You're already in the "ANN trades recall for latency" regime; quantization is the storage lever in the same trade space.
Don't apply when:
- Recall is an absolute product requirement (safety, legal discovery, certain scientific retrieval).
- Corpus is small enough that exact k-NN fits in RAM at float32. Don't pay a recall cost you don't need.
- Embedding model training/distribution is sensitive to quantization noise. Some embeddings (particularly short or dense ones) degrade more than others; always benchmark against the eval set.
Caveats¶
- Recall measurement is mandatory. "Small recall hit" in vendor docs is model- and corpus-specific. Offline-eval the quantization mode against the same labelled queries the product is actually judged on (see the sketch after this list).
- Index re-quantization is not free. Changing quantization mode requires reindexing the corpus (or at minimum the newly quantized portion). Plan rollouts.
- Quantization interacts with the distance metric. Asymmetric modes (query in float32, corpus in int8) recover some recall at the cost of query-side compute. Read the docs for the specific plugin / library.
- Naming overlap. Vector quantization in ML / CV literature can also refer to unrelated concepts (VQ-VAE in generative modeling). Context disambiguates.
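The recall-measurement sketch referenced above, assuming brute-force search over a toy corpus so the only variable is the quantization; in practice this runs against the production ANN index and the product's own labelled query set:

```python
import numpy as np

def topk(queries, corpus, k=10):
    """Exact inner-product top-k; stands in for the ANN search path in this sketch."""
    return np.argsort(-(queries @ corpus.T), axis=1)[:, :k]

def recall_at_k(candidate, reference):
    """Fraction of the full-precision neighbours still returned after quantization."""
    overlaps = [len(set(c) & set(r)) / len(r) for c, r in zip(candidate, reference)]
    return float(np.mean(overlaps))

rng = np.random.default_rng(0)
corpus = rng.standard_normal((50_000, 256)).astype(np.float32)
queries = rng.standard_normal((100, 256)).astype(np.float32)

# 8-bit scalar quantization of the corpus side only (asymmetric: queries stay float32).
lo, scale = corpus.min(), (corpus.max() - corpus.min()) / 255.0
corpus_q = np.round((corpus - lo) / scale).astype(np.uint8).astype(np.float32) * scale + lo

baseline = topk(queries, corpus)      # float32 neighbours as the reference set
quantized = topk(queries, corpus_q)   # neighbours after quantizing the corpus
print("recall@10:", recall_at_k(quantized, baseline))
```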
See also¶
- concepts/vector-embedding — what's being quantized.
- concepts/vector-similarity-search — the retrieval primitive quantization modifies.
- concepts/quantization — the more general numerical-representation compression concept (applied here to vectors, but also to model weights, activations, gradients).
- systems/amazon-opensearch-service — the k-NN plugin where Figma's quantization is applied.
- systems/figma-ai-search — canonical instance.
Seen in¶
- sources/2026-04-21-figma-the-infrastructure-behind-ai-search-in-figma — OpenSearch k-NN vector quantization deployed to reduce cluster memory footprint; named explicitly as a recall-for-memory trade.