
CONCEPT Cited by 1 source

Vector Quantization

Definition

Vector quantization (in the context of vector search) is the compression of embedding vectors from their full-precision representation (typically float32) to a smaller encoding (int8, 4-bit, 2-bit, 1-bit / binary, or product-quantization codebook indices), trading a small reduction in nearest-neighbor search recall for a substantial reduction in the memory and storage cost of the index.

(Source: sources/2026-04-21-figma-the-infrastructure-behind-ai-search-in-figma)

The cost driver it targets

A dense vector index's hot working set is dominated by the raw vectors themselves. For a 1024-dimension float32 embedding, each vector is 4 KB, so a billion-vector index carries 4 TB of vector payload alone, before the ANN structure's auxiliary memory. OpenSearch k-NN (HNSW by default) keeps vectors in memory for low-latency search, so the cluster RAM budget scales linearly with corpus size and dimensionality.
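
The arithmetic above can be sketched in a few lines (decimal TB; the 1,000,000,000 × 1024-dim sizing mirrors the example, and ANN graph overhead is ignored):

```python
# Back-of-envelope memory for the raw vector payload of an index
# at different bit widths. Ignores the ANN structure's overhead.
def index_bytes(n_vectors: int, dims: int, bits_per_dim: float) -> int:
    """Bytes needed to store the raw vector payload."""
    return int(n_vectors * dims * bits_per_dim / 8)

N, D = 1_000_000_000, 1024
for label, bits in [("float32", 32), ("int8", 8), ("4-bit", 4), ("1-bit", 1)]:
    tb = index_bytes(N, D, bits) / 1e12
    print(f"{label:>7}: {tb:.2f} TB")
```

At 32 bits per dimension this reproduces the ~4 TB payload figure; each halving of bits halves the payload.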

The quantization trade:

  • float32 → int8 (8-bit scalar quantization): ~4× memory reduction.
  • float32 → 4-bit: ~8× memory reduction.
  • float32 → 1-bit / binary: up to 32× memory reduction; recall degrades more noticeably.
  • Product Quantization (PQ): codebook-based, cuts memory even further; controlling the recall cost is more involved.

Each compression step costs some recall. The art is tuning to the recall floor the product tolerates.
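
The int8 row of the trade can be sketched in numpy (a minimal symmetric quantizer; production engines fit scales per segment and often quantize asymmetrically, so treat this as the shape of the trade, not an implementation):

```python
import numpy as np

# Symmetric int8 scalar quantization of one embedding.
rng = np.random.default_rng(0)
v = rng.standard_normal(1024).astype(np.float32)   # 4096 bytes

scale = np.abs(v).max() / 127.0                    # map [-max, max] -> [-127, 127]
q = np.clip(np.round(v / scale), -127, 127).astype(np.int8)  # 1024 bytes
v_hat = q.astype(np.float32) * scale               # decode for comparison

mem_ratio = v.nbytes / q.nbytes                    # 4x smaller
err = np.abs(v - v_hat).max()                      # rounding error, bounded by scale/2
print(mem_ratio, err)
```

The per-coordinate error is bounded by half the scale step, which is what keeps the recall hit small when the embedding's dynamic range is well behaved.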

OpenSearch k-NN framing

OpenSearch's k-NN plugin exposes quantization via its knn-vector-quantization feature. By default, k-NN stores each dimension as a float32; enabling quantization swaps that storage layout for a smaller representation at indexing time, and the search path either decodes at query time or operates directly in quantized space, depending on the mode.

"By default, OpenSearch's kNN plugin represents each element of the embedding as a four byte float, but vector quantization is a technique to compress the size of embeddings to reduce the memory required to store and search them, at the cost of a small reduction in nearest neighbor search accuracy." (Figma Engineering, 2026-04-21)
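
As a concrete sketch, an index body enabling scalar quantization might look like the following (faiss engine with the fp16 "sq" encoder, one of the available modes; field names follow OpenSearch's k-NN mapping docs, but verify the exact parameters against your plugin version):

```python
import json

# Hypothetical OpenSearch index body: knn_vector field with faiss
# scalar quantization (fp16 encoder, ~2x memory reduction).
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "l2",
                    "parameters": {
                        "encoder": {"name": "sq", "parameters": {"type": "fp16"}}
                    },
                },
            }
        }
    },
}
print(json.dumps(index_body, indent=2))
```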

Figma AI Search instance

OpenSearch was the second-biggest cost driver in Figma's AI Search infrastructure, after the frame-enumeration-and-thumbnailing step. Vector quantization was one of two index-shrinking optimizations Figma deployed:

  1. Pre-quantization scope reduction (see patterns/selective-indexing-heuristics): remove drafts, within-file duplicates, and unmodified file copies — cuts the index in half.
  2. Vector quantization for the embeddings that survive (1): shrinks each remaining vector in memory.

Figma does not disclose which quantization mode (scalar int8, 4-bit, binary, PQ) or the resulting recall impact.

Quantization vs dimension reduction

Worth distinguishing from an adjacent optimization:

                   Quantization                                        Dimension reduction (PCA / learned projection)
  What changes     Bits per dimension                                  Number of dimensions
  Who decides      Index-time policy                                   Model / training-time choice
  Recall impact    Small-to-moderate, roughly predictable per bit      Heavily dependent on the original model's redundancy
  Where applied    In the vector DB                                    Before insertion into the DB

Both reduce per-vector bytes. Quantization leaves the geometry alone and lossy-compresses each coordinate; dim reduction projects the geometry into a lower-rank space.
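
A toy contrast of the two levers (the projection matrix here is random, standing in for PCA purely to show the shape; both paths land on the same byte count by different routes):

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.standard_normal(1024).astype(np.float32)          # 4096 bytes

# Quantization: same 1024 coordinates, fewer bits each.
scale = np.abs(v).max() / 127.0
q = np.round(v / scale).astype(np.int8)                   # 1024 bytes, 1024 dims

# Dimension reduction: fewer coordinates, full precision each.
P = rng.standard_normal((256, 1024)).astype(np.float32) / np.sqrt(256)
r = P @ v                                                  # 1024 bytes, 256 dims

print(q.nbytes, q.shape, r.nbytes, r.shape)
```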

Relationship to product quantization in ANN literature

Classic product quantization (PQ) — splitting each vector into sub-vectors and quantizing each via a learned codebook — is itself a form of vector quantization, and is the technique behind many disk-based ANN systems (DiskANN, FAISS with PQ). OpenSearch k-NN supports PQ among its quantization modes. The term "vector quantization" in product-engineering writing often leaves the specific mode implicit; if the post doesn't name it, don't assume.
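
The encode/decode shape of PQ, in miniature (real PQ learns the codebooks with k-means on training vectors; the random codebooks here only keep the sketch self-contained):

```python
import numpy as np

# Toy PQ: 1024 dims -> M=8 sub-vectors, each mapped to the nearest of
# K=256 codewords, so a vector stores as 8 uint8 codes (512x smaller
# than float32).
rng = np.random.default_rng(2)
M, K, D = 8, 256, 1024
sub = D // M
codebooks = rng.standard_normal((M, K, sub)).astype(np.float32)

def pq_encode(v):
    parts = v.reshape(M, sub)
    dists = ((codebooks - parts[:, None, :]) ** 2).sum(axis=2)  # (M, K)
    return dists.argmin(axis=1).astype(np.uint8)                # nearest codeword per part

def pq_decode(codes):
    return np.concatenate([codebooks[m, codes[m]] for m in range(M)])

v = rng.standard_normal(D).astype(np.float32)
codes = pq_encode(v)
print(codes.nbytes, pq_decode(codes).shape)                     # 8 bytes vs 4096
```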

When to apply

  • Memory-bound vector indexes at billion-scale. The canonical fit — RAM is expensive, per-vector bytes dominate.
  • Products that can tolerate small recall drops. A 2% recall regression is often invisible to users when ranking is layered on top (lexical fusion, post-retrieval reranker, learned ranker).
  • When ANN structure itself is already in use. You're already in the "ANN trades recall for latency" regime; quantization is the storage lever in the same trade space.

Don't apply when:

  • Recall is an absolute product requirement (safety, legal discovery, certain scientific retrieval).
  • Corpus is small enough that exact k-NN fits in RAM at float32. Don't pay a recall cost you don't need.
  • Embedding model training/distribution is sensitive to quantization noise. Some embeddings (particularly short or dense ones) degrade more than others; always bench against the eval set.
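
The benchmarking the last point calls for can be sketched as recall@k of the quantized index against exact float32 search (synthetic data here; substitute the product's own corpus, queries, and quantizer):

```python
import numpy as np

rng = np.random.default_rng(3)
corpus = rng.standard_normal((2000, 64)).astype(np.float32)
queries = rng.standard_normal((50, 64)).astype(np.float32)
k = 10

def topk(qs, docs, k):
    scores = qs @ docs.T                      # inner-product similarity
    return np.argsort(-scores, axis=1)[:, :k]

# int8-quantize the corpus, then compare top-k overlap with float32.
scale = np.abs(corpus).max() / 127.0
corpus_q = np.round(corpus / scale).astype(np.int8)

exact = topk(queries, corpus, k)
approx = topk(queries, corpus_q.astype(np.float32) * scale, k)
recall = np.mean([len(set(a) & set(b)) / k for a, b in zip(exact, approx)])
print(f"recall@{k}: {recall:.3f}")
```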

Caveats

  • Recall measurement is mandatory. "Small recall hit" in vendor docs is model- and corpus-specific. Offline-eval the quantization mode against the same labelled queries the product is actually judged on.
  • Index re-quantization is not free. Changing quantization mode requires reindexing the affected portion of the corpus, if not all of it. Plan rollouts accordingly.
  • Quantization interacts with distance metric. Asymmetric modes (query in float32, corpus in int8) recover some recall at the cost of query-side compute. Read the docs for the specific plugin / library.
  • Naming overlap. Vector quantization in ML / CV literature can also refer to unrelated concepts (VQ-VAE in generative modeling). Context disambiguates.
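
The asymmetric mode mentioned above, sketched in numpy (real engines fold the scale into the scoring kernel rather than materializing a float copy; this only shows the scoring math):

```python
import numpy as np

# Asymmetric scoring: corpus stored as int8, query kept in float32.
rng = np.random.default_rng(4)
corpus = rng.standard_normal((1000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

scale = np.abs(corpus).max() / 127.0
corpus_q = np.round(corpus / scale).astype(np.int8)

# Symmetric mode would quantize the query too; asymmetric keeps its
# full precision and rescales once per score.
scores = (corpus_q.astype(np.float32) @ query) * scale
exact = corpus @ query
print(np.corrcoef(scores, exact)[0, 1])
```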
