Scalar quantization
Definition
Scalar quantization is a lossy compression technique for high-dimensional vector
data that maps each float element independently to a lower-precision numeric
representation (typically int8 or float16), reducing the memory footprint of a
vector index at a small cost in recall. It is "scalar" because each dimension is
quantized independently — contrast with product quantization (PQ) or other
block-based schemes that quantize groups of dimensions jointly.
See also the broader vector quantization concept for the family of techniques; scalar quantization is the simplest member.
Shape
For a float32 vector x of dimension d, scalar quantization picks a per-dimension
(or global) scale/zero-point pair and maps each x[i] to an int8 (or int4) via

    q[i] = clamp(round(x[i] / scale) + zero_point, -128, 127)

At query time, the query vector is quantized the same way and similarity (dot product,
cosine) is computed in the quantized space. For int8 quantization the memory
footprint drops 4× vs. float32; the similarity computation runs on narrower
integer operations, which also benefits from
SIMD acceleration (more lanes per register when
elements are narrower).
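A minimal numpy sketch of the round trip, using the symmetric variant (zero_point = 0) with a max-abs calibration; real implementations such as Lucene's derive the range from quantiles and carry correction terms, so the names and calibration here are illustrative only:

```python
import numpy as np

def calibrate(vectors: np.ndarray) -> float:
    # Global scale for symmetric int8 quantization (zero_point = 0).
    # Sketch only: max-abs calibration; production systems often use quantiles.
    return float(np.abs(vectors).max()) / 127.0

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 128)).astype(np.float32)  # float32 index: 512 KB
scale = calibrate(docs)
q_docs = quantize(docs, scale)                              # int8 index: 128 KB (4x)

query = rng.standard_normal(128).astype(np.float32)
q_query = quantize(query, scale)

# Widen to int32 before the dot product so the accumulator cannot overflow;
# SIMD int8 dot-product instructions do the same widening in hardware.
scores = q_docs.astype(np.int32) @ q_query.astype(np.int32)
approx = scores * scale * scale     # rescale to approximate float dot products
exact = docs @ query
print(np.corrcoef(approx, exact)[0, 1])  # near 1: ranking is largely preserved
```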
Tradeoff
- Memory: 4× reduction for int8, 2× for float16. On large vector indices, this is often the difference between fitting in RAM (where HNSW is viable) and not.
- Recall: a small, measurable drop. The usual deployment pattern is quantize-then-rerank: a coarse scan over the quantized index produces top-K* > K candidates, then full-precision rescoring picks the final K (see the sketch after this list).
- Latency: usually neutral or faster, because the smaller per-vector size improves memory bandwidth and cache behaviour, which can offset the de-quantization cost of reranking.
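A sketch of that quantize-then-rerank pattern under the same symmetric-int8 assumptions as above; the brute-force scan stands in for a real ANN index, and k/k_star and all other names are illustrative:

```python
import numpy as np

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

rng = np.random.default_rng(1)
docs = rng.standard_normal((10_000, 128)).astype(np.float32)
scale = float(np.abs(docs).max()) / 127.0
q_docs = quantize(docs, scale)

def search(query: np.ndarray, k: int = 10, k_star: int = 100) -> np.ndarray:
    # Stage 1: coarse scan over the int8 index (where the memory saving lives).
    coarse = q_docs.astype(np.int32) @ quantize(query, scale).astype(np.int32)
    candidates = np.argpartition(coarse, -k_star)[-k_star:]  # top-K* > K, unordered
    # Stage 2: full-precision rescoring of only K* vectors picks the final K.
    fine = docs[candidates] @ query
    return candidates[np.argsort(fine)[::-1][:k]]

query = rng.standard_normal(128).astype(np.float32)
print(search(query))  # top-10 doc ids after full-precision reranking
```

Only K* full-precision vectors are touched in stage 2, which is why the small recall drop from the coarse stage is cheap to claw back.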
Canonical instance: Lucene 10 / Yelp Nrtsearch
Lucene 10 added scalar-quantization support to its HNSW vector-search implementation. Yelp Nrtsearch 1.0.0 exposes it as a configurable feature:
"Float vectors may be configured to use scalar quantized values for search, allowing a tradeoff between accuracy and memory usage" (Source: sources/2025-05-08-yelp-nrtsearch-100-incremental-backups-lucene-10)
Why it pairs with HNSW
HNSW is RAM-bound — the graph must fit in memory for good performance. Scalar quantization is the lightest-weight lever that moves the RAM ceiling up by a constant factor without changing the index structure or the similarity function. For corpora that almost fit, it's the difference between HNSW being viable and needing to reach for SSD-resident alternatives like DiskANN or SPANN.
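Back-of-envelope arithmetic for that constant factor, with a hypothetical corpus size and ignoring the graph's own overhead:

```python
n, d = 100_000_000, 768           # hypothetical: 100M vectors, 768 dimensions
float32_gb = n * d * 4 / 1e9      # ~307 GB of raw vectors: may not fit in RAM
int8_gb = n * d * 1 / 1e9         # ~77 GB after int8 scalar quantization
print(f"{float32_gb:.0f} GB -> {int8_gb:.0f} GB")
```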
Seen in
- sources/2025-05-08-yelp-nrtsearch-100-incremental-backups-lucene-10 — Nrtsearch 1.0.0 exposes Lucene 10's scalar-quantization support for float vectors as a per-index accuracy/memory tradeoff.
Related
- concepts/vector-quantization — the broader family (product, residual, etc.)
- concepts/hnsw-index — the in-memory vector index scalar quantization most commonly pairs with
- concepts/vector-similarity-search
- concepts/simd-vectorization — quantized integer ops benefit from SIMD
- systems/hnsw
- systems/lucene