Scalar quantization

Definition

Scalar quantization is a lossy compression technique for high-dimensional vector data that maps each float element independently to a lower-precision numeric representation (typically int8 or float16), reducing the memory footprint of a vector index at a small cost in recall. It is "scalar" because each dimension is quantized independently — contrast with product quantization (PQ) or other block-based schemes that quantize groups of dimensions jointly.

See also the broader vector quantization concept for the family of techniques; scalar quantization is the simplest member.

Shape

For a float32 vector x of dimension d, scalar quantization picks a scale and zero-point (per dimension, or one global pair for the whole index) and maps each x[i] to an int8 (or int4) via

q[i] = clamp(round((x[i] - zero_point) / scale), -128, 127)

At query time, the query vector is quantized the same way and similarity (dot product, cosine) is computed in the quantized space. For int8 quantization the memory footprint drops 4× vs. float32; the similarity computation runs on narrower integer operations, which also benefits from SIMD acceleration (more lanes per register when elements are narrower).
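
A minimal NumPy sketch of this mapping, assuming a single global scale/zero-point calibrated from the corpus min/max (the calibration here is illustrative only; production systems typically derive it from quantiles over a sample of the corpus):

    import numpy as np

    def quantize(x, scale, zero_point):
        # Shift, scale, round, then clamp into the signed int8 range.
        q = np.round((x - zero_point) / scale)
        return np.clip(q, -128, 127).astype(np.int8)

    # Illustrative calibration: one global scale/zero-point from corpus min/max.
    rng = np.random.default_rng(0)
    corpus = rng.standard_normal((1000, 768)).astype(np.float32)
    lo, hi = float(corpus.min()), float(corpus.max())
    zero_point = (lo + hi) / 2.0   # center of the observed value range
    scale = (hi - lo) / 255.0      # spread that range across 256 int8 levels

    q_corpus = quantize(corpus, scale, zero_point)   # 4x smaller than float32
    query = rng.standard_normal(768).astype(np.float32)
    q_query = quantize(query, scale, zero_point)     # query quantized the same way

    # Integer dot products; widen to int32 so the accumulation cannot overflow.
    scores = q_corpus.astype(np.int32) @ q_query.astype(np.int32)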

Tradeoff

  • Memory — 4× reduction for int8, 2× for float16. On large vector indices, this is often the difference between fitting in RAM (where HNSW is viable) and not.
  • Recall — a small, measurable drop. The usual deployment pattern is quantize-then-rerank: a coarse scan over the quantized index produces K* > K candidates, then full-precision rescoring picks the final K (see the sketch after this list).
  • Latency — usually neutral or faster: the smaller per-vector size reduces memory-bandwidth pressure and improves cache behaviour, which can offset the de-quantize/rescore cost of the rerank step.
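
A sketch of the quantize-then-rerank pattern from the recall bullet, reusing the quantize helper and arrays from the sketch above; the cutoffs K* = 100 and K = 10 are arbitrary illustrative values:

    import numpy as np

    def search_with_rerank(q_corpus, corpus_f32, q_query, query_f32, k=10, k_star=100):
        # Coarse pass: cheap integer scores over the entire quantized index.
        coarse = q_corpus.astype(np.int32) @ q_query.astype(np.int32)
        candidates = np.argpartition(-coarse, k_star)[:k_star]  # top-K* by coarse score
        # Rerank: full-precision rescoring of the K* survivors only.
        exact = corpus_f32[candidates] @ query_f32
        return candidates[np.argsort(-exact)[:k]]               # final top-K

    top_k = search_with_rerank(q_corpus, corpus, q_query, query)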

Canonical instance: Lucene 10 / Yelp Nrtsearch

Lucene 10 ships scalar-quantization support in its HNSW vector-search implementation. Yelp Nrtsearch 1.0.0 exposes it as a configurable feature:

"Float vectors may be configured to use scalar quantized values for search, allowing a tradeoff between accuracy and memory usage" (Source: sources/2025-05-08-yelp-nrtsearch-100-incremental-backups-lucene-10)

Why it pairs with HNSW

HNSW is RAM-bound — the graph must fit in memory for good performance. Scalar quantization is the lightest-weight lever that moves the RAM ceiling up by a constant factor without changing the index structure or the similarity function. For corpora that almost fit, it's the difference between HNSW being viable and needing to reach for SSD-resident alternatives like DiskANN or SPANN.
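
To make the constant factor concrete, a back-of-envelope example with made-up but representative numbers:

    n, d = 100_000_000, 768    # hypothetical corpus: 100M vectors, 768 dims
    print(n * d * 4 / 2**30)   # float32: ~286 GiB of raw vector data
    print(n * d * 1 / 2**30)   # int8: ~72 GiB for the same vectors
    # The HNSW graph links add memory on top that quantization does not touch;
    # the reduction applies to the vector payload only.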
