Scalar quantization

Definition

Scalar quantization is a lossy compression technique for high-dimensional vector data that maps each float element independently to a lower-precision numeric representation (typically int8 or float16), reducing the memory footprint of a vector index at a small cost in recall. It is "scalar" because each dimension is quantized independently — contrast with product quantization (PQ) or other block-based schemes that quantize groups of dimensions jointly.

See also the broader vector quantization concept for the family of techniques; scalar quantization is the simplest member.

Shape

For a float32 vector x of dimension d, scalar quantization picks a scale and zero-point (per dimension, or one global pair for the whole index) and maps each x[i] to an int8 (or int4) via

q[i] = clamp(round((x[i] - zero_point) / scale), -128, 127)

At query time, the query vector is quantized the same way and similarity (dot product, cosine) is computed in the quantized space. For int8 quantization the memory footprint drops 4× vs. float32; the similarity computation runs on narrower integer operations, which also benefits from SIMD acceleration (more lanes per register when elements are narrower).
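
A minimal NumPy sketch of this mapping, assuming a single global scale/zero-point calibrated from the corpus min/max (the calibration here is illustrative only; production systems typically derive it from quantiles over a sample of the corpus):

    import numpy as np

    def quantize(x, scale, zero_point):
        # Shift, scale, round, then clamp into the signed int8 range.
        q = np.round((x - zero_point) / scale)
        return np.clip(q, -128, 127).astype(np.int8)

    # Illustrative calibration: one global scale/zero-point from corpus min/max.
    rng = np.random.default_rng(0)
    corpus = rng.standard_normal((1000, 768)).astype(np.float32)
    lo, hi = float(corpus.min()), float(corpus.max())
    zero_point = (lo + hi) / 2.0   # center of the observed value range
    scale = (hi - lo) / 255.0      # spread that range across 256 int8 levels

    q_corpus = quantize(corpus, scale, zero_point)   # 4x smaller than float32
    query = rng.standard_normal(768).astype(np.float32)
    q_query = quantize(query, scale, zero_point)     # query quantized the same way

    # Integer dot products; widen to int32 so the accumulation cannot overflow.
    scores = q_corpus.astype(np.int32) @ q_query.astype(np.int32)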

Tradeoff

  • Memory — 4× reduction for int8, 2× for float16. On large vector indices, this is often the difference between fitting in RAM (where HNSW is viable) and not.
  • Recall — a small, measurable drop. The usual deployment pattern is quantize-then-rerank: a coarse scan over the quantized index produces K* > K candidates, then full-precision rescoring picks the final K (see the sketch after this list).
  • Latency — usually neutral or faster: the smaller per-vector size reduces memory-bandwidth pressure and improves cache behaviour, which can offset the de-quantize/rescore cost of the rerank step.
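
A sketch of the quantize-then-rerank pattern from the recall bullet, reusing the quantize helper and arrays from the sketch above; the cutoffs K* = 100 and K = 10 are arbitrary illustrative values:

    import numpy as np

    def search_with_rerank(q_corpus, corpus_f32, q_query, query_f32, k=10, k_star=100):
        # Coarse pass: cheap integer scores over the entire quantized index.
        coarse = q_corpus.astype(np.int32) @ q_query.astype(np.int32)
        candidates = np.argpartition(-coarse, k_star)[:k_star]  # top-K* by coarse score
        # Rerank: full-precision rescoring of the K* survivors only.
        exact = corpus_f32[candidates] @ query_f32
        return candidates[np.argsort(-exact)[:k]]               # final top-K

    top_k = search_with_rerank(q_corpus, corpus, q_query, query)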

Canonical instance: Lucene 10 / Yelp Nrtsearch

Lucene 10 ships scalar-quantization support in its HNSW vector-search implementation. Yelp Nrtsearch 1.0.0 exposes it as a configurable feature:

"Float vectors may be configured to use scalar quantized values for search, allowing a tradeoff between accuracy and memory usage" (Source: sources/2025-05-08-yelp-nrtsearch-100-incremental-backups-lucene-10)

Why it pairs with HNSW

HNSW is RAM-bound — the graph must fit in memory for good performance. Scalar quantization is the lightest-weight lever that moves the RAM ceiling up by a constant factor without changing the index structure or the similarity function. For corpora that almost fit, it's the difference between HNSW being viable and needing to reach for SSD-resident alternatives like DiskANN or SPANN.
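
To make the constant factor concrete, a back-of-envelope example with made-up but representative numbers:

    n, d = 100_000_000, 768    # hypothetical corpus: 100M vectors, 768 dims
    print(n * d * 4 / 2**30)   # float32: ~286 GiB of raw vector data
    print(n * d * 1 / 2**30)   # int8: ~72 GiB for the same vectors
    # The HNSW graph links add memory on top that quantization does not touch;
    # the reduction applies to the vector payload only.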
