CONCEPT Cited by 2 sources

ANN (approximate nearest neighbor) index

Definition

An ANN (approximate nearest neighbor) index is a data structure + serving system that, given a query vector, returns the top-K closest item vectors from a large pre-indexed corpus — approximately, not exactly, in exchange for sub-linear search cost. It is the serving artifact that makes embedding-based retrieval affordable at production scale.

In a two-tower retrieval / ranking system, the item tower's embeddings are written into an ANN index offline, and at request time the query tower's embedding is used to query the index for the top-K most similar items (by dot product, cosine, or L2 distance).
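As a point of reference, the exact search that an ANN index approximates can be sketched in a few lines of plain Python. This is an illustrative brute-force baseline, not any production system's code; the item IDs, vectors, and the `exact_top_k` helper are assumptions made up for the example.

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def neg_l2(a, b):
    # Negated so that "larger is better" holds for every scoring function.
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(query, items, k, score=dot):
    """Score every item against the query: O(N*D) work per request."""
    return heapq.nlargest(k, items, key=lambda item_id: score(query, items[item_id]))


# Hypothetical item-tower embeddings, written "offline".
items = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.7, 0.7],
    "d": [-1.0, 0.0],
}
# Hypothetical query-tower embedding, computed at request time.
query = [1.0, 0.1]
top2 = exact_top_k(query, items, k=2)
```

An ANN index aims to return the same IDs as `exact_top_k` most of the time while touching only a small fraction of `items`.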

SilverTorch face — ANN as part of the model itself (2026-05-26)

Meta's SilverTorch post (Source: sources/2026-05-26-meta-silvertorch-index-as-model-a-new-retrieval-paradigm-for-recommendation-systems) inverts the serving-artifact framing introduced below. Under the Index as Model paradigm, the ANN index "becomes a tensor inside the model" — there is no longer a separate serving artifact built on a slower cadence than the model. The index is the model.

Verbatim:

"SilverTorch reimplements ANN search as part of the model itself. It stores item embeddings in a compact Int8 format, which cuts memory use roughly in half compared to typical 16 bits, and runs search with a fused GPU kernel. ... The algorithm supports large top-k and probe counts; in practice, we observe no retrieval recall loss with 64 probes and top-2048."

This face produces several departures from the canonical wiki ANN-index framing:

  • Top-K capability: "hundreds of thousands" (vs Faiss-GPU's 2,048 ceiling).
  • Per-kernel speedup: 2.2–14.7× faster than Faiss-GPU (the fused Int8 ANN primitive).
  • Index freshness mechanism: streaming in-place tensor mutation instead of rebuild-and-swap. "With index as a model module, maintaining index freshness equates to updating the model weights of a neural network in production, at scale, without taking the model offline."
  • No more rebuild-cadence-vs-model-cadence skew — see "Update cadence + version skew" below for the prior failure mode this resolves.

The patterns/gpu-native-retrieval-primitive-redesign pattern names the design philosophy (redesign the primitive around GPU memory layout + tensor execution, don't port the CPU-era version). SilverTorch supersedes Faiss-GPU inside Meta's recsys retrieval surfaces, not Faiss-the-library across all Meta search/retrieval workloads — systems/meta-groups-scoped-search continues to use Faiss as the production ANN substrate.
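To make the quoted memory claim concrete, here is a minimal sketch of symmetric per-vector Int8 quantization in plain Python. The source does not describe SilverTorch's actual quantization scheme or kernel, so the scaling rule, the `quantize_int8` helper, and the 4-byte per-vector scale are all assumptions chosen for illustration.

```python
def quantize_int8(vec):
    """Map floats to integers in [-127, 127] plus one float scale per vector."""
    # For an all-zero vector the ratio is 0.0, so fall back to a scale of 1.0.
    scale = max(abs(x) for x in vec) / 127.0 or 1.0
    return [round(x / scale) for x in vec], scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def bytes_per_vector(dim, bits_per_dim, overhead_bytes=0):
    return dim * bits_per_dim // 8 + overhead_bytes


vec = [0.5, -1.27, 0.01, 1.0]
codes, scale = quantize_int8(vec)
restored = dequantize(codes, scale)
max_error = max(abs(x - y) for x, y in zip(vec, restored))

# Assumed 128-dimensional embeddings.
fp16_bytes = bytes_per_vector(128, 16)                   # 256
int8_bytes = bytes_per_vector(128, 8, overhead_bytes=4)  # 132 (codes + one float32 scale)
```

Under these assumptions the Int8 layout needs 132 bytes per vector against 256 for 16-bit storage, which matches the "roughly in half" figure, and the reconstruction error per dimension is bounded by half the scale.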

Why it's used

Exact k-NN over millions to billions of vectors costs O(N·D) per query, which is intractable at the request volumes typical for ads ranking, search, or recommendation. ANN indices accept a small recall loss (usually targeting ≥95% recall) in exchange for sub-linear search cost, around O(log N) for graph-based indices, which makes retrieval orders of magnitude faster.
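Recall in this sense is usually measured against exact search on a sample of queries. A minimal sketch of that measurement follows; `recall_at_k` is a hypothetical helper name and the ID lists are invented for the example.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-K that the approximate result also returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)


exact_ids = [11, 42, 7, 93]    # ground truth from a brute-force scan
approx_ids = [42, 11, 93, 5]   # what the ANN index returned for the same query
recall = recall_at_k(approx_ids, exact_ids)  # 3 of the 4 true neighbors were found
```

Order inside the top-K is ignored here; only membership counts, which is the usual convention when the results feed a downstream ranker.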

Typical algorithmic families:

  • HNSW (Hierarchical Navigable Small World graph) — graph-based; state-of-the-art recall/latency; popular in practice (Lucene, FAISS, Vespa, Qdrant).
  • IVF / IVFPQ (inverted file + product quantization) — partitioning + compression; used in FAISS, Milvus.
  • Annoy (Spotify's random projection trees) — read-only, simple.
  • ScaNN (Google) — learned quantization + pruning.
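The partitioning idea behind the IVF family can be sketched without any library. This is a toy illustration only: real IVF indices learn centroids with k-means and often add product quantization, whereas here the centroids are fixed by hand, and every name (`build_ivf`, `ivf_search`, `nprobe`, the sample vectors) is an assumption for the example rather than any library's API.

```python
import heapq


def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(items, centroids):
    """Assign each item ID to the posting list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for item_id, vec in items.items():
        nearest = min(range(len(centroids)), key=lambda c: sq_l2(vec, centroids[c]))
        lists[nearest].append(item_id)
    return lists


def ivf_search(query, items, centroids, lists, k, nprobe):
    """Scan only the nprobe posting lists whose centroids are closest to the query."""
    probed = heapq.nsmallest(
        nprobe, range(len(centroids)), key=lambda c: sq_l2(query, centroids[c])
    )
    candidates = [item_id for c in probed for item_id in lists[c]]
    return heapq.nsmallest(k, candidates, key=lambda i: sq_l2(query, items[i]))


centroids = [[0.0, 0.0], [10.0, 0.0]]
items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "x": [5.1, 0.0], "y": [9.0, 1.0]}
lists = build_ivf(items, centroids)

query = [4.8, 0.0]  # sits just on the first centroid's side of the cell boundary
one_probe = ivf_search(query, items, centroids, lists, k=1, nprobe=1)
two_probes = ivf_search(query, items, centroids, lists, k=1, nprobe=2)
```

With one probe the search returns `a` and misses the true nearest neighbor `x`, which lives in the neighboring cell; with two probes it finds `x`. That is the recall-versus-cost knob in miniature: more probes scan more of the corpus and recover more of the exact result.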

Role in production recommendation systems

An ANN index is the serving artifact for item embeddings in production recommendation / ads / search systems. Candidates flow through it at several points in the funnel:

  • Retrieval — generate candidate set from billions of items.
  • Early ranking (e.g., Pinterest L1) — narrow further under tight latency before expensive downstream ranking.
  • Similar-item / related-item surfaces — direct user-facing applications of k-NN.

The serving-artifact distinction

A crucial production-engineering point, central to Pinterest's 2026-02-27 online-offline (O/O) discrepancy retrospective (sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr):

"It's not enough for features to exist in training logs or the Feature Store — they also need to be present in the serving artifacts (like ANN indices) that L1 actually uses to serve traffic."

The ANN index is built from a different feature pipeline than the one the model trained on, and often a different pipeline than the L2 Feature Store that downstream stages consume. A feature that's in training logs + the Feature Store but never onboarded into the ANN-index build path is effectively invisible to any stage that reads from that index — causing silent online-offline discrepancy.
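One way to catch this class of gap is a coverage check that compares the features a model expects with what each serving artifact actually carries. The sketch below is hypothetical: the feature names, the artifact names, and the `missing_features` helper are invented for illustration and are not taken from the Pinterest post.

```python
def missing_features(model_features, serving_artifacts):
    """For each serving artifact, list the model features it does not carry."""
    required = set(model_features)
    return {
        name: sorted(required - set(available))
        for name, available in serving_artifacts.items()
    }


# Hypothetical feature inventory.
model_features = ["pin_embedding", "advertiser_id", "cvr_7d"]
serving_artifacts = {
    "feature_store": ["pin_embedding", "advertiser_id", "cvr_7d"],
    "ann_index": ["pin_embedding", "advertiser_id"],
}
gaps = missing_features(model_features, serving_artifacts)
```

Here `cvr_7d` is present in the Feature Store but absent from the ANN-index build path, so any stage served from the index would silently run without it.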

Update cadence + version skew

ANN indices are typically rebuilt on a cadence much slower than model-release cadence: hourly snapshots for streaming enrichment, multi-day full rebuilds on large tiers at Pinterest scale. This means:

  • The index holds a mix of embedding versions at any moment.
  • Query-side embeddings (computed at request time by the live query tower) refresh instantly on model rollout.
  • Item-side embeddings (which must propagate through snapshot + rebuild + deploy) lag by hours to days.

This structural cadence mismatch produces embedding version skew, a specific cause of O/O discrepancy in two-tower systems. Pinterest mitigates by favoring batch embedding inference for large tiers so each rebuild uses a single consistent checkpoint.
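Version skew can be made measurable if each indexed embedding is tagged with the checkpoint that produced it. The sketch below assumes such a tag exists; the `checkpoint` field, the version labels, and both helper functions are hypothetical.

```python
from collections import Counter


def version_mix(index_entries):
    """Count how many indexed items were embedded by each checkpoint."""
    return Counter(entry["checkpoint"] for entry in index_entries)


def skew_fraction(index_entries, query_checkpoint):
    """Share of indexed items embedded by a different checkpoint than the query tower."""
    if not index_entries:
        return 0.0
    stale = sum(1 for e in index_entries if e["checkpoint"] != query_checkpoint)
    return stale / len(index_entries)


# Hypothetical index contents shortly after a model rollout to checkpoint v8.
index_entries = [
    {"item_id": "pin-1", "checkpoint": "v7"},
    {"item_id": "pin-2", "checkpoint": "v7"},
    {"item_id": "pin-3", "checkpoint": "v8"},
    {"item_id": "pin-4", "checkpoint": "v6"},
]
mix = version_mix(index_entries)
skew = skew_fraction(index_entries, query_checkpoint="v8")
```

A rebuild from a single consistent checkpoint, as in the batch-inference mitigation described above, would collapse `mix` to one key, although the index could still lag the query tower until the next rebuild.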

Design axes

  • Recall target — how close to exact k-NN the index must come; drives algorithm + parameter choice.
  • Latency budget — how much query-time compute is acceptable.
  • Build-time budget — how quickly the index can be rebuilt + deployed; caps refresh cadence.
  • Memory footprint — HNSW graphs are memory-hungry; PQ-style indices give up some recall to shrink memory.
  • Update pattern — streaming upserts vs batch rebuilds.
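The memory-footprint axis is the easiest one to estimate on paper. The sketch below counts only raw vector storage for a flat float32 layout versus product-quantization codes; it ignores codebooks, graph links, and item IDs, and the corpus size and parameters are assumed values rather than figures from any cited source.

```python
def flat_bytes(num_vectors, dim, bytes_per_dim=4):
    """Uncompressed storage: one float32 per dimension by default."""
    return num_vectors * dim * bytes_per_dim


def pq_bytes(num_vectors, num_subquantizers, bits_per_code=8):
    """Product-quantized storage: one small code per sub-vector."""
    return num_vectors * num_subquantizers * bits_per_code // 8


n, d = 100_000_000, 128            # assumed corpus: 100M items, 128 dimensions
flat = flat_bytes(n, d)            # 51_200_000_000 bytes
compressed = pq_bytes(n, 16)       # 1_600_000_000 bytes with 16 one-byte codes
ratio = flat // compressed         # 32
```

A 32× reduction of this kind is what makes a large tier fit in memory, and the recall lost to the coarser representation is the price paid for it.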

Seen in

- sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr — Pinterest L1 ranking uses ANN index for Pin embeddings; post documents how features missing from the ANN-index build path (distinct from L2 Feature Store) caused silent training-serving gap; index build + deploy "can span days" on large tiers, driving embedding version skew.

— canonical wiki statement of the structural rejection of pure-graph ANN families (HNSW + DiskANN) as the index inside a relational engine, and the structural adoption of the SPANN + SPFresh hybrid tree + graph family instead. Adds four new ANN-index subtypes to the wiki: concepts/hnsw-index, concepts/diskann-index, concepts/spann-index, concepts/spfresh-index; two new architectural shape concepts: concepts/transactional-vector-index + concepts/incremental-vector-index; and two new patterns: patterns/hybrid-tree-graph-ann-index + patterns/vector-index-inside-storage-engine.
