
ANN (approximate nearest neighbor) index

Definition

An ANN (approximate nearest neighbor) index is a data structure + serving system that, given a query vector, returns the top-K closest item vectors from a large pre-indexed corpus — approximately, not exactly, in exchange for sub-linear search cost. It is the serving artifact that makes embedding-based retrieval affordable at production scale.

In a two-tower retrieval / ranking system, the item tower's embeddings are written into an ANN index offline, and at request time the query tower's embedding is used to query the index for the top-K most similar items (by dot product, cosine, or L2 distance).

Why it's used

Exact k-NN over millions to billions of vectors costs O(N·D) per query, which is intractable at the request volumes typical of ads ranking, search, or recommendation. ANN indices accept a bounded loss in recall (typically targeting ≥95% of the true top-K) in exchange for sub-linear query time, often O(log N), yielding orders-of-magnitude faster retrieval.
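
For contrast, the exact search that an ANN index approximates fits in a few lines. A brute-force sketch in pure Python (toy sizes, illustrative names, not production code):

```python
import heapq
import random

def exact_top_k(query, items, k):
    """Brute-force exact k-NN by dot product: O(N*D) per query,
    the cost an ANN index avoids paying at request time."""
    scored = ((sum(q * x for q, x in zip(query, item)), idx)
              for idx, item in enumerate(items))
    return heapq.nlargest(k, scored)  # [(score, item_id), ...], best first

random.seed(0)
D, N = 8, 1000  # toy sizes; production corpora are millions to billions
corpus = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
query = [random.gauss(0, 1) for _ in range(D)]
top5 = exact_top_k(query, corpus, 5)
```

Every query touches every vector; an ANN index exists precisely to avoid that full scan.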

Typical algorithmic families:

  • HNSW (Hierarchical Navigable Small World graph) — graph-based; state-of-the-art recall/latency; popular in practice (Lucene, FAISS, Vespa, Qdrant).
  • IVF / IVFPQ (inverted file + product quantization) — partitioning + compression; used in FAISS, Milvus.
  • Annoy (Spotify's random projection trees) — read-only, simple.
  • ScaNN (Google) — learned quantization + pruning.
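
The IVF idea can be sketched in pure Python. Centroids here are hand-picked rather than k-means-trained as a real library (e.g., FAISS) would do, and all names are illustrative:

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class ToyIVF:
    """Inverted-file sketch: coarse centroids partition the corpus into
    cells; a query scans only its nprobe nearest cells, not all N items."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.cells = [[] for _ in centroids]

    def add(self, item_id, vec):
        # Assign each item to the cell of its nearest coarse centroid.
        cell = min(range(len(self.centroids)),
                   key=lambda c: sq_dist(vec, self.centroids[c]))
        self.cells[cell].append((item_id, vec))

    def search(self, query, k, nprobe=2):
        # Probe only the nprobe cells closest to the query, then rank
        # the candidates in those cells by dot product.
        probe = sorted(range(len(self.centroids)),
                       key=lambda c: sq_dist(query, self.centroids[c]))[:nprobe]
        cands = [(dot(query, vec), item_id)
                 for c in probe for item_id, vec in self.cells[c]]
        return sorted(cands, reverse=True)[:k]

# Hand-picked 2-D centroids; a real IVF index trains them with k-means.
ivf = ToyIVF([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
ivf.add("a", [0.9, 0.1])
ivf.add("b", [0.1, 0.95])
ivf.add("c", [-0.8, 0.2])
hits = ivf.search([1.0, 0.05], k=2, nprobe=2)  # never scans the cell holding "c"
```

The approximation is visible in the last line: with nprobe=2, any true neighbor living in an unprobed cell is simply missed, which is where the recall-vs-latency dial lives.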

Role in production recommendation systems

An ANN index is the serving artifact for item embeddings in production recommendation / ads / search systems. Candidates flow through it at several points in the funnel:

  • Retrieval — generate candidate set from billions of items.
  • Early ranking (e.g., Pinterest L1) — narrow further under tight latency before expensive downstream ranking.
  • Similar-item / related-item surfaces — direct user-facing applications of k-NN.

The serving-artifact distinction

A crucial production-engineering point, central to Pinterest's 2026-02-27 O/O retrospective (sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr):

"It's not enough for features to exist in training logs or the Feature Store — they also need to be present in the serving artifacts (like ANN indices) that L1 actually uses to serve traffic."

The ANN index is built from a different feature pipeline than the one the model trained on, and often a different pipeline than the L2 Feature Store that downstream stages consume. A feature that's in training logs + the Feature Store but never onboarded into the ANN-index build path is effectively invisible to any stage that reads from that index — causing silent online-offline discrepancy.
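
A toy illustration of that coverage gap, using hypothetical feature names rather than Pinterest's actual schemas, as a set-difference check:

```python
# Hypothetical feature inventories for three pipelines (illustrative names).
training_log_features = {"item_embedding_v5", "fresh_ctr", "topic_vector"}
feature_store_features = {"item_embedding_v5", "fresh_ctr", "topic_vector"}
ann_index_build_features = {"item_embedding_v5", "topic_vector"}  # fresh_ctr never onboarded

# A feature is only live for a stage if it reaches every serving artifact
# that stage reads. Present in training logs + Feature Store but absent
# from the index build path = silent online-offline discrepancy.
missing_from_index = (training_log_features
                      & feature_store_features) - ann_index_build_features
```

Auditing serving artifacts against training logs this way turns a silent discrepancy into an explicit, checkable diff.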

Update cadence + version skew

ANN indices are typically rebuilt on a cadence much slower than model-release cadence: hourly snapshots for streaming enrichment, multi-day full rebuilds on large tiers at Pinterest scale. This means:

  • The index holds a mix of embedding versions at any moment.
  • Query-side embeddings (which run at request time from the live query tower) refresh instantly on model rollout.
  • Item-side embeddings (which must propagate through snapshot + rebuild + deploy) lag by hours to days.

This structural cadence mismatch produces embedding version skew, a specific cause of O/O discrepancy in two-tower systems. Pinterest mitigates this by favoring batch embedding inference for large tiers, so that each rebuild uses a single consistent checkpoint.
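
A back-of-the-envelope sketch of the skew window, with hypothetical dates and a hypothetical two-day rebuild period:

```python
from datetime import datetime, timedelta

# Hypothetical timeline. On model rollout the query tower switches to the
# new checkpoint immediately; item embeddings only flip at the next rebuild.
rollout = datetime(2026, 2, 27, 9, 0)       # query tower now at v(n+1)
last_rebuild = datetime(2026, 2, 26, 0, 0)  # index still holds v(n) items
rebuild_period = timedelta(days=2)
next_rebuild = last_rebuild + rebuild_period

# Window during which query-side v(n+1) scores item-side v(n) embeddings.
skew_window = next_rebuild - rollout
```

With these numbers the system serves mixed-version dot products for 15 hours; a slower rebuild cadence or an unlucky rollout time stretches that to days.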

Design axes

  • Recall target — how close to exact k-NN the index must come; drives algorithm + parameter choice.
  • Latency budget — how much query-time compute is acceptable.
  • Build-time budget — how quickly the index can be rebuilt + deployed; caps refresh cadence.
  • Memory footprint — HNSW graphs are memory-hungry; PQ-style indices trade recall for a smaller footprint.
  • Update pattern — streaming upserts vs batch rebuilds.
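
The recall target is usually measured as recall@K against exact k-NN ground truth; a minimal sketch:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-K (from exact k-NN) that the ANN index
    returned; the standard metric when tuning parameters such as HNSW's
    efSearch or IVF's nprobe."""
    exact = set(exact_ids)
    return sum(1 for i in approx_ids if i in exact) / len(exact)

# Toy result sets: the index recovered 3 of the 5 true neighbors.
r = recall_at_k([3, 7, 9, 12, 40], [3, 7, 9, 21, 50])  # 0.6
```

Sweeping a query-time parameter and plotting recall@K against latency is the standard way to pick an operating point on the first two axes above.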
