
CONCEPT Cited by 1 source

Generative retrieval

Definition

Generative retrieval is a recommendation / search architecture in which the retrieval stage generates the identifier of the next relevant item token by token with an autoregressive decoder, instead of scoring every candidate against the request. The item identifier is typically a sequence of codewords from a learned Semantic ID codebook (see RQ-VAE), built so that items sharing a codeword prefix are semantically similar.
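To make the prefix-sharing property concrete, here is a minimal sketch of residual quantization, the encoding step an RQ-VAE performs once its codebooks have been learned. The codebooks, the 2-D embeddings, and the item names are hypothetical values chosen for illustration; a real system would learn the codebooks from item features.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Index of the codeword closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)),
    )


def semantic_id(embedding: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize level by level, passing the residual down to the next codebook."""
    residual = embedding
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = tuple(r - c for r, c in zip(residual, codebook[idx]))
    return codes


# Hypothetical, hand-written codebooks (an RQ-VAE would learn these).
CODEBOOKS = [
    [(0.0, 0.0), (10.0, 10.0)],  # level 0: coarse neighbourhood
    [(-1.0, 0.0), (1.0, 0.0)],   # level 1: fine offset within the neighbourhood
]

apple = semantic_id((1.0, 0.2), CODEBOOKS)
pear = semantic_id((-0.8, 0.1), CODEBOOKS)
soap = semantic_id((10.9, 10.0), CODEBOOKS)
```

In this toy setup the two nearby embeddings (apple, pear) share their first codeword, while the distant one (soap) does not. That shared prefix is what the decoder exploits.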

Quote (Source: sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart):

"We rebuilt the system, by moving from an encoder that scores products to a generative model that spells them out, token by token."

Scoring vs generation

Two architectures, two cost structures, two failure modes:

Axis | Scoring retrieval | Generative retrieval
Output shape | Probability over full vocabulary | Token-by-token sequence
Vocabulary | Atomic item IDs | Semantic IDs (codeword sequences)
Inference primitive | Top-K from scored vocabulary | Beam search over codeword positions
Vocabulary growth | Catalog-bounded (bottleneck) | Codebook-bounded
Cold-start (new items) | Hard: needs transaction history | Easy: codebook covers all items
Coherence within candidate set | Flat (e.g. laundry detergent in a breakfast cart) | Hierarchical: autoregressive prefix conditioning constrains the beam
Tunable dial | Top-K threshold | Beam width + temperature
Compute cost | O(vocab size) per request | O(beam_width × decode_steps × decoder_compute)
Production examples | Pinterest two-tower CGs, Meta SilverTorch (in-graph index) | TIGER (Google), Spotify GLIDE/NEO, YouTube PLUM, Instacart 2026-06
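The compute-cost row can be read as two simple cost models. The sketch below only restates the table's asymptotics in code. The function names, the abstract cost units, and all numbers are hypothetical and are not taken from any of the cited systems.

```python
def scoring_cost(vocab_size: int, cost_per_item: float) -> float:
    """Cost of scoring every item in the vocabulary once per request."""
    return vocab_size * cost_per_item


def generative_cost(beam_width: int, decode_steps: int, decoder_step_cost: float) -> float:
    """Cost of one decoder call per beam entry per codeword position."""
    return beam_width * decode_steps * decoder_step_cost


# Hypothetical numbers, chosen only to show the shape of the trade-off.
small_catalog = scoring_cost(vocab_size=1_000_000, cost_per_item=1.0)
large_catalog = scoring_cost(vocab_size=10_000_000, cost_per_item=1.0)
generative = generative_cost(beam_width=50, decode_steps=4, decoder_step_cost=1000.0)
```

Under these assumptions the scoring cost grows tenfold when the catalog grows tenfold, while the generative cost has no catalog-size term at all. Whether generation is cheaper in absolute terms depends on the per-step decoder cost, which this sketch treats as a free parameter.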

Why generation specifically

The 2026-06 Instacart source articulates three structural ceilings of scoring that generation dissolves:

  1. Vocabulary bottleneck — scoring a fixed atomic-ID vocabulary forces a model-size / sparsity / coverage trade-off. "The model constructs the semantic representation of the next item on the fly, avoiding the memory and latency penalties that previously restricted our catalog coverage."
  2. Cold-start hurdle — atomic-ID models memorise co-occurrences; new products without history can't be retrieved. Semantic IDs give every product a codebook position from day 1.
  3. Structural drift — flat probability distributions across a heterogeneous vocabulary leak across semantic neighbourhoods. "Generating auto regressively means each codeword is explicitly conditioned on the previous one. This enforces a strict hierarchy during retrieval. If the model begins generating a prefix for 'Produce,' the beam search remains confined to that semantic neighborhood."
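The prefix conditioning described in point 3 can be illustrated with a small beam search that only extends prefixes present in the catalog. This is a sketch under stated assumptions: `toy_model` is a hypothetical lookup table standing in for a trained decoder, and the set of valid prefixes is a flat stand-in for the trie a production system would more likely use.

```python
import math
from typing import Callable, Dict, List, Sequence, Set, Tuple

Prefix = Tuple[int, ...]


def valid_prefixes(semantic_ids: Sequence[Prefix]) -> Set[Prefix]:
    """Every prefix of every catalog Semantic ID (a flat stand-in for a trie)."""
    return {sid[:n] for sid in semantic_ids for n in range(1, len(sid) + 1)}


def beam_search(
    next_log_probs: Callable[[Prefix], Dict[int, float]],
    allowed: Set[Prefix],
    depth: int,
    beam_width: int,
) -> List[Tuple[Prefix, float]]:
    """Keep the beam_width best prefixes at each codeword position."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    for _ in range(depth):
        candidates = []
        for prefix, score in beams:
            for token, log_p in next_log_probs(prefix).items():
                extended = prefix + (token,)
                if extended in allowed:  # never leave the catalog
                    candidates.append((extended, score + log_p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_model(prefix: Prefix) -> Dict[int, float]:
    """Hypothetical next-codeword distribution, conditioned on the prefix."""
    table = {
        (): {0: 0.7, 1: 0.3},
        (0,): {0: 0.6, 1: 0.4},
        (1,): {0: 0.9, 1: 0.1},
    }
    return {tok: math.log(p) for tok, p in table[prefix].items()}


CATALOG = [(0, 0), (0, 1), (1, 0)]  # hypothetical two-codeword Semantic IDs
narrow = beam_search(toy_model, valid_prefixes(CATALOG), depth=2, beam_width=2)
wide = beam_search(toy_model, valid_prefixes(CATALOG), depth=2, beam_width=3)
```

With a beam width of 2, both surviving candidates share the first codeword, so the result stays inside one neighbourhood. Widening the beam to 3 admits a candidate from the second neighbourhood.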

Tunable dials — beam width and temperature

A scoring model has one knob: top-K. A generative retriever has two:

  • Beam width controls how many candidate sequences are tracked at each decode step. Wider beam = more candidate diversity.
  • Temperature controls the entropy of the token distribution at each step. Higher temperature = more exploration; lower = more exploitation.

The wins compose: "Unlike scoring models, the generative approach unlocks direct tuning mechanisms through beam width and temperature sampling. These serve as precise levers to balance intent and exploration — allowing us to dial up strict precision on search pages, while turning up brand diversity and discovery on post-checkout surfaces." — see concepts/diversity-via-beam-and-temperature.
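The effect of temperature on the per-step token distribution can be checked numerically. The sketch below uses the standard temperature-scaled softmax. The logits are hypothetical, and how any production system combines temperature sampling with beam search is not specified in the source.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Softmax over logits divided by the temperature (must be > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


LOGITS = [2.0, 1.0, 0.0]  # hypothetical next-codeword logits
sharp = softmax_with_temperature(LOGITS, temperature=0.5)
flat = softmax_with_temperature(LOGITS, temperature=2.0)
```

Here `sharp` concentrates more mass on the top codeword (exploitation), while `flat` has higher entropy and spreads mass across alternatives (exploration).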

Sibling architectures in the broader retrieval design space

Generative retrieval sits alongside three other retrieval paradigms on the wiki:

  • Two-tower / dual-encoder — asymmetric pre-compute: item embeddings indexed offline, query embedding computed once per request, scoring via dot-product over the ANN index.
  • Index as Model (Meta SilverTorch 2026-05-26) — items live as a tensor inside the retrieval model graph; the cross-service hop disappears but the scoring paradigm is preserved.
  • Sequence-model scoring (Pinterest contextual sequential CG, Instacart's prior CR): Transformer-based two-tower models that take sequence inputs but still output a probability distribution over the full atomic-ID vocabulary.
  • Generative retrieval — TIGER, Spotify GLIDE/NEO, YouTube PLUM, Instacart 2026-06 — abandons scoring entirely; recommendation becomes autoregressive sequence generation.

The 2026-05-26 SilverTorch source and the 2026-06-02 Instacart source are architecturally orthogonal alternatives to "score every item against the request": SilverTorch keeps two-tower asymmetric pre-compute but absorbs the index into the model graph; Instacart abandons two-tower / ANN entirely and replaces it with autoregressive generation.
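For contrast with generation, the scoring primitive shared by the two-tower variants can be sketched as a dot product followed by top-K selection. This is a brute-force stand-in: a production system would use an ANN index rather than scanning every item, and the item names and embeddings are hypothetical.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(
    query: Sequence[float],
    item_embeddings: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every item against the query and keep the k best."""
    scored = ((dot(query, emb), item_id) for item_id, emb in item_embeddings.items())
    return [(item_id, score) for score, item_id in heapq.nlargest(k, scored)]


# Hypothetical item embeddings, computed offline in a real two-tower system.
ITEMS = {
    "oat_milk": (0.9, 0.1),
    "granola": (0.8, 0.3),
    "detergent": (0.1, 0.9),
}
top = top_k(query=(1.0, 0.0), item_embeddings=ITEMS, k=2)
```

Every item is touched once per request, which is the O(vocab size) behaviour that generative retrieval replaces with a fixed number of decode steps.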

When NOT to use generative retrieval

Conditions under which scoring retrieval remains the right choice:

  • Item identifiers don't have learnable structure. Generative retrieval depends on a meaningful codebook (RQ-VAE over rich item features). Without it, the codeword vocabulary is arbitrary and the prefix-sharing benefit disappears.
  • Latency budget is too tight for autoregressive decoding. Decoding cost scales with sequence length × beam width; for ultra-low-latency surfaces (sub-millisecond ad serving), scoring may still win.
  • Tail precision matters more than diversity. Generative retrieval shines when the request is broad and the win is brand / item diversity. For narrow-intent surgical retrieval (e.g. a specific search query for a known brand), scoring + reranking may still win.
  • No GPU serving substrate available. As Instacart explicitly notes, the legacy "Python and CPU inference" stack is "not viable" — generative retrieval requires a GPU stack.

Caveats

  • This is a young paradigm: the TIGER paper dates from 2023, and the production deployments at Spotify, YouTube, and Instacart all date from 2024-2026.
  • Long-term stability of the codebook across re-training / catalog drift is not yet well-characterised in published work.
  • The post-decode mapping layer (Instacart's retailer-partitioned index) is essential for generic SID-to-real-product attribution; without it, the generated SID could fan out to many products with no ranking discipline at the mapping layer.
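The role of the post-decode mapping layer can be sketched as a lookup partitioned by retailer. The source does not describe the internals of Instacart's index, so the class name, method names, and ordering rule below are all hypothetical: generated SIDs are resolved in beam order, and products under the same SID keep their insertion order.

```python
from typing import Dict, List, Sequence, Tuple

SemanticId = Tuple[int, ...]


class RetailerPartitionedIndex:
    """Hypothetical SID-to-product lookup with one partition per retailer."""

    def __init__(self) -> None:
        self._partitions: Dict[str, Dict[SemanticId, List[str]]] = {}

    def add(self, retailer_id: str, sid: SemanticId, product_id: str) -> None:
        partition = self._partitions.setdefault(retailer_id, {})
        partition.setdefault(sid, []).append(product_id)

    def resolve(self, retailer_id: str, sids: Sequence[SemanticId], limit: int) -> List[str]:
        """Map generated SIDs to this retailer's products, in beam order."""
        partition = self._partitions.get(retailer_id, {})
        seen = set()
        products: List[str] = []
        for sid in sids:
            for product_id in partition.get(sid, []):
                if product_id in seen:
                    continue  # the same product may sit under several SIDs
                seen.add(product_id)
                products.append(product_id)
                if len(products) == limit:
                    return products
        return products


index = RetailerPartitionedIndex()
index.add("retailer_a", (0, 1), "p1")
index.add("retailer_a", (0, 1), "p2")
index.add("retailer_b", (0, 1), "p9")

for_a = index.resolve("retailer_a", [(0, 1), (1, 0)], limit=5)
for_b = index.resolve("retailer_b", [(0, 1)], limit=5)
```

The same generated SID resolves to different products per retailer, and the `limit` argument caps how far a single SID can fan out.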

Seen in
