PATTERN Cited by 1 source

Generative retrieval over scoring retrieval

Pattern

When a recsys retrieval stage hits the vocabulary bottleneck of scoring atomic item IDs at non-stationary catalog scale, replace the scoring model with an autoregressive generative model that decodes Semantic IDs token-by-token via beam search.

The pattern composes three ingredients that must change together:

  1. Vocabulary substrate change: atomic item IDs → Semantic IDs from an RQ-VAE codebook.
  2. Inference paradigm change: softmax-over-vocabulary scoring → autoregressive token generation via beam search.
  3. Serving stack change: Python+CPU → GPU stack with TensorRT-LLM + Triton (patterns/gpu-serving-stack-tensorrt-llm-triton).

Any one ingredient without the others doesn't produce the headline benefits.
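
The substrate change can be illustrated with a toy residual quantizer. This is a minimal sketch, not an RQ-VAE: the codebooks below are hand-picked rather than learned, and the names `nearest`, `semantic_id`, and `CODEBOOKS` are hypothetical, introduced only for illustration. It shows how quantizing the residual level by level turns an item embedding into a tuple of codewords.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: List[Vector], x: Vector) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    def dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def semantic_id(embedding: Vector, codebooks: List[List[Vector]]) -> Tuple[int, ...]:
    """Quantize level by level, handing the leftover residual to the next codebook."""
    residual = list(embedding)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen codeword; the next level encodes what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(tokens)


# Hypothetical 2-level codebooks over 2-d embeddings (hand-picked, not trained).
CODEBOOKS = [
    [[0.0, 0.0], [10.0, 10.0]],              # level 0: coarse
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],    # level 1: fine
]

sid_a = semantic_id([10.9, 10.1], CODEBOOKS)
sid_b = semantic_id([10.1, 10.8], CODEBOOKS)
sid_c = semantic_id([0.2, 0.9], CODEBOOKS)
```

Under these assumed codebooks the first two embeddings share their leading codeword and differ only in the second, which is the prefix-sharing property a generative decoder can exploit.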

Where the pattern shows up

  • TIGER paper (systems/tiger-generative-retrieval; Google DeepMind, NeurIPS 2023): academic demonstration on public benchmarks.
  • Spotify GLIDE / NEO: production deployment at music recommendation scale.
  • YouTube PLUM: production deployment with multilingual extension to natural-language tokens alongside SIDs.
  • Instacart 2026-06 (systems/instacart-generative-ads-retrieval): production deployment with grocery-distinctive prompt template and retailer-partitioned mapping; first wiki canonical disclosure with operational numbers.
  • Google DeepMind ActionPiece: emerging direction extending the substrate from items to user actions.

Why all three changes are required together

Each component without the others falls short:

Combination | Fails because
Generative retrieval + atomic IDs | Vocabulary bottleneck re-emerges: generating tokens from a catalog-sized vocabulary is expensive and sparse.
Semantic IDs + scoring retrieval | Wastes the prefix-sharing property; flat-distribution outlier leakage persists; beam-width-as-diversity dial unavailable.
Generative + Semantic IDs + CPU serving | Latency unviable: autoregressive decoding with beam search is "fairly compute intensive"; legacy stack "not viable" per the source.

The pattern is a substrate + paradigm + serving co-redesign, not a single architectural choice.

The wins, mapped to the structural changes

Win | Structural cause
Catalog coverage from day 1 | Semantic IDs cover all items via codebook prefixes
Generalisation over memorisation | Codeword-space training, not atomic-ID training
125× embedding-parameter reduction | Codebook-bounded vocabulary, not catalog-bounded
Coherent candidate sets | Autoregressive prefix conditioning during beam search
Tunable per-surface diversity | Beam width + temperature dials, not retraining
2× candidate volume at 10% to 17% lower latency | GPU serving stack absorbs the autoregressive cost
+5% CTR / +34% ATC / 2.7× brand diversity | Composition of the above
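
The "codebook-bounded, not catalog-bounded" cause can be made concrete with a little arithmetic. The sketch below assumes each quantization level has its own set of tokens, so the decoder's embedding table has `levels × codebook_size` rows; every number in it is hypothetical and is not meant to reproduce the 125× figure reported by the source.

```python
def atomic_id_embedding_params(catalog_size: int, dim: int) -> int:
    """One embedding row per catalog item: grows with the catalog."""
    return catalog_size * dim


def semantic_id_embedding_params(levels: int, codebook_size: int, dim: int) -> int:
    """One embedding row per (level, codeword) token: independent of catalog size."""
    return levels * codebook_size * dim


# Hypothetical sizes, chosen only to illustrate the scaling behaviour.
catalog_size = 1_000_000
dim = 64
levels = 4
codebook_size = 256

atomic = atomic_id_embedding_params(catalog_size, dim)
semantic = semantic_id_embedding_params(levels, codebook_size, dim)

# Number of distinct Semantic IDs the codebooks can address.
capacity = codebook_size ** levels

print(f"atomic-ID table:   {atomic:,} parameters")
print(f"Semantic-ID table: {semantic:,} parameters")
print(f"addressable SIDs:  {capacity:,}")
```

Adding items to the catalog changes `atomic` but leaves `semantic` untouched, as long as the new items still fit within the addressable Semantic ID space.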

Structural pieces

1. Vocabulary substrate
   ├─ RQ-VAE trained on item features
   └─ Output: K-codeword Semantic IDs per item
2. Generative retriever
   ├─ Autoregressive Transformer decoder
   ├─ Trained on (context_template, next_item_SID) pairs
   └─ Inference: beam search over codeword positions
3. Post-decode mapping
   ├─ retailer-partitioned index (or equivalent)
   └─ SIDs → available, attributed candidates
4. Serving substrate
   ├─ TensorRT-LLM compiled decoder
   ├─ Triton Inference Server
   ├─ Go-native service shell
   └─ Hosted on ML platform (Griffin 2.0)
        ↓
   downstream ranker (unchanged)
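
Pieces 2 and 3 of the outline above can be sketched end to end as follows. This is an illustration under stated assumptions, not the source's implementation: `toy_next_log_probs` is a hypothetical lookup table standing in for the Transformer decoder, and `RETAILER_INDEX` with its item names is an invented stand-in for the retailer-partitioned mapping.

```python
import math
from typing import Callable, Dict, List, Tuple

SID = Tuple[int, ...]


def beam_search(
    next_log_probs: Callable[[SID], Dict[int, float]],
    depth: int,
    beam_width: int,
) -> List[Tuple[SID, float]]:
    """Decode `depth` codeword positions, keeping the best `beam_width` prefixes."""
    beams: List[Tuple[SID, float]] = [((), 0.0)]
    for _ in range(depth):
        expanded = []
        for prefix, score in beams:
            # Each next-token distribution is conditioned on the prefix so far.
            for token, lp in next_log_probs(prefix).items():
                expanded.append((prefix + (token,), score + lp))
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = expanded[:beam_width]
    return beams


def toy_next_log_probs(prefix: SID) -> Dict[int, float]:
    """Hypothetical stand-in for the decoder: fixed next-codeword probabilities."""
    table = {
        (): {0: 0.6, 1: 0.4},
        (0,): {0: 0.7, 1: 0.3},
        (1,): {0: 0.2, 1: 0.8},
    }
    return {tok: math.log(p) for tok, p in table[prefix].items()}


# Hypothetical retailer-partitioned index: SID -> items available at that retailer.
RETAILER_INDEX: Dict[str, Dict[SID, List[str]]] = {
    "retailer_a": {(0, 0): ["item_17"], (1, 1): ["item_42", "item_43"]},
    "retailer_b": {(0, 1): ["item_99"]},
}


def map_to_candidates(sids: List[SID], retailer: str) -> List[str]:
    """Post-decode mapping: keep only items the given retailer actually carries."""
    partition = RETAILER_INDEX.get(retailer, {})
    candidates: List[str] = []
    for sid in sids:
        candidates.extend(partition.get(sid, []))
    return candidates


beams = beam_search(toy_next_log_probs, depth=2, beam_width=2)
sids = [sid for sid, _ in beams]
candidates = map_to_candidates(sids, "retailer_a")
```

Raising `beam_width` widens the set of decoded Semantic IDs without retraining anything, which is the diversity dial the pattern relies on; the mapping step then drops any decoded ID the retailer cannot serve.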

When NOT to apply this pattern

The pattern's structural cost is high (substrate + paradigm + serving all change). Conditions under which scoring retrieval remains right:

  • Stationary or small catalog: no vocabulary bottleneck.
  • No usable item-feature representation: RQ-VAE can't produce a meaningful codebook from arbitrary IDs.
  • Latency budget too tight for autoregressive decoding: the beam_width × decode_steps cost may exceed the available latency envelope on some surfaces.
  • No GPU serving substrate: the pattern depends on GPU serving; CPU-only environments can't host it economically.
  • Tight precision on a narrow-intent surface: scoring + reranker may produce a sharper top-K than beam-search exploration. (Per the Instacart source, generative retrieval was deployed on browse surfaces specifically; search surfaces with narrow intent were not yet migrated.)
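
The latency condition above can be checked with a back-of-the-envelope estimate before committing to the redesign. The sketch below assumes a simple linear cost model (a fixed per-step overhead plus a per-beam cost), which is a hypothetical simplification; real numbers would have to come from profiling the actual serving stack. All function names and figures are illustrative.

```python
from typing import Iterable, Optional


def estimated_latency_ms(
    beam_width: int,
    decode_steps: int,
    fixed_ms_per_step: float,
    ms_per_beam_step: float,
) -> float:
    """Assumed linear model: every decode step pays a fixed cost plus a per-beam cost."""
    return decode_steps * (fixed_ms_per_step + ms_per_beam_step * beam_width)


def max_beam_width(
    budget_ms: float,
    decode_steps: int,
    fixed_ms_per_step: float,
    ms_per_beam_step: float,
    widths: Iterable[int],
) -> Optional[int]:
    """Largest candidate beam width whose estimated latency fits the budget, else None."""
    fitting = [
        w for w in widths
        if estimated_latency_ms(w, decode_steps, fixed_ms_per_step, ms_per_beam_step) <= budget_ms
    ]
    return max(fitting) if fitting else None


# Hypothetical surface: 4 codeword positions, 20 ms retrieval budget.
choice = max_beam_width(
    budget_ms=20.0,
    decode_steps=4,
    fixed_ms_per_step=2.0,
    ms_per_beam_step=0.05,
    widths=[10, 50, 100, 200],
)
```

If the function returns `None` for every beam width a surface would need, that surface falls under the "latency budget too tight" condition and scoring retrieval likely remains the better fit.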

Surfaces where the pattern composes well

  • Browse / discovery: diversity matters more than precision.
  • Cart-completion / pre-checkout: brand exploration matters.
  • Post-checkout: maximum-diversity surfaces work well.
  • Multi-tenant retail-media: the retailer-partitioned mapping layer is a natural fit.

Caveats

  • This is a young pattern. Long-term operational characteristics (codebook drift, retraining cadence, multi-surface tuning discipline) are not yet well documented publicly.
  • The substrate change is the highest-risk ingredient: SIDs must remain stable across retraining or downstream consumers suffer version skew.
  • The serving-substrate change has independent value (Python+CPU → Go+GPU benefits any compute-intensive retrieval workload), but it does not produce the recsys wins on its own.
  • Ranker-side calibration with the new candidate distribution is acknowledged as a risk in the source but is not addressed architecturally.
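
One way to watch the SID-stability caveat is to measure how many items change Semantic ID between two codebook versions. The sketch below is a hypothetical monitoring helper, not something described in the source; `sid_churn` and the sample mappings are invented for illustration.

```python
from typing import Dict, Optional, Tuple

SID = Tuple[int, ...]


def sid_churn(
    old: Dict[str, SID],
    new: Dict[str, SID],
    prefix_len: Optional[int] = None,
) -> float:
    """Fraction of items present in both versions whose SID (or SID prefix) changed."""
    shared = old.keys() & new.keys()
    if not shared:
        return 0.0

    def key(sid: SID) -> SID:
        # Compare the whole SID, or only the leading codewords if prefix_len is set.
        return sid if prefix_len is None else sid[:prefix_len]

    changed = sum(1 for item in shared if key(old[item]) != key(new[item]))
    return changed / len(shared)


# Hypothetical item -> SID mappings from two retraining runs.
OLD = {"item_1": (3, 7, 1), "item_2": (3, 7, 2), "item_3": (5, 0, 4), "item_4": (5, 1, 1)}
NEW = {
    "item_1": (3, 7, 1),
    "item_2": (3, 8, 2),   # second codeword moved
    "item_3": (6, 0, 4),   # leading codeword moved
    "item_4": (5, 1, 1),
    "item_5": (2, 2, 2),   # new item, ignored by the comparison
}

full_churn = sid_churn(OLD, NEW)
prefix_churn = sid_churn(OLD, NEW, prefix_len=1)
```

Tracking both numbers separates fine-grained reshuffling from changes to the leading codeword, which are likelier to disturb anything keyed on SID prefixes.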
