PATTERN Cited by 1 source

Generative retrieval over scoring retrieval¶

Pattern¶

When a recsys retrieval stage hits the vocabulary bottleneck of scoring atomic item IDs at non-stationary catalog scale, replace the scoring model with an autoregressive generative model that decodes Semantic IDs token-by-token via beam search.

The pattern composes three ingredients that must change together:

Vocabulary substrate change — atomic item IDs → Semantic IDs from an RQ-VAE codebook.
Inference paradigm change — softmax-over-vocabulary scoring → autoregressive token generation via beam search.
Serving stack change — Python+CPU → GPU stack with TensorRT-LLM + Triton (patterns/gpu-serving-stack-tensorrt-llm-triton).

Any one ingredient without the others doesn't produce the headline benefits.
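The substrate change can be illustrated with the quantization step alone. The following is a minimal sketch, assuming fixed hand-written codebooks and 2-d embeddings; a real RQ-VAE learns both the encoder and the codebooks, which is not shown. The names `semantic_id`, `CODEBOOKS`, and `ITEMS` are hypothetical and do not come from the source.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_codeword(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codeword closest to vec (squared L2 distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def semantic_id(embedding: Vector, codebooks: Sequence[Sequence[Vector]]) -> Tuple[int, ...]:
    """Quantize level by level; each level encodes the residual left by the previous one."""
    residual = list(embedding)
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest_codeword(residual, codebook)
        tokens.append(idx)
        # Subtract the chosen codeword so the next level refines what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(tokens)


# Hypothetical two-level codebooks (assumption: illustrative values only).
CODEBOOKS = [
    [(0.0, 0.0), (10.0, 10.0)],             # level 0: coarse clusters
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],   # level 1: fine offsets
]

# Hypothetical item embeddings.
ITEMS = {
    "oat_milk": (10.9, 10.1),
    "almond_milk": (10.1, 10.8),
    "dish_soap": (0.2, 0.1),
}

SIDS = {name: semantic_id(vec, CODEBOOKS) for name, vec in ITEMS.items()}
print(SIDS)
```

In this toy setup the two milk items share the first codeword and differ only in the second, which is the prefix-sharing property that the generative decoder relies on.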

Where the pattern shows up¶

TIGER paper (systems/tiger-generative-retrieval — Google DeepMind, NeurIPS 2023) — academic demonstration on public benchmarks.
Spotify GLIDE / NEO — production deployment at music recommendation scale.
YouTube PLUM — production deployment with multilingual extension to natural-language tokens alongside SIDs.
Instacart 2026-06 (systems/instacart-generative-ads-retrieval) — production deployment with a grocery-distinctive prompt template and retailer-partitioned mapping; first canonical wiki disclosure with operational numbers.
Google DeepMind ActionPiece — emerging direction extending the substrate from items to user actions.

Why all three changes are required together¶

Each component without the others falls short:

Combination	Fails because
Generative retrieval + atomic IDs	Vocabulary bottleneck re-emerges — generating tokens from a catalog-sized vocabulary is expensive and sparse.
Semantic IDs + scoring retrieval	Wastes the prefix-sharing property; flat-distribution outlier leakage persists; beam-width-as-diversity dial unavailable.
Generative + Semantic IDs + CPU serving	Latency unviable — autoregressive decoding with beam search is "fairly compute intensive"; legacy stack "not viable" per the source.

The pattern is a substrate + paradigm + serving co-redesign, not a single architectural choice.
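The paradigm change, decoding a Semantic ID one codeword at a time, can be sketched as a plain beam search. This is an illustrative sketch, not the source's implementation: `TOY_MODEL` is a hypothetical lookup table standing in for the autoregressive Transformer decoder, and all probabilities are made up.

```python
import heapq
import math
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[int, ...]
# Stand-in for the decoder: maps a prefix to log-probabilities of the next codeword.
NextLogProbs = Callable[[Prefix], Dict[int, float]]


def beam_search(next_log_probs: NextLogProbs, sid_length: int,
                beam_width: int) -> List[Tuple[float, Prefix]]:
    """Return up to beam_width (log_prob, semantic_id) pairs, best first."""
    beams: List[Tuple[float, Prefix]] = [(0.0, ())]
    for _ in range(sid_length):
        candidates = []
        for score, prefix in beams:
            # Extend every surviving prefix by every possible next codeword.
            for token, logp in next_log_probs(prefix).items():
                candidates.append((score + logp, prefix + (token,)))
        # Keep only the beam_width highest-scoring prefixes.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams


# Hypothetical next-codeword distributions (assumption: illustrative values only).
TOY_MODEL: Dict[Prefix, Dict[int, float]] = {
    (): {0: math.log(0.6), 1: math.log(0.4)},
    (0,): {0: math.log(0.7), 1: math.log(0.3)},
    (1,): {0: math.log(0.9), 1: math.log(0.1)},
}


def toy_next_log_probs(prefix: Prefix) -> Dict[int, float]:
    return TOY_MODEL[prefix]


results = beam_search(toy_next_log_probs, sid_length=2, beam_width=2)
for log_prob, sid in results:
    print(sid, round(math.exp(log_prob), 2))
```

Raising `beam_width` returns more decoded Semantic IDs from the same model without retraining.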

The wins, mapped to the structural changes¶

Win	Structural cause
Catalog coverage from day 1	Semantic IDs cover all items via codebook prefixes
Generalisation over memorisation	Codeword-space training, not atomic-ID training
125× embedding-parameter reduction	Codebook-bounded vocabulary, not catalog-bounded
Coherent candidate sets	Autoregressive prefix conditioning during beam search
Tunable per-surface diversity	Beam width + temperature dials, not retraining
2× candidate volume at 10–17% lower latency	GPU serving stack absorbs the autoregressive cost
+5% CTR / +34% ATC / 2.7× brand diversity	Composition of the above
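The codebook-bounded versus catalog-bounded distinction behind the embedding-parameter row can be made concrete with a small calculation. The sizes below are hypothetical and are not the source's configuration, so the ratio does not reproduce the 125× figure. The sketch also assumes one embedding table per codebook level.

```python
def atomic_embedding_params(catalog_size: int, dim: int) -> int:
    """One embedding row per catalog item."""
    return catalog_size * dim


def semantic_embedding_params(levels: int, codewords_per_level: int, dim: int) -> int:
    """One embedding row per codeword per level (assumed layout)."""
    return levels * codewords_per_level * dim


# Hypothetical sizes, chosen only for illustration.
atomic = atomic_embedding_params(catalog_size=1_000_000, dim=64)
semantic = semantic_embedding_params(levels=4, codewords_per_level=256, dim=64)
ratio = atomic / semantic

print(atomic, semantic, ratio)
```

The semantic-side count depends only on the codebook shape, so adding items to the catalog leaves it unchanged.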

Structural pieces¶

1. Vocabulary substrate
   ├─ RQ-VAE trained on item features
   └─ Output: K-codeword Semantic IDs per item
                   │
                   ▼
2. Generative retriever
   ├─ Autoregressive Transformer decoder
   ├─ Trained on (context_template, next_item_SID) pairs
   └─ Inference: beam search over codeword positions
                   │
                   ▼
3. Post-decode mapping
   ├─ retailer-partitioned index (or equivalent)
   └─ SIDs → available, attributed candidates
                   │
                   ▼
4. Serving substrate
   ├─ TensorRT-LLM compiled decoder
   ├─ Triton Inference Server
   ├─ Go-native service shell
   └─ Hosted on ML platform (Griffin 2.0)
                   │
                   ▼
        downstream ranker (unchanged)
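Step 3 of the diagram, the post-decode mapping, can be sketched as a dictionary lookup partitioned by retailer. The index layout, the function name `map_sids_to_candidates`, and the product identifiers are all assumptions for illustration; the source does not describe its index at this level of detail.

```python
from typing import Dict, List, Sequence, Tuple

SID = Tuple[int, ...]
RetailerIndex = Dict[str, Dict[SID, List[str]]]

# Hypothetical index: retailer -> Semantic ID -> products that retailer carries.
RETAILER_INDEX: RetailerIndex = {
    "retailer_a": {
        (1, 1): ["a-oat-milk-1l"],
        (1, 2): ["a-almond-milk-1l", "a-almond-milk-2l"],
    },
    "retailer_b": {
        (1, 1): ["b-oat-milk-1l"],
    },
}


def map_sids_to_candidates(retailer_id: str, decoded_sids: Sequence[SID],
                           index: RetailerIndex) -> List[str]:
    """Resolve decoded Semantic IDs to products available at one retailer.

    SIDs with no entry in the retailer's partition are dropped; duplicates are
    removed while preserving beam order.
    """
    partition = index.get(retailer_id, {})
    seen = set()
    candidates: List[str] = []
    for sid in decoded_sids:
        for product in partition.get(sid, []):
            if product not in seen:
                seen.add(product)
                candidates.append(product)
    return candidates


decoded = [(1, 2), (1, 1), (1, 1)]
print(map_sids_to_candidates("retailer_b", decoded, RETAILER_INDEX))
```

Because the decoder emits Semantic IDs rather than product IDs, the same decoded beam can resolve to different candidates per retailer.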

When NOT to apply this pattern¶

The pattern's structural cost is high (substrate + paradigm + serving all change). Conditions under which scoring retrieval remains right:

Stationary or small catalog — no vocabulary bottleneck.
No usable item-feature representation — RQ-VAE can't produce a meaningful codebook from arbitrary IDs.
Latency budget too tight for autoregressive decoding — the beam_width × decode_steps cost may exceed the available envelope on some surfaces.
No GPU serving substrate — the pattern depends on GPU serving; CPU-only environments can't host it economically.
Tight precision on a narrow-intent surface — scoring + reranker may produce sharper top-K than beam-search exploration. (Per the Instacart source, generative retrieval was deployed on browse surfaces specifically; search surfaces with narrow intent were not yet migrated.)
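The latency condition above can be checked with a back-of-the-envelope estimate. This is a rough sketch under a pessimistic assumption that hypotheses are evaluated one at a time with no batching; the per-evaluation latency and budget values are hypothetical placeholders, not measurements from the source.

```python
def decoder_evaluations(beam_width: int, decode_steps: int) -> int:
    """Upper bound on (hypothesis, step) decoder evaluations in one beam search."""
    return beam_width * decode_steps


def fits_latency_budget(beam_width: int, decode_steps: int,
                        ms_per_evaluation: float, budget_ms: float) -> bool:
    """Pessimistic check: assumes evaluations run sequentially (no GPU batching)."""
    estimated_ms = decoder_evaluations(beam_width, decode_steps) * ms_per_evaluation
    return estimated_ms <= budget_ms


# Hypothetical numbers for illustration only.
wide = fits_latency_budget(beam_width=20, decode_steps=4, ms_per_evaluation=0.5, budget_ms=30.0)
narrow = fits_latency_budget(beam_width=10, decode_steps=4, ms_per_evaluation=0.5, budget_ms=30.0)
print(wide, narrow)
```

On a real GPU stack the hypotheses within a step are typically batched, so this estimate is an upper bound rather than a prediction.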

Surfaces where the pattern composes well¶

Browse / discovery — diversity matters more than precision.
Cart-completion / pre-checkout — brand-exploration matters.
Post-checkout — surfaces tuned for maximum diversity work well.
Multi-tenant retail-media — retailer-partitioned mapping layer is a natural fit.

Caveats¶

This is a young pattern. Long-term operational characteristics (codebook drift, retraining cadence, multi-surface tuning discipline) are not yet well documented in public sources.
The substrate change is the highest-risk ingredient: SIDs must remain stable across retraining or downstream consumers suffer version skew.
The serving-substrate change has independent value (Python+CPU → Go+GPU is good for any compute-intensive retrieval workload) but it does not produce the recsys wins on its own.
Ranker-side calibration against the new candidate distribution is acknowledged as a risk in the source but is not addressed architecturally.

Seen in¶

sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — first canonical wiki disclosure of the pattern at production scale with operational numbers.

concepts/generative-retrieval — the canonical concept.
concepts/semantic-id / concepts/atomic-product-id-vs-semantic-id / concepts/vocabulary-bottleneck / concepts/beam-search-retrieval — supporting concepts.
systems/instacart-generative-ads-retrieval / systems/instacart-contextual-recommendations / systems/tiger-generative-retrieval — production / reference systems.
patterns/rq-vae-codebook-as-product-vocabulary / patterns/context-template-prompt-with-special-tokens / patterns/beam-search-with-retailer-partitioned-mapping / patterns/gpu-serving-stack-tensorrt-llm-triton / patterns/go-native-ml-serving — sibling patterns that compose to make this one work.