PATTERN Cited by 1 source
Generative retrieval over scoring retrieval¶
Pattern¶
When a recsys retrieval stage hits the vocabulary bottleneck of scoring atomic item IDs at non-stationary catalog scale, replace the scoring model with an autoregressive generative model that decodes Semantic IDs token-by-token via beam search.
The pattern composes three ingredients that must change together:
- Vocabulary substrate change — atomic item IDs → Semantic IDs from an RQ-VAE codebook.
- Inference paradigm change — softmax-over-vocabulary scoring → autoregressive token generation via beam search.
- Serving stack change — Python+CPU → GPU stack with TensorRT-LLM + Triton (patterns/gpu-serving-stack-tensorrt-llm-triton).
Any one ingredient without the others doesn't produce the headline benefits.
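To make the vocabulary substrate change concrete, here is a minimal sketch of the residual quantization step that turns an item embedding into a Semantic ID. It is illustrative only: a real RQ-VAE learns both an encoder and the codebooks, whereas the fixed `CODEBOOKS`, the two-dimensional embeddings, and the function names `nearest` and `semantic_id` below are hypothetical and not taken from the source.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Return the index of the codeword closest to v (squared Euclidean distance)."""
    def sq_dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def semantic_id(embedding: Vector, codebooks: Sequence[Sequence[Vector]]) -> Tuple[int, ...]:
    """Quantize level by level: pick the nearest codeword, then pass on the residual."""
    residual: List[float] = list(embedding)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)


# Hypothetical hand-written codebooks (a trained RQ-VAE would learn these).
CODEBOOKS = [
    [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)],  # level 0: coarse clusters
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],    # level 1: fine offsets
]

item_a = (10.9, 0.1)
item_b = (10.1, 0.8)
item_c = (0.2, 9.9)

print(semantic_id(item_a, CODEBOOKS))  # (1, 1)
print(semantic_id(item_b, CODEBOOKS))  # (1, 2)
print(semantic_id(item_c, CODEBOOKS))  # (2, 0)
```

Items `item_a` and `item_b` are close in embedding space, so they share the first codeword and differ only at the second level. That shared prefix is the property the generative decoder exploits.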
Where the pattern shows up¶
- TIGER paper (systems/tiger-generative-retrieval — Google DeepMind, NeurIPS 2023) — academic demonstration on public benchmarks.
- Spotify GLIDE / NEO — production deployment at music recommendation scale.
- YouTube PLUM — production deployment with multilingual extension to natural-language tokens alongside SIDs.
- Instacart 2026-06 (systems/instacart-generative-ads-retrieval) — production deployment with grocery-distinctive prompt template + retailer-partitioned mapping; first wiki canonical disclosure with operational numbers.
- Google DeepMind ActionPiece — emerging direction extending the substrate from items to user actions.
Why all three changes are required together¶
Each component without the others falls short:
| Combination | Fails because |
|---|---|
| Generative retrieval + atomic IDs | Vocabulary bottleneck re-emerges — generating tokens from a catalog-sized vocabulary is expensive and sparse. |
| Semantic IDs + scoring retrieval | Wastes the prefix-sharing property; flat-distribution outlier leakage persists; beam-width-as-diversity dial unavailable. |
| Generative + Semantic IDs + CPU serving | Latency unviable — autoregressive decoding with beam search is "fairly compute intensive"; legacy stack "not viable" per the source. |
The pattern is a substrate + paradigm + serving co-redesign, not a single architectural choice.
The wins, mapped to the structural changes¶
| Win | Structural cause |
|---|---|
| Catalog coverage from day 1 | Semantic IDs cover all items via codebook prefixes |
| Generalisation over memorisation | Codeword-space training, not atomic-ID training |
| 125× embedding-parameter reduction | Codebook-bounded vocabulary, not catalog-bounded |
| Coherent candidate sets | Autoregressive prefix conditioning during beam search |
| Tunable per-surface diversity | Beam width + temperature dials, not retraining |
| 2× candidate volume at 10–17% lower latency | GPU serving stack absorbs the autoregressive cost |
| +5% CTR / +34% ATC / 2.7× brand diversity | Composition of the above |
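The "codebook-bounded vocabulary" row can be illustrated with a small piece of arithmetic. All sizes below are hypothetical and chosen only to show the mechanism; they are not the source's configuration and do not reproduce its 125× figure. The sketch also assumes one token embedding per (level, codeword) pair, which is one common way to set up the decoder vocabulary.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """Parameter count of an embedding table with vocab_size rows of width dim."""
    return vocab_size * dim


# Hypothetical sizes, for illustration only.
catalog_items = 1_000_000      # one row per item under atomic IDs
levels = 3                     # codewords per Semantic ID
codewords_per_level = 256      # codebook size at each level
dim = 64                       # embedding width

atomic_params = embedding_params(catalog_items, dim)
sid_params = embedding_params(levels * codewords_per_level, dim)
addressable_ids = codewords_per_level ** levels

print(atomic_params)                  # 64000000
print(sid_params)                     # 49152
print(addressable_ids)                # 16777216
print(round(atomic_params / sid_params))  # 1302
```

The Semantic ID table grows with the codebook, not with the catalog, while the number of addressable IDs grows exponentially in the number of levels. Adding items to the catalog therefore does not add embedding rows.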
Structural pieces¶
1. Vocabulary substrate
├─ RQ-VAE trained on item features
└─ Output: K-codeword Semantic IDs per item
│
▼
2. Generative retriever
├─ Autoregressive Transformer decoder
├─ Trained on (context_template, next_item_SID) pairs
└─ Inference: beam search over codeword positions
│
▼
3. Post-decode mapping
├─ retailer-partitioned index (or equivalent)
└─ SIDs → available, attributed candidates
│
▼
4. Serving substrate
├─ TensorRT-LLM compiled decoder
├─ Triton Inference Server
├─ Go-native service shell
└─ Hosted on ML platform (Griffin 2.0)
│
▼
downstream ranker (unchanged)
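Steps 2 and 3 of the diagram can be sketched as follows. This is a toy under stated assumptions, not the source's implementation: the lookup table `TOY_PROBS` stands in for the autoregressive Transformer decoder, and `beam_search`, `toy_next_logprobs`, `RETAILER_INDEX`, `map_to_candidates`, and the SKU strings are all hypothetical names introduced for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

SID = Tuple[int, ...]

# Stand-in for the decoder: next-codeword probabilities given a prefix.
TOY_PROBS: Dict[SID, Dict[int, float]] = {
    (): {0: 0.6, 1: 0.4},
    (0,): {0: 0.7, 1: 0.3},
    (1,): {0: 0.9, 1: 0.1},
}


def toy_next_logprobs(prefix: SID) -> Dict[int, float]:
    return {tok: math.log(p) for tok, p in TOY_PROBS[prefix].items()}


def beam_search(
    next_logprobs: Callable[[SID], Dict[int, float]],
    num_levels: int,
    beam_width: int,
) -> List[Tuple[SID, float]]:
    """Extend every beam by one codeword per level, keeping the top beam_width."""
    beams: List[Tuple[SID, float]] = [((), 0.0)]
    for _ in range(num_levels):
        candidates = [
            (prefix + (tok,), score + lp)
            for prefix, score in beams
            for tok, lp in next_logprobs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Hypothetical retailer-partitioned index: retailer -> SID -> available products.
RETAILER_INDEX: Dict[str, Dict[SID, List[str]]] = {
    "retailer_a": {(0, 0): ["a-sku-1", "a-sku-2"], (0, 1): ["a-sku-3"]},
    "retailer_b": {(1, 0): ["b-sku-9"]},
}


def map_to_candidates(beams: List[Tuple[SID, float]], retailer: str) -> List[str]:
    """Keep only decoded SIDs that resolve to products at the given retailer."""
    index = RETAILER_INDEX.get(retailer, {})
    out: List[str] = []
    for sid, _score in beams:
        out.extend(index.get(sid, []))
    return out


narrow = beam_search(toy_next_logprobs, num_levels=2, beam_width=2)
wide = beam_search(toy_next_logprobs, num_levels=2, beam_width=3)

print([sid for sid, _ in narrow])               # [(0, 0), (1, 0)]
print(map_to_candidates(narrow, "retailer_a"))  # ['a-sku-1', 'a-sku-2']
print(map_to_candidates(wide, "retailer_a"))    # ['a-sku-1', 'a-sku-2', 'a-sku-3']
```

Widening the beam from 2 to 3 surfaces an additional Semantic ID without retraining, which is the sense in which beam width acts as a diversity dial. The mapping step then drops any decoded SID that has no product at the requesting retailer.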
When NOT to apply this pattern¶
The pattern's structural cost is high (substrate + paradigm + serving all change). Conditions under which scoring retrieval remains right:
- Stationary or small catalog — no vocabulary bottleneck.
- No usable item-feature representation — RQ-VAE can't produce a meaningful codebook from arbitrary IDs.
- Latency budget too tight for autoregressive decoding — the beam_width × decode_steps cost may exceed the available envelope on some surfaces.
- No GPU serving substrate — the pattern depends on GPU serving; CPU-only environments can't host it economically.
- Tight precision on a narrow-intent surface — scoring + reranker may produce sharper top-K than beam-search exploration. (Per the Instacart source, generative retrieval was deployed on browse surfaces specifically; search surfaces with narrow intent were not yet migrated.)
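The latency condition above can be checked with a back-of-the-envelope estimate. The sketch below treats beam_width × decode_steps as a count of beam-step evaluations and multiplies it by a per-evaluation cost. The function names and every number are hypothetical, and the linear model is a crude upper bound: on a GPU the beams are typically batched, so wall-clock latency usually grows more slowly than this product.

```python
def decode_cost(beam_width: int, decode_steps: int) -> int:
    """Number of beam-step evaluations for one request."""
    return beam_width * decode_steps


def fits_budget(
    beam_width: int,
    decode_steps: int,
    ms_per_beam_step: float,
    budget_ms: float,
) -> bool:
    """Crude check: does the unbatched decode cost fit the latency envelope?"""
    return decode_cost(beam_width, decode_steps) * ms_per_beam_step <= budget_ms


# Hypothetical numbers, for illustration only.
print(decode_cost(20, 4))                                            # 80
print(fits_budget(20, 4, ms_per_beam_step=0.5, budget_ms=50.0))      # True
print(fits_budget(50, 4, ms_per_beam_step=0.5, budget_ms=50.0))      # False
```

A surface that fails this check at the beam width it needs for adequate recall is a candidate for staying on scoring retrieval.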
Surfaces where the pattern composes well¶
- Browse / discovery — diversity matters more than precision.
- Cart-completion / pre-checkout — brand-exploration matters.
- Post-checkout — surfaces that tolerate maximum diversity work well.
- Multi-tenant retail-media — retailer-partitioned mapping layer is a natural fit.
Caveats¶
- This is a young pattern. Long-term operational characteristics (codebook drift, retraining cadence, multi-surface tuning discipline) are not yet well documented publicly.
- The substrate change is the highest-risk ingredient: SIDs must remain stable across retraining or downstream consumers suffer version skew.
- The serving-substrate change has independent value (Python+CPU → Go+GPU is good for any compute-intensive retrieval workload) but it does not produce the recsys wins on its own.
- Ranker-side calibration with the new candidate distribution is acknowledged as a risk in the source but not addressed architecturally.
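One way to watch for the SID-stability risk noted above is to measure how many items change Semantic ID between two codebook versions. The sketch below is an assumption about how such a check might look, not something described in the source; `sid_churn` and the sample assignments are hypothetical.

```python
from typing import Dict, Tuple

SID = Tuple[int, ...]


def sid_churn(old: Dict[str, SID], new: Dict[str, SID]) -> float:
    """Fraction of items present in both versions whose Semantic ID changed."""
    shared = old.keys() & new.keys()
    if not shared:
        return 0.0
    changed = sum(1 for item in shared if old[item] != new[item])
    return changed / len(shared)


# Hypothetical assignments before and after a codebook retrain.
old_sids = {"a": (1, 1), "b": (1, 2), "c": (2, 0), "d": (0, 0)}
new_sids = {"a": (1, 1), "b": (1, 0), "c": (2, 0), "e": (0, 1)}

churn = sid_churn(old_sids, new_sids)
print(f"{churn:.2f}")  # 0.33
```

Items that exist in only one version (`d` and `e` here) are excluded, since catalog turnover is expected; only reassignment of surviving items signals version skew for downstream consumers.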
Seen in¶
- sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — first canonical wiki disclosure of the pattern at production scale with operational numbers.
Related¶
- concepts/generative-retrieval — the canonical concept.
- concepts/semantic-id / concepts/atomic-product-id-vs-semantic-id / concepts/vocabulary-bottleneck / concepts/beam-search-retrieval — supporting concepts.
- systems/instacart-generative-ads-retrieval / systems/instacart-contextual-recommendations / systems/tiger-generative-retrieval — production / reference systems.
- patterns/rq-vae-codebook-as-product-vocabulary / patterns/context-template-prompt-with-special-tokens / patterns/beam-search-with-retailer-partitioned-mapping / patterns/gpu-serving-stack-tensorrt-llm-triton / patterns/go-native-ml-serving — sibling patterns that compose to make this one work.