CONCEPT Cited by 1 source

Beam search retrieval¶

Definition¶

Beam search retrieval is the use of beam search at inference time to generate K distinct candidate item identifiers from an autoregressive recsys decoder. At each decoding step, beam search maintains the top-B (beam width) most likely partial sequences; when decoding completes, the resulting B distinct full sequences are mapped back to candidate items by a downstream lookup.

Beam search retrieval is the inference primitive that consumes a Semantic ID vocabulary in generative retrieval systems (TIGER, Instacart's generative ads retrieval).

Quote (Source: sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart):

"During serving, we build the candidate set via beam search. As illustrated in the diagram below, the decoder reads this input and generates recommendations token by token. At each step, beam search explores multiple promising paths for the next codeword. This ultimately yields several distinct, fully formed SID sequences. Finally, these generated sequences are mapped against a retailer-partitioned index to retrieve a diverse variety of relevant, available ad products."

How it differs from beam search in NLP¶

Beam search is the same algorithm as in NLP / machine translation, but the primitive's role in recsys is different:

Axis	NLP beam search	Recsys beam search
Output	Single best sequence (top-1 from beam)	All K sequences (top-K candidates)
Goal	Best translation / response	Diverse candidate set
Termination	Decode complete sentence	Decode K codewords (fixed-length SID)
Tunable	Beam width tuned for accuracy	Beam width + temperature tuned for exploration
Downstream	Output is the answer	Output is K candidates → ranking stage

The recsys variant wants diversity in the beam, not single-best output. That makes the width of the beam directly tunable as a diversity dial.

Two structural advantages over scoring retrieval¶

1. Hierarchical retrieval discipline¶

When the autoregressive decoder generates the first codeword, it is choosing a coarse semantic neighbourhood (per the hierarchical-codebook property of RQ-VAE). Subsequent codewords narrow within that neighbourhood. Beam search at each step is therefore choosing among semantically meaningful options, not across the full catalog.

"Generating auto regressively means each codeword is explicitly conditioned on the previous one. This enforces a strict hierarchy during retrieval. If the model begins generating a prefix for 'Produce,' the beam search remains confined to that semantic neighborhood, actively preventing the random outlier leakage caused by flat probability distributions."

This dissolves the "laundry detergent in a breakfast cart" structural-drift failure mode of scoring retrieval over flat atomic- ID distributions.

2. Surface-specific tuning without retraining¶

Beam search exposes two runtime knobs not available to scoring retrieval:

Beam width — wider beam = more diverse candidates.
Temperature — higher temperature = more exploration in the per-step distribution.

The same model can be tuned per surface:

"These serve as precise levers to balance intent and exploration — allowing us to dial up strict precision on search pages, while turning up brand diversity and discovery on post-checkout surfaces."

See concepts/diversity-via-beam-and-temperature.

Cost geometry¶

Beam search retrieval cost is roughly:

cost = B (beam_width) × K (decode_steps) × decoder_compute_per_step

This is independent of catalog size — the fundamental property that lets generative retrieval escape the vocabulary bottleneck. A catalog 10× larger doesn't change the inference cost at all (only the eventual mapping-layer lookup grows).

For Instacart's 2026-06 deployment: K=4 codewords, beam_width undisclosed, decoder_compute_per_step is what TensorRT-LLM optimises. The per-request candidate volume ends up 2× the prior CR system's at 10-17% lower mean latency.

Required substrate: post-decode mapping¶

Generative retrieval produces SIDs, not specific products. A post-decode mapping layer is required to resolve the SID to actual candidate items in the consumer's context. Instacart's choice: retailer-partitioned index — "these generated sequences are mapped against a retailer-partitioned index to retrieve a diverse variety of relevant, available ad products". The partition by retailer ensures only available, attributed ads for the current retailer's catalog are surfaced.

Caveats¶

Beam search adds latency proportional to B × K; the choice of beam width is a quality-cost trade-off. Instacart's specific beam-width value not disclosed.
Beam search's diversity is vocabulary-mediated — if the codebook concentrates many distinct products under the same SID, beam search produces SID diversity but not product diversity.
Beam search at large B can produce sequences that don't map to any product in the post-decode index; rejection sampling / beam-extension fallback strategies aren't disclosed in the source.
Speculative decoding and other autoregressive-acceleration tricks (covered elsewhere on the wiki for LLM serving) compose with beam search but aren't disclosed for Instacart's specific deployment.

Seen in¶

sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — first canonical wiki disclosure: beam search as the inference primitive for Instacart's generative ads retrieval.

concepts/generative-retrieval — paradigm beam search is the inference primitive for.
concepts/semantic-id — what the beam search decodes into.
concepts/diversity-via-beam-and-temperature — tuning dials.
concepts/retailer-partitioned-index — the post-decode mapping layer.
systems/instacart-generative-ads-retrieval / systems/tiger-generative-retrieval — the production system / reference paper.