INSTACART Tier 2

Instacart — From Scoring to Spelling: Rebuilding Ads Retrieval at Instacart

Summary

Instacart rebuilt its ads retrieval candidate generator for browse surfaces (retailer home page, pre-checkout) by replacing a BERT-based encoder that scores every product ID in a fixed vocabulary with a generative model that spells out the next product token by token via beam search. The new vocabulary is not atomic product IDs but Instacart Semantic IDs (SIDs): short sequences of codewords from an RQ-VAE-learned codebook (e.g. 35_7_120_184) in which semantically similar products share prefixes. The architecture follows TIGER (Google DeepMind, 2023), a paradigm recently adopted in production by Spotify (GLIDE, NEO) and YouTube (PLUM), and is adapted to grocery's distinctive "shopping list spans fresh food to cleaning supplies and pet care all within a single session" shape via a prompt template with special tokens (retailer-type token + user-history SIDs + cart SIDs). At serve time, beam search generates several distinct SID sequences token by token; each generated SID is mapped back to ad products via a retailer-partitioned index. The shift addresses three structural ceilings of the prior scoring architecture (the vocabulary bottleneck, the cold-start hurdle for new products, and structural drift from flat probability distributions) and produced +5% click-through rate, +34% add-to-carts, and 2.7× more brands and 1.8× more sub-categories in retrieved candidates, with diversity gains of +421% in Alcohol, +396% in Beverages, and +229% in Healthcare. Despite producing ~2× the candidate volume, mean retrieval latency decreased 10–17%, enabled by a new GPU serving stack built on TensorRT-LLM + NVIDIA Triton Inference Server and deployed as a Go-native service on Griffin 2.0, Instacart's ML serving platform.

Key takeaways

From scoring to spelling — the load-bearing thesis. Verbatim: "We rebuilt the system, by moving from an encoder that scores products to a generative model that spells them out, token by token." The prior Contextual Recommendations (CR) model — a BERT-like Transformer trained on "millions of authentic shopping sessions to predict the next token (i.e. singular product) in the sequence" where each token was an atomic product ID — "scores every product ID in its vocabulary against the current session and returns the top K products." The new model generates the SID of the next item instead of scoring all products. This is structurally the same shift Spotify (GLIDE, NEO) and YouTube (PLUM) made — "a generative paradigm has been adopted in production by companies such as Spotify (GLIDE, NEO) and YouTube (PLUM)" — applied to Instacart's grocery-distinctive intent shape.
Three structural ceilings that scoring hit. The post frames each one as a structural problem of the scoring architecture, not a tuning problem:
The vocabulary bottleneck. "The CR model relies on atomic product IDs as distinct tokens, which establishes the boundaries of what the model can interpret and predict. While expanding this vocabulary enhances the model's ability to understand the detailed context of a user's session, it simultaneously increases model size and latency while creating data sparsity for less common items. Additionally this catalog is non-stationary. As new products are added to the catalog, the coverage gap keeps expanding."
The 'cold start' hurdle. "This occasionally caused it to memorize co-occurrences instead of learning generalized associations based on the user's intent. This resulted in the model favoring high-frequency items over newer products which are more aligned with the user's context." Concrete example given: "while a user is building a cart toward a summer barbecue [eg: ground beef, hamburger buns, lettuce], the previous system had a tendency to default to a generic grocery staple [eg: milk] rather than surfacing an emerging brand's condiment [eg: mustard] that fits the intent better."
Structural drift. "The final candidate set from the model is generated by predicting a probability distribution across the entire vocabulary of product IDs. Without a built-in hierarchy to keep the recommendations focused, the model occasionally retrieves a disjointed mix of items. For example, a breakfast-themed cart [e.g., milk, eggs, cereal] may lead to laundry detergent being retrieved along with other valid recommendations [e.g., bread, muffins]."
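For contrast with the generative approach, the scoring shape these three ceilings stem from can be sketched in a few lines. This is a minimal illustration, not Instacart's CR model: the two-dimensional vectors, the product names, and the `score_all_products` helper are all hypothetical, and a plain dot product stands in for the Transformer's output layer.

```python
import heapq


def score_all_products(session_vec, product_vecs, k):
    """Score every product in the vocabulary against the session, return the top K.

    Work grows linearly with the vocabulary size, and nothing in the flat
    score list ties a candidate to the semantic neighborhood of the others.
    """
    scores = {
        pid: sum(s * p for s, p in zip(session_vec, vec))
        for pid, vec in product_vecs.items()
    }
    return heapq.nlargest(k, scores, key=scores.get)


# Hypothetical toy vocabulary of atomic product IDs.
product_vecs = {
    "milk": [1.0, 0.0],
    "mustard": [0.0, 1.0],
    "detergent": [-1.0, -1.0],
}
session_vec = [0.2, 0.9]  # hypothetical session embedding
top_two = score_all_products(session_vec, product_vecs, k=2)
```

A product that is not yet in `product_vecs` cannot be returned at all, which is the coverage gap the vocabulary-bottleneck quote describes.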
Semantic IDs replace atomic product IDs as the vocabulary. SIDs are "short sequences of codewords generated by an RQ-VAE". Example: "A product's SID looks like 35_7_120_184: four tokens from learned codebooks at different granularity levels." Three load-bearing properties:
Coverage to every catalog item regardless of purchase history. "A new product entering the catalog is added to one of the existing SIDs and is visible to the model from day one." — directly addresses the cold-start hurdle.
Generalisation over memorisation. "The model learns to generalize sequences better based on semantic codewords instead of simply learning specific product co-occurrences."
Embedding-parameter compression. "The embedding parameter space within the model is decreased by 125x." Verbatim example showing prefix-sharing: 35_7_119_493 (Organic Good Seed Thin Sliced) / 35_7_120_184 (Artisanal Italian Bread) / 35_7_120_185 (Classic Italian Bread) — three semantically similar bread products share the first two codewords.
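The prefix-sharing property can be checked directly on the three bread SIDs quoted above. The sketch below assumes only that a SID is an underscore-separated string of integer codewords, as in the post's examples; the helper names `parse_sid` and `shared_prefix_len` are illustrative.

```python
from itertools import takewhile


def parse_sid(sid: str) -> tuple:
    """Split a SID string such as '35_7_120_184' into integer codewords."""
    return tuple(int(part) for part in sid.split("_"))


def shared_prefix_len(a: str, b: str) -> int:
    """Count the leading codewords two SIDs have in common."""
    pairs = zip(parse_sid(a), parse_sid(b))
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], pairs))


# The three bread products quoted from the post.
breads = {
    "Organic Good Seed Thin Sliced": "35_7_119_493",
    "Artisanal Italian Bread": "35_7_120_184",
    "Classic Italian Bread": "35_7_120_185",
}

italian_pair = shared_prefix_len(
    breads["Artisanal Italian Bread"], breads["Classic Italian Bread"]
)
seed_vs_italian = shared_prefix_len(
    breads["Organic Good Seed Thin Sliced"], breads["Artisanal Italian Bread"]
)
```

The two Italian breads agree on three codewords, while the seed bread agrees with them on two, so a longer shared prefix corresponds to closer products in these examples.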
The context template — a richer training corpus. Compressing the vocabulary frees up token budget for context. The prompt template (per the post): "A retailer type token tells the model which catalog and shopping context the user is shopping in. Because our marketplace retailers span grocery, pet, beauty, home goods, and more, this token helps us capture the distinction. User history SIDs from past purchases capture long-term preferences. By taking the top N previously purchased SIDs and expressing them in the same token format the model generates in, we seamlessly connect past behavior to future predictions. Cart SIDs capture the real-time intent of the current session. While user history tells the model what someone typically likes, the cart SIDs tells it what they are building today, adapting as new items are added." Architectural property: "The template structure also gives us a clean interface for future signals (such as occasion awareness, search queries, page type) without architectural changes. Each new signal is simply a new segment in the prompt."
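A minimal sketch of how such a prompt might be assembled follows. The post does not publish its special tokens, so the token spellings (`<retailer:...>`, `<history>`, `<cart>`, `<next>`), the level-tagged codeword format `c{level}_{code}`, and the `build_prompt` helper are all assumptions made for illustration.

```python
def sid_to_tokens(sid: str) -> list:
    """Turn '35_7_120_184' into level-tagged tokens (hypothetical format)."""
    return [f"c{level}_{code}" for level, code in enumerate(sid.split("_"))]


def build_prompt(retailer_type, history_sids, cart_sids, top_n=2):
    """Concatenate retailer, history, and cart segments into one token list."""
    tokens = [f"<retailer:{retailer_type}>", "<history>"]
    for sid in history_sids[:top_n]:  # top N previously purchased SIDs
        tokens += sid_to_tokens(sid)
    tokens.append("<cart>")
    for sid in cart_sids:  # real-time intent of the current session
        tokens += sid_to_tokens(sid)
    tokens.append("<next>")  # hypothetical marker where generation starts
    return tokens


prompt = build_prompt(
    retailer_type="grocery",
    history_sids=["35_7_120_184"],
    cart_sids=["35_7_119_493"],
)
```

Under this layout, a future signal such as page type would be one more delimited segment appended before `<next>`, which matches the post's claim that new signals need no architectural change.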
Beam search at serving time, retailer-partitioned mapping at the end. Verbatim: "During serving, we build the candidate set via beam search. As illustrated in the diagram below, the decoder reads this input and generates recommendations token by token. At each step, beam search explores multiple promising paths for the next codeword. This ultimately yields several distinct, fully formed SID sequences. Finally, these generated sequences are mapped against a retailer-partitioned index to retrieve a diverse variety of relevant, available ad products." The retailer partition ensures only available, attributed ads for the current retailer's catalog are surfaced.
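The serve-time flow described above (beam search over codewords, then a lookup in a retailer-partitioned index) can be sketched with a toy stand-in for the decoder. Everything here is hypothetical: `TOY_MODEL` is a hand-written table of next-codeword probabilities rather than a trained model, SIDs are shortened to three codewords for brevity, and the index contents and ad names are invented.

```python
import math

# Hypothetical next-codeword probabilities, keyed by the prefix generated so far.
TOY_MODEL = {
    (): {35: 0.7, 12: 0.3},
    (35,): {7: 0.9, 2: 0.1},
    (12,): {4: 1.0},
    (35, 7): {120: 0.6, 119: 0.4},
    (35, 2): {8: 1.0},
    (12, 4): {9: 1.0},
}

# Hypothetical retailer-partitioned index: retailer -> SID -> available ad products.
RETAILER_INDEX = {
    "retailer_a": {(35, 7, 120): ["ad_italian_bread"], (12, 4, 9): ["ad_dog_food"]},
    "retailer_b": {(35, 7, 119): ["ad_seed_bread"]},
}


def beam_search(model, beam_width, sid_len):
    """Keep the beam_width most probable prefixes at each codeword step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(sid_len):
        candidates = [
            (prefix + (code,), logp + math.log(p))
            for prefix, logp in beams
            for code, p in model.get(prefix, {}).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def retrieve_ads(retailer, beam_width=2, sid_len=3):
    """Map each generated SID to the ads available at this retailer."""
    index = RETAILER_INDEX[retailer]
    ads = []
    for sid, _ in beam_search(TOY_MODEL, beam_width, sid_len):
        ads.extend(index.get(sid, []))
    return ads
```

In this toy setup, widening the beam from 2 to 3 is what lets the `(35, 7, 119)` SID survive and surface an ad for `retailer_b`, a small illustration of beam width acting as a coverage dial.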
Three scoring-architecture ceilings, each addressed by a structural property of the generative paradigm. Direct mapping from the post:
Vocabulary bottleneck → fixed codebook size. "By generating sequences from a small, fixed set of codewords rather than scoring an ever-expanding list of product IDs, the scaling constraints of the vocabulary bottleneck disappear. The model constructs the semantic representation of the next item on the fly, avoiding the memory and latency penalties that previously restricted our catalog coverage."
Structural drift → autoregressive prefix conditioning. "Generating auto regressively means each codeword is explicitly conditioned on the previous one. This enforces a strict hierarchy during retrieval. If the model begins generating a prefix for 'Produce,' the beam search remains confined to that semantic neighborhood, actively preventing the random outlier leakage caused by flat probability distributions."
Cold start → semantic generalisation, not memorisation. SIDs give every new product a position in the codebook from day one without requiring transaction history.
Beam width and temperature as exploration dials. "Unlike scoring models, the generative approach unlocks direct tuning mechanisms through beam width and temperature sampling. These serve as precise levers to balance intent and exploration — allowing us to dial up strict precision on search pages, while turning up brand diversity and discovery on post-checkout surfaces." The same model + index serves multiple surfaces with different precision/exploration trade-offs without retraining.
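The effect of the temperature dial can be seen on a single decoding step. The sketch below uses the standard temperature-scaled softmax; the logits are made-up values, and the post does not say which temperatures Instacart uses on any surface.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate codewords at one step.
logits = [2.0, 1.0, 0.0]
precise = softmax_with_temperature(logits, temperature=0.5)  # sharper
exploratory = softmax_with_temperature(logits, temperature=2.0)  # flatter
```

A low temperature concentrates probability on the top codeword (precision), while a high temperature spreads it across alternatives (exploration), which is the trade-off the quote describes.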
GPU serving stack: TensorRT-LLM + Triton + Go-native + Griffin 2.0. Verbatim: "As autoregressive decoding with beam search is fairly compute intensive, it was not viable to serve this model [on] the legacy serving stack that relied on Python and CPU inference. To unblock this model serving, the team developed a brand new GPU serving stack. This new system leverages TensorRT-LLM for high-performance inference and is deployed on Nvidia's Triton Inference Server. … Implemented as a Go-native service, it delivers higher throughput and lower latency compared to the legacy Python environment. It is fully integrated with Griffin 2.0, Instacart's machine learning serving platform." The post describes three high-speed operations: (1) Input Translation: features are dynamically fetched and collated into the input prompt; (2) GPU Model Inference: beam-search SID generation; (3) Product Mapping and Indexing: SIDs are mapped via the retailer-partitioned index to active ad products, "ensuring that only relevant, available, and correctly attributed ads are retrieved."
Headline operational results: more candidates at lower latency, with higher engagement. Despite producing ~2× the candidate volume, mean retrieval latency decreased by 10–17%. Online A/B test against the incumbent model: +5% click-through rate and +34% add-to-carts (the post calls this a "step-function increase"). Qualitative: "a customer purchasing pet food at a big box retailer now receives pet-specific recommendations instead of broader grocery suggestions", i.e. better alignment in tail categories such as pet care and beauty.
Brand diversity is the load-bearing win. "By overcoming the limitations of a fixed token space, TIGER recommended 2.7x more brands and 1.8x more sub-categories than the previous system." The category-conditional gains are dramatic: "+421% in Alcohol, +396% in Beverages, and +229% in Healthcare. In these categories, the previous solution's architectural ceiling prevented these products from being retrieved." Stated thesis: "This unlocks new potential for Instacart's ads ecosystem, creating a valuable opportunity for emerging brands to drive growth by surfacing their products in highly contextual placements."

Architecture & numbers

Datum	Value	Source / context
Prior model architecture	BERT-like Transformer scoring atomic product IDs	systems/instacart-contextual-recommendations (CR)
New model architecture	Autoregressive decoder generating SIDs token-by-token via beam search	systems/instacart-generative-ads-retrieval
Vocabulary unit	Atomic product IDs → Semantic IDs (codeword sequences)	RQ-VAE-learned codebook
Example SID	`35_7_120_184`	4 tokens from learned codebooks at different granularity levels
SID prefix-sharing example	`35_7_119_493` / `35_7_120_184` / `35_7_120_185`	All bread products share `35_7_…` prefix
Embedding param-space reduction	125×	One of three SID-vocabulary benefits
Prompt template segments	retailer-type token + user-history SIDs + cart SIDs	Special tokens between segments
Inference primitive	Beam search at serve time	Multiple promising paths per token step
Mapping layer	Retailer-partitioned index	Constrains output to available, attributed ads
Tunable dials	Beam width + temperature sampling	Per-surface intent vs exploration
Serving substrate	TensorRT-LLM on NVIDIA Triton Inference Server	GPU inference stack
Service language	Go-native service	Replaces legacy Python+CPU stack
ML platform	Griffin 2.0	Instacart's ML serving platform
Candidate volume change	~2× more candidates	Larger candidate sets per request
Mean retrieval latency change	10–17% lower	Despite ~2× candidate volume
Click-through rate	+5%	A/B test vs incumbent
Add-to-carts	+34%	A/B test vs incumbent ("step-function increase")
Brand diversity	2.7× more brands	Vs previous system
Sub-category diversity	1.8× more sub-categories	Vs previous system
Alcohol diversity	+421%	Highest dense-category lift
Beverages diversity	+396%	Second-highest dense-category lift
Healthcare diversity	+229%	Third-highest dense-category lift
Surfaces launched on	Retailer home page + pre-checkout phase	Browse-not-search contexts
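The table mixes two notations: multipliers (2.7×, 1.8×) and percentage lifts (+421%, +396%, +229%). A small conversion puts them on the same scale. The helper name `lift_to_multiplier` is illustrative, and the arithmetic assumes the percentages are relative lifts over the previous system, which is how the post presents them.

```python
def lift_to_multiplier(percent_lift: float) -> float:
    """Convert a relative lift such as +421% into a multiplier of the baseline."""
    return 1.0 + percent_lift / 100.0


# Category diversity lifts as reported in the post.
category_lifts = {"Alcohol": 421, "Beverages": 396, "Healthcare": 229}
multipliers = {
    category: round(lift_to_multiplier(lift), 2)
    for category, lift in category_lifts.items()
}
```

On that reading, +421% means roughly 5.2 times the baseline diversity, +396% roughly 5.0 times, and +229% roughly 3.3 times. Without the absolute baselines, which the post does not give, the multipliers alone do not say how many brands or sub-categories that represents.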

Architectural primitives extracted

New systems

systems/instacart-generative-ads-retrieval — the generative retrieval model itself; autoregressive decoder; beam search at serve; retailer-partitioned index mapping; replaces the prior CR scoring model on browse surfaces.
systems/instacart-semantic-ids — RQ-VAE-learned codebook representation of catalog products as short codeword sequences; the vocabulary substrate the generative retriever operates on.
systems/instacart-contextual-recommendations — the prior BERT-based scoring model the new system replaces; canonicalised on the wiki for the deprecation-axis story.
systems/instacart-griffin-2 — Instacart's ML serving platform (named in the post as the host of the new GPU stack).
systems/tiger-generative-retrieval — Google DeepMind's generative retrieval architecture (NeurIPS 2023) the post is inspired by; canonicalised as a reference system.
systems/tensorrt-llm — NVIDIA's high-performance LLM inference engine; substrate for the new serving stack.
systems/nvidia-triton-inference-server — NVIDIA's serving platform; the runtime layer above TensorRT-LLM.
systems/rq-vae — Residual Quantized Variational Autoencoder; the algorithm that learns the codebook from product features.

New concepts

concepts/generative-retrieval — the broader paradigm: recommendation-as-spelling-the-next-item, replacing recommendation-as-scoring-all-items.
concepts/semantic-id — codebook-based item ID with prefix-sharing semantic similarity property.
concepts/atomic-product-id-vs-semantic-id — the vocabulary trade-off canonicalised: opaque-but-precise atomic IDs vs semantic-but-shared codeword sequences.
concepts/vocabulary-bottleneck — the failure mode of atomic-ID scoring at non-stationary catalog scale.
concepts/beam-search-retrieval — beam search as the retrieval primitive; produces several distinct candidate sequences per request.
concepts/retailer-partitioned-index — the post-generation mapping layer that constrains generic SIDs to retailer-specific available, attributed ad products.
concepts/diversity-via-beam-and-temperature — beam width + temperature as per-surface exploration dials.

New patterns

patterns/generative-over-scoring-retrieval — the architectural pattern shift: token-by-token generation replaces vocabulary-wide scoring.
patterns/rq-vae-codebook-as-product-vocabulary — the vocabulary substrate pattern: learn codebooks from product features, encode each product as a short codeword sequence, share prefixes for semantic similarity.
patterns/context-template-prompt-with-special-tokens — prompt structure: special tokens delimit retailer + history + cart segments; new signals join as new segments without architectural changes.
patterns/beam-search-with-retailer-partitioned-mapping — the serving-time pattern: beam search generates K SID sequences → retailer-partitioned index maps to available ad products.
patterns/gpu-serving-stack-tensorrt-llm-triton — the serving substrate pattern: TensorRT-LLM compiled model + Triton serving + GPU hardware for autoregressive-decoding workloads.
patterns/go-native-ml-serving — Go-native service replacing Python+CPU legacy stack as the request-handling shell around the GPU inference engine.

Caveats

No specific numbers on codebook size, beam width, or temperature. The post says beam width and temperature are tunable but does not give the production values for any of the three.
No latency percentiles disclosed beyond the headline mean. The 10–17% mean-latency reduction is given; tail latency (p99/p99.9) relative to the prior system is not.
No QPS / cluster topology / GPU SKU disclosed. Griffin 2.0, TensorRT-LLM, Triton, Go-native are named; throughput envelope and GPU SKUs are not.
No training cost or data volume disclosed. "Millions of historical shopping sessions" is the only training-corpus framing; no token budget, training time, or accelerator count given.
Surface scope is intentionally narrow. Only the retailer home page + pre-checkout phase shipped; the post acknowledges these are "contexts where users are browsing rather than searching, and candidate diversity & contextual relevance matter more than surgical precision." Search and post-checkout surfaces remain future work.
No comparison to alternative generative-retrieval shapes. TIGER is named as inspiration; Spotify GLIDE/NEO and YouTube PLUM are cited as production references; ActionPiece (Google DeepMind) is cited as a future direction. The post does not benchmark alternative codebook designs (e.g. per-vertical codebooks, learned vs fixed token boundaries).
No ablation on context-template segments. Retailer-type token + user history + cart SIDs are all present in the launched system; the marginal contribution of each segment is not disclosed.
No ranker-side changes disclosed. The post is exclusively about the candidate-generation (retrieval) stage. "If the subsequent ranking model was miscalibrated on these outlier products, these incoherent recommendations from the candidate set would eventually get bubbled up to the user": ranker miscalibration is acknowledged as the path by which the prior candidate generator's outliers could reach users, but no ranker work is reported.
SID quality is acknowledged as the main lever for downstream improvement but no specific roadmap items have shipped: "Future improvements include multi-resolution codebooks, co-occurrence contrastive regularization, and incorporating dietary constraints into the initial codebook level."
No per-retailer / per-vertical breakdown of the headline lifts, and no absolute baselines. +421% Alcohol / +396% Beverages / +229% Healthcare diversity is given, but the baselines are not, so a +421% lift (roughly 5.2× the baseline) could reflect either an architecturally constrained baseline of near zero or a meaningful baseline that more than quintupled.
Ad load and revenue impact not disclosed. +5% CTR and +34% add-to-carts are reported; ad-revenue lift, advertiser ROAS, and bid-density effects are not. (The companion Carrot Ads posts similarly omit ad-revenue lift in absolute terms.)
Companion SID post referenced but not consulted here. The post links a separate write-up, Semantic IDs: Product Understanding at Scale, and defers the full RQ-VAE design space to that companion. Future ingest of that source would extend systems/instacart-semantic-ids + systems/rq-vae + concepts/semantic-id with mechanism detail.
Brand-diversity gains may interact with advertiser-economics incentives — emerging brands' ads have systematically lower baseline volume, so a 2.7× brand-count metric on a pCTR-driven serving stack could reflect either a genuine intent-match improvement or a regression in the ranker's ability to suppress brand-spam. The post asserts CTR + ATC are up, suggesting it's the former, but no breakdown by emerging-brand vs established-brand click attribution is given.

Source

Original: https://tech.instacart.com/from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart-cf36b4e8d1bb?source=rss----587883b5d2ee---4
Raw markdown: raw/instacart/2026-06-02-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instaca-d794a9cc.md
Prior CR write-up referenced: Sequence Models for Contextual Recommendations at Instacart (linked from the post)
Companion SID write-up referenced: Semantic IDs: Product Understanding at Scale (linked from the post)
Reference paper: TIGER — Recommender Systems with Generative Retrieval (Rajput et al., NeurIPS 2023, Google DeepMind)
Production references cited: Spotify GLIDE / NEO; YouTube PLUM; Google DeepMind ActionPiece (future direction)

companies/instacart — adds this source as the generative-retrieval axis complementary to the existing Carrot Ads domain-adaptive-learning axis (systems/instacart-carrot-ads-pctr-model), the Generative Recommendations / Shopping Hub axis (systems/instacart-generative-recommendations-platform), the Multi-tenant marketing axis (systems/instacart-storefront-pro), and the Cold-start (new partner) axis.
systems/instacart-carrot-ads — Carrot Ads' pCTR ranker still runs over the new generative CG's candidate set; this source is on the retrieval axis, the 2026-05-04 Carrot Ads post is on the ranking axis.
systems/silvertorch — Meta's concepts/index-as-model retrieval-paradigm sibling. Both posts deeply rethink the scoring retrieval shape; SilverTorch keeps two-tower asymmetric pre-compute but absorbs the ANN index into the model graph as a tensor; Instacart abandons two-tower / ANN entirely and replaces it with autoregressive generation. Two architecturally orthogonal alternatives to "score every item against the request".
systems/pinterest-contextual-sequential-cg — Pinterest's Transformer-based two-tower CG with subject-Pin context layer; same family as Instacart's prior CR (sequence-model-as-CG with real-time context) but stays on the scoring side of the scoring/generative divide.
concepts/generative-retrieval / concepts/semantic-id / concepts/vocabulary-bottleneck — canonical pages this source establishes.
concepts/two-tower-architecture / concepts/retrieval-ranking-funnel / concepts/index-as-model — sibling retrieval primitives the generative paradigm sits alongside / replaces in the design space.
concepts/cold-start — recsys cold-start axis; SIDs give every new product day-1 visibility via codebook-prefix sharing without transaction history.