INSTACART Tier 2

Instacart — From Scoring to Spelling: Rebuilding Ads Retrieval at Instacart

Summary

Instacart rebuilt its ads retrieval candidate generator for browse surfaces (retailer home page, pre-checkout) by replacing a BERT-based encoder that scores every product ID in a fixed vocabulary with a generative model that spells out the next product token by token via beam search. The new vocabulary is not atomic product IDs but Instacart Semantic IDs (SIDs): short sequences of codewords from an RQ-VAE-learned codebook (e.g. 35_7_120_184), where semantically similar products share prefixes. The architecture follows TIGER (Google DeepMind, 2023), recently adopted in production by Spotify (GLIDE, NEO) and YouTube (PLUM), and adapts it to grocery's distinctive "shopping list spans fresh food to cleaning supplies and pet care all within a single session" shape via a prompt template with special tokens (retailer-type token + user-history SIDs + cart SIDs). At serve time, beam search generates several distinct SID sequences token by token; each generated SID is mapped back to ad products via a retailer-partitioned index. The shift addresses three structural ceilings of the prior scoring architecture (the vocabulary bottleneck, the cold-start hurdle for new products, and structural drift from flat probability distributions) and produced +5% click-through rate, +34% add-to-carts, and 2.7× more brands and 1.8× more sub-categories in retrieved candidates, with diversity gains of +421% in Alcohol, +396% in Beverages, and +229% in Healthcare. Despite producing ~2× the candidate volume, mean retrieval latency decreased 10–17%, enabled by a new GPU serving stack built on TensorRT-LLM and NVIDIA Triton Inference Server and deployed as a Go-native service on Griffin 2.0, Instacart's ML serving platform.

Key takeaways

  1. From scoring to spelling: the load-bearing thesis. Verbatim: "We rebuilt the system, by moving from an encoder that scores products to a generative model that spells them out, token by token." The prior Contextual Recommendations (CR) model, a BERT-like Transformer trained on "millions of authentic shopping sessions to predict the next token (i.e. singular product) in the sequence" where each token was an atomic product ID, "scores every product ID in its vocabulary against the current session and returns the top K products." The new model generates the SID of the next item instead of scoring all products. This is structurally the same shift Spotify (GLIDE, NEO) and YouTube (PLUM) made ("a generative paradigm has been adopted in production by companies such as Spotify (GLIDE, NEO) and YouTube (PLUM)"), applied to Instacart's grocery-distinctive intent shape.

  2. Three structural ceilings that scoring hit. The post frames each one as a structural problem of the scoring architecture, not a tuning problem:

  3. The vocabulary bottleneck. "The CR model relies on atomic product IDs as distinct tokens, which establishes the boundaries of what the model can interpret and predict. While expanding this vocabulary enhances the model's ability to understand the detailed context of a user's session, it simultaneously increases model size and latency while creating data sparsity for less common items. Additionally this catalog is non-stationary. As new products are added to the catalog, the coverage gap keeps expanding."

  4. The 'cold start' hurdle. "This occasionally caused it to memorize co-occurrences instead of learning generalized associations based on the user's intent. This resulted in the model favoring high-frequency items over newer products which are more aligned with the user's context." Concrete example given: "while a user is building a cart toward a summer barbecue [eg: ground beef, hamburger buns, lettuce], the previous system had a tendency to default to a generic grocery staple [eg: milk] rather than surfacing an emerging brand's condiment [eg: mustard] that fits the intent better."
  5. Structural drift. "The final candidate set from the model is generated by predicting a probability distribution across the entire vocabulary of product IDs. Without a built-in hierarchy to keep the recommendations focused, the model occasionally retrieves a disjointed mix of items. For example, a breakfast-themed cart [e.g., milk, eggs, cereal] may lead to laundry detergent being retrieved along with other valid recommendations [e.g., bread, muffins]."

  6. Semantic IDs replace atomic product IDs as the vocabulary. SIDs are "short sequences of codewords generated by an RQ-VAE". Example: "A product's SID looks like 35_7_120_184: four tokens from learned codebooks at different granularity levels." Three load-bearing properties:

  7. Coverage to every catalog item regardless of purchase history. "A new product entering the catalog is added to one of the existing SIDs and is visible to the model from day one." — directly addresses the cold-start hurdle.

  8. Generalisation over memorisation. "The model learns to generalize sequences better based on semantic codewords instead of simply learning specific product co-occurrences."
  9. Embedding-parameter compression. "The embedding parameter space within the model is decreased by 125x." Verbatim example showing prefix-sharing: 35_7_119_493 (Organic Good Seed Thin Sliced) / 35_7_120_184 (Artisanal Italian Bread) / 35_7_120_185 (Classic Italian Bread) — three semantically similar bread products share the first two codewords.
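To make the prefix-sharing property concrete, here is a minimal Python sketch of residual quantization, the coarse-to-fine step at the heart of an RQ-VAE. It is an illustration under stated assumptions, not Instacart's implementation: the hand-picked 2-D codebooks (CODEBOOKS) and the toy embeddings are hypothetical, whereas a real RQ-VAE learns its encoder and codebooks jointly. Only the three SIDs in the final lines come from the post.

```python
from itertools import takewhile

# Hypothetical, hand-picked codebooks ordered coarse -> fine.
# A real RQ-VAE learns these; the values here are for illustration only.
CODEBOOKS = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(-1.0, -1.0), (1.0, 1.0), (1.0, -1.0), (-1.0, 1.0)],
    [(-0.1, 0.2), (0.1, -0.2)],
]

def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def residual_quantize(vec, codebooks=CODEBOOKS):
    """Quantize level by level; each level encodes what earlier levels missed."""
    residual, sid = list(vec), []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        sid.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(sid)

def parse_sid(sid):
    """Split an underscore-joined SID such as '35_7_120_184' into codewords."""
    return tuple(int(part) for part in sid.split("_"))

def shared_prefix_len(a, b):
    """Number of leading codewords two SIDs have in common."""
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(a, b)))

# Toy embeddings: two nearby "bread" vectors and one distant "soap" vector.
bread_a = residual_quantize((10.9, 11.2))  # (1, 1, 0)
bread_b = residual_quantize((11.1, 10.8))  # (1, 1, 1)
soap = residual_quantize((0.5, -0.8))      # (0, 2, 0)
print(shared_prefix_len(bread_a, bread_b))  # 2
print(shared_prefix_len(bread_a, soap))     # 0

# SIDs quoted in the post.
print(shared_prefix_len(parse_sid("35_7_119_493"), parse_sid("35_7_120_184")))  # 2
print(shared_prefix_len(parse_sid("35_7_120_184"), parse_sid("35_7_120_185")))  # 3
```

Because the first codebook captures the coarsest distinction, nearby embeddings agree on their leading codewords and only diverge at the finer levels, which is why the bread products in the post's example share the 35_7 prefix.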

  10. The context template — a richer training corpus. Compressing the vocabulary frees up token budget for context. The prompt template (per the post): "A retailer type token tells the model which catalog and shopping context the user is shopping in. Because our marketplace retailers span grocery, pet, beauty, home goods, and more, this token helps us capture the distinction. User history SIDs from past purchases capture long-term preferences. By taking the top N previously purchased SIDs and expressing them in the same token format the model generates in, we seamlessly connect past behavior to future predictions. Cart SIDs capture the real-time intent of the current session. While user history tells the model what someone typically likes, the cart SIDs tells it what they are building today, adapting as new items are added." Architectural property: "The template structure also gives us a clean interface for future signals (such as occasion awareness, search queries, page type) without architectural changes. Each new signal is simply a new segment in the prompt."
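The three-segment template described above could be assembled roughly as follows. This is a sketch under assumptions: the post confirms the segments and the use of special tokens but does not name them, so the token strings (`<retailer_…>`, `<history>`, `<cart>`), the level-tagged codeword format, and the choice of purchase frequency as the "top N" criterion are all hypothetical.

```python
from collections import Counter

# Hypothetical special tokens; the post does not disclose the real ones.
HISTORY_TOKEN = "<history>"
CART_TOKEN = "<cart>"

def sid_tokens(sid):
    """Expand '35_7_120_184' into one level-tagged token per codeword (assumed format)."""
    return [f"<c{level}_{code}>" for level, code in enumerate(sid.split("_"))]

def build_prompt(retailer_type, purchase_history, cart, top_n=2):
    """Collate retailer type, top-N history SIDs, and cart SIDs into one token list."""
    # Assumption: "top N previously purchased SIDs" means most frequently purchased.
    top_history = [sid for sid, _ in Counter(purchase_history).most_common(top_n)]

    tokens = [f"<retailer_{retailer_type}>", HISTORY_TOKEN]
    for sid in top_history:
        tokens.extend(sid_tokens(sid))
    tokens.append(CART_TOKEN)
    for sid in cart:
        tokens.extend(sid_tokens(sid))
    return tokens

prompt = build_prompt(
    retailer_type="grocery",
    purchase_history=["35_7_120_184", "35_7_120_184", "12_3_4_5"],
    cart=["35_7_119_493"],
    top_n=1,
)
print(prompt)
```

Adding a future signal such as a search query would, in this sketch, amount to appending one more delimiter token and its segment, which mirrors the "each new signal is simply a new segment in the prompt" property quoted above.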

  11. Beam search at serving time, retailer-partitioned mapping at the end. Verbatim: "During serving, we build the candidate set via beam search. As illustrated in the diagram below, the decoder reads this input and generates recommendations token by token. At each step, beam search explores multiple promising paths for the next codeword. This ultimately yields several distinct, fully formed SID sequences. Finally, these generated sequences are mapped against a retailer-partitioned index to retrieve a diverse variety of relevant, available ad products." The retailer partition ensures only available, attributed ads for the current retailer's catalog are surfaced.
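A toy version of the beam-search step described above, in Python. Everything model-related is a stand-in: TOY_MODEL is a hypothetical lookup table of next-codeword probabilities keyed by prefix (the production system uses a trained decoder conditioned on the prompt), and the probabilities are invented for illustration. Some codewords reuse the SIDs quoted in the post purely so the output is recognizable.

```python
import math

SID_LENGTH = 4  # the post's example SIDs have four codewords

# Hypothetical next-codeword distributions keyed by the prefix generated so far.
TOY_MODEL = {
    (): {35: 0.7, 12: 0.3},
    (35,): {7: 0.9, 2: 0.1},
    (12,): {4: 1.0},
    (35, 7): {120: 0.6, 119: 0.4},
    (35, 2): {50: 1.0},
    (12, 4): {8: 1.0},
    (35, 7, 120): {184: 0.6, 185: 0.4},
    (35, 7, 119): {493: 1.0},
    (35, 2, 50): {1: 1.0},
    (12, 4, 8): {9: 1.0},
}

def next_log_probs(prefix):
    """Stand-in for one decoder step: log-probabilities of the next codeword."""
    return {code: math.log(p) for code, p in TOY_MODEL[prefix].items()}

def beam_search(beam_width, length=SID_LENGTH):
    """Return the beam_width highest-scoring complete SIDs with their log-probs."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(length):
        candidates = [
            (prefix + (code,), score + log_p)
            for prefix, score in beams
            for code, log_p in next_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the most promising paths
    return beams

for sid, score in beam_search(beam_width=2):
    print("_".join(map(str, sid)), round(math.exp(score), 4))
# 12_4_8_9 0.3
# 35_7_120_184 0.2268
```

Widening the beam to 3 in this toy also surfaces 35_7_119_493, which illustrates how beam width trades extra decoding work for a more varied candidate set.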

  12. Three scoring-architecture ceilings, each addressed by a structural property of the generative paradigm. Direct mapping from the post:

  13. Vocabulary bottleneck → fixed codebook size. "By generating sequences from a small, fixed set of codewords rather than scoring an ever-expanding list of product IDs, the scaling constraints of the vocabulary bottleneck disappear. The model constructs the semantic representation of the next item on the fly, avoiding the memory and latency penalties that previously restricted our catalog coverage."

  14. Structural drift → autoregressive prefix conditioning. "Generating auto regressively means each codeword is explicitly conditioned on the previous one. This enforces a strict hierarchy during retrieval. If the model begins generating a prefix for 'Produce,' the beam search remains confined to that semantic neighborhood, actively preventing the random outlier leakage caused by flat probability distributions."
  15. Cold start → semantic generalisation, not memorisation. SIDs give every new product a position in the codebook from day one without requiring transaction history.
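The hierarchy argument above corresponds to the standard chain-rule factorization of an autoregressive decoder. As a sketch in notation that is not taken from the post (x for the prompt, c_1 to c_4 for the four codewords of an SID, p for an atomic product ID, V for the product vocabulary):

```latex
P(c_1, c_2, c_3, c_4 \mid x) \;=\; \prod_{i=1}^{4} P(c_i \mid c_1, \dots, c_{i-1},\, x)
\qquad \text{versus} \qquad
P(p \mid x), \quad p \in V
```

Once the beam commits to a prefix (c_1, c_2), every later codeword is scored only in the context of that prefix, whereas the flat distribution on the right lets every product in V compete in a single step.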

  16. Beam width and temperature as exploration dials. "Unlike scoring models, the generative approach unlocks direct tuning mechanisms through beam width and temperature sampling. These serve as precise levers to balance intent and exploration — allowing us to dial up strict precision on search pages, while turning up brand diversity and discovery on post-checkout surfaces." The same model + index serves multiple surfaces with different precision/exploration trade-offs without retraining.
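How temperature acts as an exploration dial can be seen with a temperature-scaled softmax over next-codeword logits. The sketch below is generic Python rather than anything from Instacart's stack, and the logits are made-up values; the post does not disclose production temperatures or beam widths.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens, higher flattens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate codewords

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 2) for p in probs])
# At 0.5 the top codeword gets about 0.87 of the mass (precision-leaning);
# at 2.0 it gets about 0.51, leaving more room for alternatives (exploration-leaning).
```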

  17. GPU serving stack: TensorRT-LLM + Triton + Go-native + Griffin 2.0. Verbatim: "As autoregressive decoding with beam search is fairly compute intensive, it was not viable to serve this model the legacy serving stack that relied on Python and CPU inference. To unblock this model serving, the team developed a brand new GPU serving stack. This new system leverages TensorRT-LLM for high-performance inference and is deployed on Nvidia's Triton Inference Server. … Implemented as a Go-native service, it delivers higher throughput and lower latency compared to the legacy Python environment. It is fully integrated with Griffin 2.0, Instacart's machine learning serving platform." The serving path consists of three high-speed operations: (1) Input Translation: features are dynamically fetched and collated into the input prompt; (2) GPU Model Inference: beam-search SID generation; (3) Product Mapping and Indexing: SIDs are mapped via a retailer-partitioned index to active ad products, "ensuring that only relevant, available, and correctly attributed ads are retrieved."
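The third operation, product mapping, can be pictured as a two-level lookup: retailer first, then SID. The Python sketch below is a hypothetical in-memory illustration (the class name, method names, and IDs are invented); the production service is described as Go-native and its index layout is not disclosed.

```python
from collections import defaultdict

class RetailerPartitionedIndex:
    """Toy index: retailer_id -> SID -> list of active ad product IDs."""

    def __init__(self):
        self._partitions = defaultdict(lambda: defaultdict(list))

    def add(self, retailer_id, sid, ad_product_id):
        """Register an active ad product under a retailer and SID."""
        self._partitions[retailer_id][sid].append(ad_product_id)

    def lookup(self, retailer_id, sids):
        """Map generated SIDs to ad products, restricted to one retailer's partition."""
        partition = self._partitions.get(retailer_id, {})
        results, seen = [], set()
        for sid in sids:  # keep the order in which the SIDs were generated
            for ad_product_id in partition.get(sid, []):
                if ad_product_id not in seen:
                    seen.add(ad_product_id)
                    results.append(ad_product_id)
        return results

index = RetailerPartitionedIndex()
index.add("retailer_a", "35_7_120_184", "ad_1")
index.add("retailer_a", "35_7_120_184", "ad_2")  # several products can share one SID
index.add("retailer_b", "35_7_120_184", "ad_9")

generated = ["35_7_120_184", "99_9_9_9"]  # second SID has no active ads
print(index.lookup("retailer_a", generated))  # ['ad_1', 'ad_2']
print(index.lookup("retailer_b", generated))  # ['ad_9']
```

Partitioning by retailer means a generated SID with no active ad at the current retailer simply yields nothing, rather than leaking another retailer's products into the candidate set.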

  18. Headline operational results. Despite producing ~2× the candidate volume, mean retrieval latency decreased by 10–17%. In an online A/B test against the incumbent model: +5% click-through rate and +34% add-to-carts (the post calls this a "step-function increase"). Qualitative: "a customer purchasing pet food at a big box retailer now receives pet-specific recommendations instead of broader grocery suggestions", i.e. better alignment in tail categories such as pet care and beauty.

  19. Brand diversity is the load-bearing win. "By overcoming the limitations of a fixed token space, TIGER recommended 2.7x more brands and 1.8x more sub-categories than the previous system." The category-conditional gains are dramatic: "+421% in Alcohol, +396% in Beverages, and +229% in Healthcare. In these categories, the previous solution's architectural ceiling prevented these products from being retrieved." Stated thesis: "This unlocks new potential for Instacart's ads ecosystem, creating a valuable opportunity for emerging brands to drive growth by surfacing their products in highly contextual placements."

Architecture & numbers

| Datum | Value | Source / context |
|---|---|---|
| Prior model architecture | BERT-like Transformer scoring atomic product IDs | systems/instacart-contextual-recommendations (CR) |
| New model architecture | Autoregressive decoder generating SIDs token-by-token via beam search | systems/instacart-generative-ads-retrieval |
| Vocabulary unit | Atomic product IDs → Semantic IDs (codeword sequences) | RQ-VAE-learned codebook |
| Example SID | 35_7_120_184 | 4 tokens from learned codebooks at different granularity levels |
| SID prefix-sharing example | 35_7_119_493 / 35_7_120_184 / 35_7_120_185 | All bread products share 35_7_… prefix |
| Embedding param-space reduction | 125× | One of three SID-vocabulary benefits |
| Prompt template segments | retailer-type token + user-history SIDs + cart SIDs | Special tokens between segments |
| Inference primitive | Beam search at serve time | Multiple promising paths per token step |
| Mapping layer | Retailer-partitioned index | Constrains output to available, attributed ads |
| Tunable dials | Beam width + temperature sampling | Per-surface intent vs exploration |
| Serving substrate | TensorRT-LLM on NVIDIA Triton Inference Server | GPU inference stack |
| Service language | Go-native service | Replaces legacy Python+CPU stack |
| ML platform | Griffin 2.0 | Instacart's ML serving platform |
| Candidate volume change | ~2× more candidates | Larger candidate sets per request |
| Mean retrieval latency change | −10–17% | Despite 2× candidate volume |
| Click-through rate | +5% | A/B test vs incumbent |
| Add-to-carts | +34% | A/B test vs incumbent ("step-function increase") |
| Brand diversity | 2.7× more brands | Vs previous system |
| Sub-category diversity | 1.8× more sub-categories | Vs previous system |
| Alcohol diversity | +421% | Highest dense-category lift |
| Beverages diversity | +396% | Second-highest dense-category lift |
| Healthcare diversity | +229% | Third-highest dense-category lift |
| Surfaces launched on | Retailer home page + pre-checkout phase | Browse-not-search contexts |

Architectural primitives extracted

New systems

New concepts

New patterns

Caveats

  • No specific numbers on codebook size, beam width, or temperature. The post discloses that these are tunable but not the production values.
  • No latency percentiles disclosed beyond the headline mean. The 10–17% mean-latency reduction is given; tail (p99/p99.9) latency vs the prior system is not.
  • No QPS / cluster topology / GPU SKU disclosed. Griffin 2.0, TensorRT-LLM, Triton, Go-native are named; throughput envelope and GPU SKUs are not.
  • No training cost or data volume disclosed. "Millions of historical shopping sessions" is the only training-corpus framing; no token budget, training time, or accelerator count given.
  • Surface scope is intentionally narrow. Only retailer home page + pre-checkout phase shipped; the post acknowledges these are "contexts where users are browsing rather than searching, and candidate diversity & contextual relevance matter more than surgical precision." Search and post-checkout surfaces remain future work.
  • No comparison to alternative generative-retrieval shapes. TIGER is named as inspiration; Spotify GLIDE/NEO and YouTube PLUM are cited as production references; ActionPiece (Google DeepMind) is cited as a future direction. The post does not benchmark alternative codebook designs (e.g. per-vertical codebooks, learned vs fixed token boundaries).
  • No ablation on context-template segments. Retailer-type token + user history + cart SIDs are all present in the launched system; the marginal contribution of each segment is not disclosed.
  • No ranker-side changes disclosed. The post is exclusively about the candidate-generation (retrieval) stage. "If the subsequent ranking model was miscalibrated on these outlier products, these incoherent recommendations from the candidate set would eventually get bubbled up to the user" — the ranker is acknowledged as a failure mode of the prior CG but no ranker work is reported.
  • SID quality is acknowledged as the main lever for downstream improvement but no specific roadmap items have shipped: "Future improvements include multi-resolution codebooks, co-occurrence contrastive regularization, and incorporating dietary constraints into the initial codebook level."
  • No per-retailer / per-vertical breakdown of the headline lifts. +421% Alcohol / +396% Beverages / +229% Healthcare diversity is given, but absolute baselines are not, so a +421% lift (about 5.2× the baseline) could reflect either an architecturally constrained baseline of near zero or a meaningful baseline that was roughly quintupled.
  • Ad load and revenue impact not disclosed. +5% CTR and +34% add-to-carts are reported; ad-revenue lift, advertiser ROAS, and bid-density effects are not. (The companion Carrot Ads posts similarly omit ad-revenue lift in absolute terms.)
  • Companion SID post referenced but not consulted here. The post links a separate write-up, Semantic IDs: Product Understanding at Scale, and defers the full RQ-VAE design space to that companion. Future ingest of that source would extend systems/instacart-semantic-ids + systems/rq-vae + concepts/semantic-id with mechanism detail.
  • Brand-diversity gains may interact with advertiser-economics incentives — emerging brands' ads have systematically lower baseline volume, so a 2.7× brand-count metric on a pCTR-driven serving stack could reflect either a genuine intent-match improvement or a regression in the ranker's ability to suppress brand-spam. The post asserts CTR + ATC are up, suggesting it's the former, but no breakdown by emerging-brand vs established-brand click attribution is given.

Source
