Instacart Generative Ads Retrieval

Definition

Instacart Generative Ads Retrieval is the candidate-generation (retrieval) stage of Instacart's ads platform on browse surfaces (the retailer home page and pre-checkout). It is an autoregressive Transformer decoder that generates the next recommended item token by token as a sequence of Semantic IDs (SIDs) via beam search. It replaces the prior BERT-based scoring model (CR), which predicted a probability distribution over the entire atomic-product-ID vocabulary.

Quote (Source: sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart):

"We rebuilt the system, by moving from an encoder that scores products to a generative model that spells them out, token by token."

The architecture is inspired by TIGER (Google DeepMind, NeurIPS 2023) and was adopted in the same generative-paradigm wave as Spotify GLIDE/NEO and YouTube PLUM.

Where it sits in the stack

            retailer home page / pre-checkout request
                                ▼
              ┌──────────────────────────────────────────┐
              │         Candidate Generator (CG)         │
              │  ┌────────────────────────────────────┐  │
              │  │  Input Translation                 │  │
              │  │  → context template prompt:        │  │
              │  │    [retailer-token]                │  │
              │  │    [user-history-SID-1...N]        │  │
              │  │    [cart-SID-1...M]                │  │
              │  └──────────────┬─────────────────────┘  │
              │                 ▼                        │
              │  ┌────────────────────────────────────┐  │
              │  │  GPU Model Inference               │  │
              │  │  → autoregressive decoder          │  │
              │  │  → beam search over codeword steps │  │
              │  │  → K distinct full SID sequences   │  │
              │  └──────────────┬─────────────────────┘  │
              │                 ▼                        │
              │  ┌────────────────────────────────────┐  │
              │  │  Product Mapping & Indexing        │  │
              │  │  → retailer-partitioned index      │  │
              │  │  → SIDs → available, attributed ads│  │
              │  └──────────────┬─────────────────────┘  │
              └─────────────────┬────────────────────────┘
                                ▼
                  downstream Carrot Ads ranker
                                ▼
                         user impression

The three serving operations

Per the Source page:

  1. Input Translation: "Features are dynamically fetched and collated to create the input prompt." The prompt template (patterns/context-template-prompt-with-special-tokens) is assembled from retailer-type token + top-N user-history SIDs + cart SIDs, with special tokens delimiting segments.
  2. GPU Model Inference: "The model runs inference and generates relevant SID sequences." Autoregressive decoder + beam search over codeword positions; produces K distinct fully-formed SID sequences per request.
  3. Product Mapping and Indexing: "The generated SIDs are mapped back to active ad products via a specialized, highly efficient retailer-partitioned index, ensuring that only relevant, available, and correctly attributed ads are retrieved."
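
Steps 1 and 3 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the special-token strings, the representation of an SID as a tuple of codeword strings, the `top_n` default, and the `RetailerPartitionedIndex` class are hypothetical, since the source does not disclose the actual token names or index implementation.

```python
from collections import defaultdict

# Hypothetical special tokens; the source does not name the real ones.
RETAILER, HISTORY, CART, NEXT = "<retailer>", "<history>", "<cart>", "<next>"


def build_prompt(retailer_type, history_sids, cart_sids, top_n=3):
    """Flatten request features into one token sequence with segment delimiters."""
    tokens = [RETAILER, retailer_type, HISTORY]
    for sid in history_sids[:top_n]:  # keep only the top-N history items
        tokens.extend(sid)
    tokens.append(CART)
    for sid in cart_sids:
        tokens.extend(sid)
    tokens.append(NEXT)  # the decoder would continue generating from here
    return tokens


class RetailerPartitionedIndex:
    """Maps a full SID to active ad products, one partition per retailer."""

    def __init__(self):
        self._partitions = defaultdict(lambda: defaultdict(list))

    def add(self, retailer_id, sid, product_id):
        self._partitions[retailer_id][tuple(sid)].append(product_id)

    def lookup(self, retailer_id, sids):
        # Only the requesting retailer's partition is consulted, so SIDs
        # without an active ad at that retailer simply return nothing.
        partition = self._partitions.get(retailer_id, {})
        results = []
        for sid in sids:
            results.extend(partition.get(tuple(sid), []))
        return results


index = RetailerPartitionedIndex()
index.add("retailer_1", ("a1", "b2", "c3"), "ad_oat_milk")
index.add("retailer_2", ("a1", "b2", "c3"), "ad_other_retailer")

prompt = build_prompt(
    "grocery",
    history_sids=[("a1", "b2", "c3"), ("a4", "b5", "c6")],
    cart_sids=[("a7", "b8", "c9")],
)
generated = [("a1", "b2", "c3"), ("a9", "b9", "c9")]  # stand-in for model output
ads = index.lookup("retailer_1", generated)
```

In this sketch the second generated SID has no active ad at `retailer_1`, so it is dropped at mapping time rather than at generation time, which is one plausible reading of "only relevant, available, and correctly attributed ads are retrieved."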

What changes between training and serving

Training is conventional next-token prediction: "During training, the model reads this template and learns to autoregressively generate the SID of the next item the user adds to their cart." The supervision target is the SID of the item the user actually added next, and the model learns to generate it conditioned on the prompt prefix.
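
As a rough illustration of how such a training example could be laid out, the sketch below concatenates prompt and target token IDs and shifts them by one position. Masking the prompt positions out of the loss is an assumption (the source does not say whether prompt tokens are supervised), and `make_training_example` and the integer IDs are hypothetical.

```python
# Label value that cross-entropy implementations commonly skip
# (for example, the default ignore_index in PyTorch's CrossEntropyLoss).
IGNORE = -100


def make_training_example(prompt_ids, target_sid_ids):
    """Build (inputs, labels) for next-token prediction on the target SID.

    Assumption: only the target SID codewords contribute to the loss;
    prompt positions are masked with IGNORE.
    """
    input_ids = list(prompt_ids) + list(target_sid_ids)
    labels = [IGNORE] * len(prompt_ids) + list(target_sid_ids)
    # Shift by one: the model input at position t is trained to predict
    # the label at position t + 1.
    return input_ids[:-1], labels[1:]


# Hypothetical token IDs: a 3-token prompt and a 3-codeword target SID.
inputs, labels = make_training_example([1, 2, 3], [10, 11, 12])
```

With this layout the last prompt token is the position that learns to emit the first codeword of the next item's SID, and each codeword position learns to emit the following one.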

Serving uses beam search rather than greedy decoding: "At each step, beam search explores multiple promising paths for the next codeword. This ultimately yields several distinct, fully formed SID sequences." Beam width and temperature are exposed as runtime knobs (concepts/diversity-via-beam-and-temperature) so the same model can be tuned for different surfaces — "strict precision on search pages, while turning up brand diversity and discovery on post-checkout surfaces."
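
The decoding loop can be sketched in plain Python over a toy scoring function. This is not Instacart's implementation: the toy logits, the codeword names, and the way temperature is applied (dividing logits before normalization) are assumptions for illustration only.

```python
import math


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax over a {token: logit} dict."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    log_z = peak + math.log(sum(math.exp(v - peak) for v in scaled.values()))
    return {tok: v - log_z for tok, v in scaled.items()}


def beam_search(next_logits, sid_length, beam_width, temperature=1.0):
    """Return up to beam_width (sid, log_prob) pairs, best first."""
    beams = [((), 0.0)]
    for _ in range(sid_length):  # one step per codeword position
        candidates = []
        for prefix, score in beams:
            step = log_softmax(next_logits(prefix), temperature)
            for tok, log_prob in step.items():
                candidates.append((prefix + (tok,), score + log_prob))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # keep the most promising paths
    return beams


# Toy stand-in for the decoder: logits for the next codeword given a prefix.
TOY_LOGITS = {
    (): {"dairy": 2.0, "snacks": 1.0},
    ("dairy",): {"milk": 1.5, "yogurt": 1.0},
    ("snacks",): {"chips": 1.0, "nuts": 0.5},
}


def toy_next_logits(prefix):
    return TOY_LOGITS[prefix]


for sid, score in beam_search(toy_next_logits, sid_length=2, beam_width=2):
    print(sid, round(score, 3))
```

In this toy setup a wider beam returns more distinct SIDs per request, and a higher temperature narrows the score gap between competing beams. Whether the production system applies temperature this way, or combines it with sampling, is not stated in the source.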

How it dissolves three structural ceilings of the prior CR

The prior CR model hit three structural ceilings, each addressed by a property of the generative paradigm (Source: sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart):

| Ceiling on CR | Why generative dissolves it |
| --- | --- |
| Vocabulary bottleneck: model size and latency grow with the catalog; data sparsity for tail items; non-stationary catalog widens the coverage gap | Fixed codebook size; "the model constructs the semantic representation of the next item on the fly, avoiding the memory and latency penalties that previously restricted our catalog coverage." |
| Cold-start hurdle: co-occurrence memorisation favours high-frequency items over intent-aligned newer products | SIDs share prefixes for semantically similar products; "a new product entering the catalog is added to one of the existing SIDs and is visible to the model from day one." |
| Structural drift: a flat probability distribution over atomic IDs occasionally retrieves a disjointed mix (laundry detergent in a breakfast cart) | Autoregressive prefix conditioning; "each codeword is explicitly conditioned on the previous one. This enforces a strict hierarchy during retrieval." |
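
The vocabulary and prefix-sharing arguments can be made concrete with a small arithmetic sketch. The numbers are purely hypothetical (the source does not disclose codebook size or SID length), and the assumption that the decoder's output vocabulary is one codeword set per level is also illustrative.

```python
def output_vocab_size(num_levels, codebook_size):
    """Decoder output vocabulary if each level has its own codeword set.

    Assumption for illustration; the real layout is not disclosed.
    """
    return num_levels * codebook_size


def addressable_sids(num_levels, codebook_size):
    """Number of distinct full SID sequences the codebook can spell."""
    return codebook_size ** num_levels


def shared_prefix_len(sid_a, sid_b):
    """Count leading codewords two SIDs have in common."""
    count = 0
    for a, b in zip(sid_a, sid_b):
        if a != b:
            break
        count += 1
    return count


# Hypothetical configuration: 3 codeword levels, 256 codewords per level.
LEVELS, CODEBOOK = 3, 256
vocab = output_vocab_size(LEVELS, CODEBOOK)
capacity = addressable_sids(LEVELS, CODEBOOK)

# Two hypothetical, semantically similar products that differ only in
# their last codeword.
overlap = shared_prefix_len(("c12", "c40", "c7"), ("c12", "c40", "c9"))
```

Under these assumed numbers the output layer stays at a few hundred entries while the number of spellable SIDs is in the millions, and it does not grow as products are added. The shared prefix is what lets a newly listed product inherit behaviour the model has already learned for its neighbours.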

Serving substrate

Per the Source: "As autoregressive decoding with beam search is fairly compute intensive, it was not viable to serve this model [on] the legacy serving stack that relied on Python and CPU inference. To unblock this model serving, the team developed a brand new GPU serving stack."

  • Inference engine: TensorRT-LLM — NVIDIA's high-performance LLM inference compiler.
  • Serving runtime: NVIDIA Triton Inference Server.
  • Service shell: Go-native service that "delivers higher throughput and lower latency compared to the legacy Python environment."
  • ML platform integration: fully integrated with Griffin 2.0, Instacart's ML serving platform.

The serving-substrate change — from Python+CPU to Go+GPU — is what made the order-of-magnitude-larger compute budget of autoregressive decoding economically viable; the post explicitly frames this as a prerequisite, not an afterthought.

Operational outcomes

| Metric | Value |
| --- | --- |
| Candidate volume | ~2× more candidates per request |
| Mean retrieval latency | 10–17% lower (despite 2× volume) |
| Click-through rate | +5% |
| Add-to-carts | +34% (the post calls this a "step-function increase") |
| Brand diversity in recommendations | 2.7× more brands |
| Sub-category diversity | 1.8× more sub-categories |
| Alcohol category diversity | +421% |
| Beverages category diversity | +396% |
| Healthcare category diversity | +229% |

Surfaces launched on

Two browse surfaces are explicitly named:

  • Retailer home page: the start of a shopping session
  • Pre-checkout phase: just before the order is finalized

Per the Source: "these are contexts where users are browsing rather than searching, and candidate diversity & contextual relevance matter more than surgical precision." The retailer home page maximises discovery; pre-checkout maximises basket-completion / brand-diversity exposure on the way to purchase.

Search and post-checkout surfaces are explicitly named as future candidates for the same model with different beam-width / temperature settings.

Composition with the rest of the Instacart ads stack

This system is the candidate generator in the Carrot Ads stack; the generated candidates feed the Carrot Ads pCTR ranker which scores them against the real-time auction. The pCTR ranker is unchanged by this work — the post is exclusively about retrieval — which means brand-diversity / cold-start gains have to compose with the existing pCTR scoring without ranker miscalibration.

Caveats

  • Codebook size, beam width, temperature settings not disclosed.
  • p99/p99.9 latency vs CR not disclosed.
  • GPU SKU / cluster topology / cost not disclosed.
  • Surfaces remain limited to two; search and post-checkout deferred.
  • Ranker-side (pCTR) changes not addressed.
  • "If the subsequent ranking model was miscalibrated on these outlier products, these incoherent recommendations from the candidate set would eventually get bubbled up to the user" — acknowledges ranker-CG calibration risk but no mitigation reported.
