---
title: From Scoring to Spelling: Rebuilding Ads Retrieval at Instacart
source: Instacart Engineering
source_slug: instacart
url: https://tech.instacart.com/from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart-cf36b4e8d1bb?source=rss----587883b5d2ee---4
published: 2026-06-02
fetched: 2026-06-03T14:01:19+00:00
ingested: true
---

# **From Scoring to Spelling: Rebuilding Ads Retrieval at Instacart**

**Key Contributors: Karuna Ahuja, Marko Avdalovic, Soroush Sobhkhiz, Shrikar Archak, Xiyu Wang, Ji Chao Zhang, Hao Yan**

## Introduction

Every time a user opens Instacart, they see product recommendations: on the retailer home page, in search results, and alongside their cart. Many of these recommendations are sponsored products surfaced by a retrieval model that decides which products to show from a vast ads product catalog. A relevant ad helps users discover products they didn’t know they needed; a less relevant one generates friction.

Two years ago, we [introduced Contextual Recommendations (CR)](/sequence-models-for-contextual-recommendations-at-instacart-93414a28e70c), a BERT-based sequence model powering retrieval for both ads and organic recommendations across all major browse surfaces. In this post, we focus on ads retrieval. We detail how we rebuilt the system by moving from an encoder that scores products to a generative model that spells them out, token by token. Doing so unlocked a new level of contextual matching, ensuring brands appear exactly when users want them, while opening up discovery of thousands of relevant products the previous system couldn’t retrieve.

## **Contextual Recommendations: A recap**

At its core, CR treats grocery shopping as a language modeling task, where atomic product IDs function as tokens and the finite subset of the catalog the model is trained on acts as its ‘vocabulary’.

The model leverages the user’s real-time session, which includes product views, item page visits, and cart additions, as a sequence of these product tokens. A BERT-like transformer is then trained on millions of authentic shopping sessions to predict the next token (i.e. singular product) in the sequence. This process allows the model to learn and capture complex purchasing patterns, such as the tendency for users who add pasta and olive oil to frequently add garlic next.

This single retrieval layer replaced multiple ad-hoc systems and powers recommendation carousels across all major browse surfaces, serving both ads and organic content. At inference time, it scores every product ID in its vocabulary against the current session and returns the top K products.
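
As a rough illustration of the score-everything-then-rank step described above, here is a minimal sketch. It assumes the session has already been encoded into a vector and that each product ID has an embedding; the toy vocabulary, the 2-d vectors, and the function name `score_all_and_top_k` are hypothetical and not taken from the production system.

```python
import heapq

def score_all_and_top_k(session_vec, product_embeddings, k):
    """Score every product ID in the vocabulary against the session, return the top K IDs."""
    scores = {
        pid: sum(s * e for s, e in zip(session_vec, emb))  # dot-product score
        for pid, emb in product_embeddings.items()
    }
    return heapq.nlargest(k, scores, key=scores.get)

# Hypothetical toy vocabulary of atomic product IDs with 2-d embeddings.
VOCAB = {
    "garlic": [0.9, 0.1],
    "milk": [0.2, 0.8],
    "detergent": [-0.5, 0.1],
}

# Stand-in for the encoder output of a "pasta + olive oil" session.
session = [1.0, 0.0]
top2 = score_all_and_top_k(session, VOCAB, k=2)
print(top2)
```

Note that the work per request grows with the size of the vocabulary, and a product that is not in `VOCAB` can never be returned.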

For the full technical details, see our [previous blog post](/sequence-models-for-contextual-recommendations-at-instacart-93414a28e70c).

## When Scoring Stops Scaling

Since launching CR, we have iterated on two fronts to improve the underlying model: expanding catalog coverage and adding more context.

First, we expanded the product vocabulary the model trains on, which increased retrieval coverage. Second, we added richer context through retailer awareness and long-term user personalization. Both upgrades led to meaningful gains in add-to-carts and ads coverage, particularly for specialty retailers and short shopping sessions.

Each of the above improvements operated within the same fundamental architecture: score every product in a candidate set, return the top K products. However, as the catalog grew and user shopping journeys became more diverse, this architecture presented three constraints that placed a ceiling on our discovery potential, especially for our ad recommendations:

**The vocabulary bottleneck:** The CR model relies on atomic product IDs as distinct tokens, which establishes the boundaries of what the model can interpret and predict. While expanding this vocabulary enhances the model’s ability to understand the detailed context of a user’s session, it simultaneously increases model size and latency while creating data sparsity for less common items. Additionally, the catalog is non-stationary: as new products are added, the coverage gap keeps expanding. Consequently, relying solely on vocabulary expansion proved insufficient for representing the full breadth of the catalog, as specialized products often remained outside the model’s recognizable token set.

**The ‘cold start’ hurdle:** To train this model, historical shopping sessions were represented as sequences of atomic product IDs. This occasionally caused the model to memorize co-occurrences instead of learning generalized associations based on the user’s intent. As a result, the model favored high-frequency items over newer products that are better aligned with the user’s context. For instance, while a user is building a cart toward a summer barbecue [e.g., ground beef, hamburger buns, lettuce], the previous system had a tendency to default to a generic grocery staple [e.g., milk] rather than surfacing an emerging brand’s condiment [e.g., mustard] that fits the intent better. This collaborative filtering approach, while effective as a baseline, often left the model insufficiently responsive to what the user is _actually doing right now_.

**The structural drift:** The final candidate set from the model is generated by predicting a probability distribution across the entire vocabulary of product IDs. Without a built-in hierarchy to keep the recommendations focused, the model occasionally retrieves a disjointed mix of items. For example, a breakfast-themed cart [e.g., milk, eggs, cereal] may lead to laundry detergent being retrieved along with other valid recommendations [e.g., bread, muffins]. If the subsequent ranking model was miscalibrated on these outlier products, these incoherent recommendations from the candidate set would eventually get bubbled up to the user next to a perfectly good set of recommendations.

These technical constraints ultimately limited how well we could connect users with the full breadth of our ads catalog, resulting in missing tail categories, narrower brand representation, and the occasional misaligned recommendation. We needed to rethink the product representation, the architecture, and the retrieval mechanism, not just add more features to the same model.

## Teaching the Model to Spell

Our new approach is inspired by [TIGER](https://papers.neurips.cc/paper_files/paper/2023/file/20dcab0f14046a5c6b02b61da9f13229-Paper-Conference.pdf) (Google DeepMind), a method that demonstrates a model’s ability to _generate_ the semantic tokens of the next relevant item, rather than merely scoring a predetermined set of candidates. This generative paradigm has been adopted in production by companies such as Spotify ([GLIDE](https://arxiv.org/abs/2603.17540), [NEO](https://arxiv.org/abs/2603.17533)) and YouTube ([PLUM](https://arxiv.org/abs/2510.07784)).

However, Instacart’s ad retrieval presents distinct challenges rooted in the unique nature of grocery shopping. Unlike platforms where the user’s intent is narrow, Instacart users often manage a highly diverse shopping list including items from fresh food to cleaning supplies and pet care — sometimes all within a single session. The user’s intent shifts mid-cart. Users shop across various retailers on our marketplace, each with a unique product catalog.

To address this, our model must look beyond historical purchases; it must also account for the real-time dynamics of the active shopping session. This is also where we have an opportunity to unlock new product discovery. Instead of just picking some atomic product IDs from a massive list, we needed a model that could look at the user’s cart, leverage its learned semantic concepts, and “autocomplete” the rest of the session. By generating an abstract concept rather than a specific item, the model shifts from memorization to generalization. This instantly connects the shopper’s intent to the full depth of our catalog, even for products with zero transaction history.

Building this for grocery required two things: a new product vocabulary, and a new way to use it.

## Instacart Semantic IDs: A new Product Vocabulary

Before we could build this new retrieval system, we needed to change how we represented products. This motivated us to invest in building [Instacart Semantic IDs](/semantic-ids-product-understanding-at-scale-5283e0288f5a).

[Instacart Semantic IDs](/semantic-ids-product-understanding-at-scale-5283e0288f5a) (SIDs) replace atomic product IDs with short sequences of codewords generated by an RQ-VAE (residual-quantized variational autoencoder). A product’s SID looks like `35_7_120_184`: four tokens from learned codebooks at different granularity levels. Semantically similar products share prefixes:
    
    
    35_7_119_493 → Organic Good Seed Thin Sliced  
    35_7_120_184 → Artisanal Italian Bread  
    35_7_120_185 → Classic Italian Bread

This essentially compresses the product vocabulary, as multiple very similar products are represented by a single SID. Leveraging this new representation in our retrieval model provided three major benefits:

  * SIDs provide coverage for every item in the catalog, regardless of whether it has any purchase history. A new product entering the catalog is assigned to one of the existing SIDs and is visible to the model from day one.
  * The model learns to generalize sequences better based on semantic codewords instead of simply learning specific product co-occurrences.
  * The embedding parameter space within the model is decreased by **125x**.
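
To make the prefix-sharing idea concrete, here is a minimal sketch of the residual quantization step that underlies RQ-VAE-style IDs: each level picks the codeword nearest to whatever the previous levels left unexplained. The two tiny hand-written codebooks and the 2-d product vectors are hypothetical (real codebooks are learned, and the SIDs above use four levels); the encoder and training loop are omitted entirely.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(codebook[i], vec))
    return min(range(len(codebook)), key=dist)

def residual_quantize(vec, codebooks):
    """Assign one codeword per level; each level quantizes the previous level's residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def shared_prefix_len(sid_a, sid_b):
    """Number of leading codewords two SIDs have in common."""
    n = 0
    for a, b in zip(sid_a, sid_b):
        if a != b:
            break
        n += 1
    return n

# Hypothetical 2-level codebooks over 2-d embeddings.
CODEBOOKS = [
    [[1.0, 0.0], [0.0, 1.0]],                 # level 0: coarse
    [[0.1, 0.0], [-0.1, 0.0], [0.0, 0.1]],    # level 1: fine
]

bread_a = residual_quantize([1.1, 0.0], CODEBOOKS)   # two nearby products...
bread_b = residual_quantize([0.9, 0.0], CODEBOOKS)
yogurt = residual_quantize([0.0, 1.1], CODEBOOKS)    # ...and a distant one
print(bread_a, bread_b, yogurt)

# The same prefix check applied to the SIDs quoted in this post.
italian_a = "35_7_120_184".split("_")
italian_b = "35_7_120_185".split("_")
good_seed = "35_7_119_493".split("_")
print(shared_prefix_len(italian_a, italian_b), shared_prefix_len(italian_a, good_seed))
```

In this toy setup the two nearby vectors agree on the coarse codeword and differ only at the fine level, mirroring how the two Italian breads above share three of four tokens while the sliced bread shares two.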


## **The Context Template: A new Training Corpus**

This new compact SID format also fundamentally changes how we construct our training data.

In the previous model, mapping atomic product IDs consumed the entire token capacity. By shifting to SIDs, we freed up the massive dictionary space, allowing us to design a richer input training corpus based on Instacart domain data. We achieved this by mining millions of historical shopping sessions and enriching them with new context tokens. We formulate the session into the following sequence template:

Each segment of this prompt serves a distinct role, and they are separated by special tokens:

  * A **retailer type** token tells the model which catalog and shopping context the user is shopping in. Because our marketplace retailers span grocery, pet, beauty, home goods, and more, this token helps us capture the distinction.
  * **User history** SIDs from past purchases capture long-term preferences. By taking the top N previously purchased SIDs and expressing them in the same token format the model generates in, we seamlessly connect past behavior to future predictions.
  * **Cart** SIDs capture the real-time intent of the current session. While user history tells the model what someone typically likes, the cart SIDs tell it what they are building _today_, adapting as new items are added.


During training, the model reads this template and learns to autoregressively generate the SID of the next item the user adds to their cart. The template structure also gives us a clean interface for future signals (such as occasion awareness, search queries, page type) without architectural changes. Each new signal is simply a new segment in the prompt.
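
The sketch below shows one way such a template could be assembled into a token sequence with a held-out next-item target. The exact template and special tokens are not given in the text, so everything here is an assumption: the marker strings (`[RETAILER]`, `[HISTORY]`, `[CART]`, `[NEXT]`), the level-prefixed token format, and the function names are all hypothetical.

```python
def sid_tokens(sid):
    """Split an SID such as '35_7_120_184' into one token per codebook level."""
    return [f"L{level}_{code}" for level, code in enumerate(sid.split("_"))]

def build_prompt(retailer_type, history_sids, cart_sids, max_history=2):
    """Assemble retailer type, top-N history SIDs, and cart SIDs into one sequence."""
    tokens = ["[RETAILER]", f"<{retailer_type}>", "[HISTORY]"]
    for sid in history_sids[:max_history]:  # assumed pre-sorted by purchase frequency
        tokens += sid_tokens(sid)
    tokens.append("[CART]")
    for sid in cart_sids:
        tokens += sid_tokens(sid)
    tokens.append("[NEXT]")  # the model generates the next SID after this marker
    return tokens

def training_example(retailer_type, history_sids, session_sids):
    """Hold out the last cart addition in a session as the generation target."""
    *cart, target = session_sids
    return build_prompt(retailer_type, history_sids, cart), sid_tokens(target)

prompt, target = training_example(
    "grocery",
    history_sids=["35_7_119_493"],
    session_sids=["35_7_120_184", "35_7_120_185"],
)
print(prompt)
print(target)
```

Under this layout, adding a future signal such as a search query would mean adding one more marked segment to `build_prompt`, leaving the model architecture untouched.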

## From Input to Candidates: A new Retrieval Paradigm

During serving, we build the candidate set via beam search. As illustrated in the diagram below, the decoder reads this input and generates recommendations token by token. At each step, beam search explores multiple promising paths for the next codeword. This ultimately yields several distinct, fully formed SID sequences. Finally, these generated sequences are mapped against a retailer-partitioned index to retrieve a diverse variety of relevant, available ad products.
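
A minimal sketch of beam search over codewords follows. The decoder is replaced by a hypothetical lookup table of conditional probabilities (`NEXT_PROBS`), and SIDs are shortened to three levels for brevity; neither the probabilities nor the function name come from the production system.

```python
import math

# Hypothetical stand-in for the decoder: P(next codeword | SID prefix so far).
NEXT_PROBS = {
    (): {35: 0.7, 12: 0.3},
    (35,): {7: 0.9, 2: 0.1},
    (12,): {4: 1.0},
    (35, 7): {120: 0.6, 119: 0.4},
    (35, 2): {8: 1.0},
    (12, 4): {50: 1.0},
}

def beam_search(next_probs, depth, beam_width):
    """Extend each beam by one codeword per step, keeping the best `beam_width` prefixes."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(depth):
        candidates = []
        for prefix, logp in beams:
            for code, p in next_probs.get(prefix, {}).items():
                candidates.append((prefix + (code,), logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

narrow = [sid for sid, _ in beam_search(NEXT_PROBS, depth=3, beam_width=2)]
wide = [sid for sid, _ in beam_search(NEXT_PROBS, depth=3, beam_width=3)]
print(narrow)
print(wide)
```

With a width of 2 the search returns one SID from each top-level branch; widening to 3 additionally recovers a sibling SID under the `35_7` prefix. Each returned tuple would then be looked up in the retailer-partitioned index to obtain actual ad products.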

This new paradigm, combined with the catalog coverage SIDs already provide, directly addresses the three limitations we hit with the previous solution:

  * **Eliminating the bottleneck:** By generating sequences from a small, fixed set of codewords rather than scoring an ever-expanding list of product IDs, the scaling constraints of the vocabulary bottleneck disappear. The model constructs the semantic representation of the next item on the fly, avoiding the memory and latency penalties that previously restricted our catalog coverage.
  * **Inherent structural coherence:** Generating autoregressively means each codeword is explicitly conditioned on the ones before it. This enforces a strict hierarchy during retrieval. If the model begins generating a prefix for “Produce,” the beam search remains confined to that semantic neighborhood, actively preventing the random outlier leakage caused by flat probability distributions.
  * **Dynamic diversity dials:** Unlike scoring models, the generative approach unlocks direct tuning mechanisms through beam width and temperature sampling. These serve as precise levers to balance intent and exploration — allowing us to dial up strict precision on search pages, while turning up brand diversity and discovery on post-checkout surfaces.
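
To illustrate how temperature acts as a diversity dial, the sketch below applies a standard temperature-scaled softmax to a set of next-codeword logits. The logit values and the two temperatures are hypothetical and chosen only to show the effect; the settings used per surface in production are not stated here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens, higher flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate codewords
sharp = softmax_with_temperature(logits, temperature=0.5)  # precision-leaning
flat = softmax_with_temperature(logits, temperature=2.0)   # exploration-leaning

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

At the lower temperature most of the probability mass sits on the top codeword, while the higher temperature shifts mass toward the lower-scoring alternatives, which is what makes sampled candidates more diverse.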


## Rebuilding Serving Infrastructure

Because autoregressive decoding with beam search is compute-intensive, it was not viable to serve this model on the legacy serving stack, which relied on Python and CPU inference. To unblock serving, the team developed a brand-new GPU serving stack. This new system leverages TensorRT-LLM for high-performance inference and is deployed on NVIDIA’s Triton Inference Server.

The new serving stack represents a fundamental shift in architecture. Implemented as a **Go-native service**, it delivers higher throughput and lower latency compared to the legacy Python environment. It is fully integrated with [Griffin 2.0](/introducing-griffin-2-0-instacarts-next-gen-ml-platform-b7331e73b8d7), Instacart’s machine learning serving platform, streamlining deployment and maintenance within our ecosystem.

## How the New Stack Works

This new serving stack performs a series of critical, high-speed operations to power the new ads retrieval system:

  * **Input Translation:** Features are dynamically fetched and collated to create the input prompt.
  * **GPU Model Inference:** The model runs inference and generates relevant SID sequences.
  * **Product Mapping and Indexing:** Finally, the generated SIDs are mapped back to active ad products via a specialized, highly efficient **retailer-partitioned index**, ensuring that only relevant, available, and correctly attributed ads are retrieved.
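
The final mapping step can be pictured as a two-level lookup: first by retailer, then by SID. The sketch below is written in Python for brevity even though the production service is Go-native, and the class name, method names, and IDs are all hypothetical.

```python
from collections import defaultdict

class RetailerPartitionedIndex:
    """Maps (retailer, SID) to the active ad products that carry that SID."""

    def __init__(self):
        self._index = defaultdict(lambda: defaultdict(list))

    def add(self, retailer_id, sid, ad_product_id):
        self._index[retailer_id][sid].append(ad_product_id)

    def lookup(self, retailer_id, sids, limit):
        """Return up to `limit` distinct products, visiting SIDs in the given order."""
        partition = self._index.get(retailer_id, {})
        results, seen = [], set()
        for sid in sids:  # assumed to arrive in beam-score order
            for product in partition.get(sid, []):
                if product not in seen:
                    seen.add(product)
                    results.append(product)
                    if len(results) == limit:
                        return results
        return results

index = RetailerPartitionedIndex()
index.add("retailer_a", "35_7_120_184", "ad_bread_1")
index.add("retailer_a", "35_7_120_184", "ad_bread_2")  # one SID can cover several products
index.add("retailer_b", "35_7_120_184", "ad_bread_3")
index.add("retailer_a", "12_4_50_9", "ad_soap_1")

generated = ["35_7_120_184", "99_9_9_9", "12_4_50_9"]  # includes an SID with no active ads
print(index.lookup("retailer_a", generated, limit=10))
print(index.lookup("retailer_b", generated, limit=10))
```

Partitioning by retailer means a generated SID only ever resolves to products that the current retailer actually carries, and SIDs with no active ads simply contribute nothing.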


Despite producing approximately **2x the candidate volume**, mean retrieval latency decreased by **10–17%**, validating the investment in the new GPU serving architecture.

## Measuring the Impact

We launched this system to power ads carousels on two discovery surfaces that bookend a user’s shopping journey: the retailer home page at the very beginning, and the pre-checkout phase just before the order is finalized. These are contexts where users are browsing rather than searching, and candidate diversity and contextual relevance matter more than surgical precision.

In our online A/B tests against the incumbent model, the generative approach delivered a **+5%** improvement in click-through rate. What was truly transformative, however, was the **+34%** step-function increase in add-to-carts. This indicates that we didn’t just capture more clicks; we significantly elevated the quality of that engagement, converting passive exploration into tangible downstream conversions.

While the quantitative metrics showcase the remarkable impact of this new solution, the qualitative improvements are just as striking. Here are a few ways we saw our recommendations evolve.

## **What the New Recommendations Look Like**

**Tail Category Alignment**

The generative architecture allows the model to generalize more effectively than its predecessor, resulting in deeper session understanding and recommendations that align more closely with a user’s active shopping trip. This improved performance is most evident in “tail” segments such as beauty and pet care. Previously, hard vocabulary constraints often filtered out these specialized items, causing the system to fall back on high-frequency, generic grocery staples.

For example, a customer purchasing pet food at a big box retailer now receives pet-specific recommendations instead of broader grocery suggestions. This hyper-relevant, session-aware targeting is a primary driver behind the engagement gains, as users are finally seeing ads that actually match the current intent of their cart. Here’s an anonymized example from real production logs.

**Improved Brand Diversity: The Most Impactful Outcome**

Building on this enhanced session understanding, the new architecture also unlocked a substantial boost in brand diversity by reaching deeper into our catalog. By overcoming the limitations of a fixed token space, our TIGER-inspired model recommended **2.7x** more brands and **1.8x** more sub-categories than the previous system. This demonstrates that the model goes beyond simply relying on the transaction history of well-known SKUs. Instead, it dynamically identifies and suggests emerging brands that align with a user’s specific intent, significantly enhancing product exploration.

We saw the most substantial diversity gains in highly dense categories, driving improvements of +421% in Alcohol, +396% in Beverages, and +229% in Healthcare. In these categories, the previous solution’s architectural ceiling prevented these products from being retrieved.

This unlocks new potential for Instacart’s ads ecosystem, creating a valuable opportunity for emerging brands to drive growth by surfacing their products in highly contextual placements.

## What’s Next

With Instacart Semantic IDs as the vocabulary and this new generation engine, we’ve successfully demonstrated the move from scoring a fixed candidate set to producing recommendations autoregressively from user context. Here’s where we go next.

**SID quality:** The quality of the codebook is fundamental to everything downstream, impacting retrieval precision, brand diversity, and coherence. Future improvements include multi-resolution codebooks, co-occurrence contrastive regularization, and incorporating dietary constraints into the initial codebook level. The full design space is covered in our companion post on [Semantic IDs](/semantic-ids-product-understanding-at-scale-5283e0288f5a).

**Richer context engineering:** Now that our template can seamlessly absorb new signals, we are focusing on feeding the model higher-order intent. By injecting real-time search queries, specific page contexts, and detected shopping occasions directly into the prompt, we can push the model to generate even more surgical, intent-driven recommendations.

**From retrieval model to discovery platform:** This initiative has given us a generative model operating over product tokens. The natural next step, and one that YouTube’s [PLUM](https://arxiv.org/abs/2510.07784) has validated at scale, is a multilingual model that reasons across both SID tokens and natural language. For us, this means learning from diverse grocery domain data like search queries, product descriptions, and shopping occasions alongside cart sequences. A model that can reason across both vocabularies opens up instruction-following interfaces, set-completion training, and richer user modeling. [ActionPiece](https://arxiv.org/abs/2502.13581) (Google DeepMind) has also shown that user _actions_, not just items, can be tokenized and generated in a context-aware manner, hinting at a future where the same architecture could power _what we show users next_: a reorder nudge, a recipe suggestion, or a discovery carousel.

The core architecture doesn’t change; the training recipe does. We are no longer just retrieving ads — we are building an AI that fluidly speaks the language of grocery, creating a richer, more intuitive discovery experience for users and advertisers alike.

We’re early in this journey, but the foundation is in place.

## Acknowledgements and Final Notes

We would like to extend deep gratitude to our cross-functional partners _Chang Zhang, Walter Tuholski, Trevor Yao, Cheng Jia, Joseph Haraldson, Nick Cooley, Tristan Fletcher_ who have provided critical ongoing design feedback and driven system integrations to bring this research to production.
