---
title: Semantic IDs: Product Understanding at Scale
source: Instacart Engineering
source_slug: instacart
url: https://tech.instacart.com/semantic-ids-product-understanding-at-scale-5283e0288f5a?source=rss----587883b5d2ee---4
published: 2026-06-02
fetched: 2026-06-03T14:01:22+00:00
ingested: true
---

# **Semantic IDs: Product Understanding at Scale**


**Key Contributors:** _Shrikar Archak, Karuna Ahuja, Soroush Sobhkhiz, Marko Avdalovic, Xiyu Wang, JiChao Zhang, Hao Yan, Chris Hartley_

## Introduction

Operating a grocery catalog at Instacart’s scale means managing millions of products across thousands of categories. Every product is assigned to a category in our hierarchical taxonomy like “Dairy > Cheese > Parmesan”. These categories provide broad classification, but they miss the connections that drive how customers actually shop.

Consider a customer building a cheese board. They’ve added Parmigiano Reggiano, and now they need accompaniments. Our taxonomy puts it in “Dairy > Cheese > Parmesan,” so a category-based system can suggest other parmesan cheeses. But it can’t connect the customer to the Castelvetrano olives in “Pantry > Condiments > Olives,” the olive tapenade in “Deli > Olives Dips and Spreads,” or the crudité and pre-assembled cheese tray in “Deli > Prepared Meals > Party Trays.” These products live in completely different branches of the catalog, with no shared ancestor below “Food.” But any customer would tell you they belong together.

This cross-category blindness shows up in three ways.

**Cold start:** new products arrive with zero purchase history. We can assign them to the right category, but a category alone can’t connect them to the products customers would actually consider alongside them, so they stay invisible.

**Tail category coverage:** recommendation models learn from volume, so they skew toward popular grocery staples. Products in sparse categories lack the interaction data to surface, and the taxonomy gives the model no bridge to related items in other branches.

**Catalog quality at scale:** with millions of products, mislabeling is inevitable — a protein bar filed under “Candy,” a sparkling water under “Soda.” A rigid tree has no way to flag these because the only signal is the label itself.

In this post, we walk through how we built semantic IDs at Instacart to address these problems: the embedding choices, the contrastive training approach that leverages our catalog structure, the two-flavor strategy for precision vs. discovery, and what we learned when things didn’t work.

## What a Semantic ID Looks Like

A semantic ID is a short sequence of integers generated by compressing a product’s embedding through a residual vector quantizer. Products with similar meaning share prefixes; products that differ split at progressively finer levels.


RQ-VAE Architecture

Source: Recommender Systems with Generative Retrieval ([TIGER](https://papers.neurips.cc/paper_files/paper/2023/file/20dcab0f14046a5c6b02b61da9f13229-Paper-Conference.pdf))

Here’s what this looks like in practice. Under semantic ID prefix 6_19, our system groups:
    
    
    6_19_32 → Italian cheeses (Parmigiano, Pecorino, Mozzarella, Ricotta)  
    6_19_24 → Specialty cheeses (Brie, Manchego, Halloumi, Goat cheese)  
    6_19_12 → Olives (Castelvetrano, Kalamata, olive medleys)  
    6_19_7  → Tapenades (olive tapenade, spreads)  
    6_19_9  → Deli trays and dips (crudité trays, cheese dips)  
    6_19_14 → Croutons

No one wrote a rule connecting Pecorino Romano to Kalamata olives to olive tapenade. The model learned that these products inhabit the same culinary universe, spanning Dairy, Pantry, and Deli departments, by compressing their embeddings into codes that share a prefix.

Zooming into one branch shows how the hierarchy captures finer distinctions:
    
    
    6_19_32_4  → Fresh Mozzarella, Mozzarella Bars  
    6_19_32_16 → Crumbled Gorgonzola, Blue Cheese Crumbles  
    6_19_32_63 → Hard Italian cheeses (Parmigiano, Pecorino, Asiago)  
    6_19_32_70 → Ricotta Salata

Same first three levels: these are all Italian cheeses. The fourth level captures the functional distinction between fresh, crumbled, hard aged, and ricotta. A customer out of Pecorino Romano might accept Parmigiano Reggiano (same L4, group 63) before reaching for Gorgonzola crumbles (different L4, 16, but same L3).
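The prefix comparison described above can be sketched in a few lines. This is a minimal illustration, assuming semantic IDs are stored as underscore-separated strings as they are written in this post; the helper name `shared_depth` is hypothetical.

```python
def shared_depth(id_a: str, id_b: str) -> int:
    """Count how many leading levels two semantic IDs have in common."""
    depth = 0
    for code_a, code_b in zip(id_a.split("_"), id_b.split("_")):
        if code_a != code_b:
            break
        depth += 1
    return depth


hard_italian = "6_19_32_63"   # Parmigiano, Pecorino, Asiago
gorgonzola = "6_19_32_16"     # crumbled Gorgonzola, blue cheese crumbles
olives = "6_19_12"            # olives, shown at three levels in the listing above

print(shared_depth(hard_italian, hard_italian))  # 4
print(shared_depth(hard_italian, gorgonzola))    # 3
print(shared_depth(hard_italian, olives))        # 2
```

A deeper shared prefix means a closer relationship, so a substitution surface could prefer candidates at depth 4 and fall back to depth 3 only when nothing closer is available.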

## The Research Landscape

The idea of discrete learned codes for retrieval isn’t new. Google DeepMind’s [TIGER](https://papers.neurips.cc/paper_files/paper/2023/file/20dcab0f14046a5c6b02b61da9f13229-Paper-Conference.pdf) introduced semantic IDs for generative recommendation. YouTube’s [PLUM](https://arxiv.org/pdf/2510.07784) extended this to production scale with behavior-aligned codebooks. [Mender](https://arxiv.org/pdf/2412.08604) explored mixed semantic enhancement, and [BBQRec](https://arxiv.org/abs/2504.06636) showed how multi-modal signals can inform quantization.

Grocery is a different domain: users fill multi-category shopping lists in a single session, the catalog spans millions of products, and the taxonomy structure we already maintain gives us a supervision signal that media recommendation systems typically don’t have.

## From Embeddings to Codes

Semantic IDs are built on top of product embeddings, high-dimensional vectors where similar products end up nearby in vector space. (For more on our embeddings, see [How Instacart Uses Embeddings to Improve Search Relevance](/how-instacart-uses-embeddings-to-improve-search-relevance-e569839c3c36).)

With millions of products, raw embeddings present practical challenges: a large memory footprint, expensive nearest-neighbor search, and incompatibility with discrete token-based systems like LLMs. We compress them using a residual-quantized variational autoencoder (RQ-VAE), which learns a hierarchical codebook by iteratively quantizing residuals. Each level captures finer distinctions than the last. Our 4-level setup produces IDs like 6_19_32_63, where Level 1 might separate beverages from cleaning supplies and Level 4 distinguishes between brands of hard Italian cheese.
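To make "iteratively quantizing residuals" concrete, here is a minimal sketch of the encoding step only. It assumes the codebooks already exist; the tiny 2-D codebooks below are hand-written for illustration, whereas in an RQ-VAE they are learned jointly with an encoder and decoder. The function names are hypothetical.

```python
import math


def nearest(codebook, vector):
    """Return the index of the codeword closest to `vector` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(codebook[i], vector))


def residual_quantize(vector, codebooks):
    """Encode `vector` as one code per level, quantizing the leftover each time."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        codes.append(index)
        # Subtract the chosen codeword; the next level only sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes, residual


# Toy codebooks (assumption): coarse at level 1, finer at each later level.
CODEBOOKS = [
    [[0.0, 0.0], [4.0, 4.0]],
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.25]],
]

codes, residual = residual_quantize([5.2, 4.3], CODEBOOKS)
print("_".join(str(c) for c in codes))  # 1_1_1

nearby_codes, _ = residual_quantize([5.0, 4.0], CODEBOOKS)
print("_".join(str(c) for c in nearby_codes))  # 1_1_0
```

The two nearby vectors share the prefix `1_1` and split only at the last level, which is the same behavior the cheese example shows at catalog scale.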

But compression alone isn’t enough. A vanilla RQ-VAE optimizes for reconstruction fidelity. It has no notion of product relationships.

## Teaching the Quantizer What “Similar” Means

Without structural guidance, the quantizer produces two problems: **fragmentation** (two marinara sauces that any customer would consider substitutes end up in different branches) and **error propagation** (a product with sparse or inconsistent details, category, and description gets embedded poorly and placed among irrelevant items). The embeddings themselves are generated from a product’s text (its name, brand, attributes, size description, and category path) using our in-house ESCI (Exact, Substitute, Complementary, Irrelevant) model, which learns representations from search relevance data.

## Contrastive Regularization with Catalog Structure

Inspired by PLUM’s behavioral alignment approach, we added a contrastive term to RQ-VAE training, using our catalog taxonomy as the supervision signal rather than engagement data (which isn’t available for cold-start products). A contrastive loss works by pulling similar items closer in the learned code space and pushing dissimilar items apart.

Rather than binary same/different labels, we define relatedness along a gradient based on where two products sit in the catalog tree. Two products in the same leaf category (say, two marinara sauces from different brands) are strong positives because they share the most specific category. Products in sibling categories (marinara and alfredo, both under “Pasta & Pizza Sauces”) are moderate positives because they share a parent but not a leaf. Products with no shared ancestor (“Pasta Sauce” vs “Office Supplies”) are negatives. The signal isn’t relative to any single product; it’s defined by the structural distance between any pair in the taxonomy.
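The graded labels above can be expressed as a small function over category paths. This is a sketch under stated assumptions: paths are tuples that start below the catalog root, the example paths are hypothetical, and the `unspecified` branch (a shared ancestor higher than the parent) is a placeholder because the post does not say how such pairs are treated.

```python
def relatedness(path_a, path_b):
    """Grade a product pair by where their category paths meet in the tree."""
    if path_a == path_b:
        return "strong_positive"    # same leaf category
    same_parent = (
        len(path_a) == len(path_b)
        and len(path_a) > 1
        and path_a[:-1] == path_b[:-1]
    )
    if same_parent:
        return "moderate_positive"  # sibling leaves under one parent
    if path_a[0] != path_b[0]:
        return "negative"           # no shared ancestor
    return "unspecified"            # assumption: not covered by the description


marinara = ("Pantry", "Pasta & Pizza Sauces", "Marinara")
alfredo = ("Pantry", "Pasta & Pizza Sauces", "Alfredo")
pens = ("Office Supplies", "Pens")

print(relatedness(marinara, marinara))  # strong_positive
print(relatedness(marinara, alfredo))   # moderate_positive
print(relatedness(marinara, pens))      # negative
```

Because the label depends only on the two paths, it can be computed for any pair in a batch without a lookup table of hand-labeled pairs.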


Hierarchical Sampling

## Hierarchical Batch Sampling

The contrastive loss only works if each training batch contains both related and unrelated products. With random sampling over millions of items, most batches would be entirely unrelated — the loss would have no positive signal to learn from.

We fix this by constructing batches deliberately. First, we pick a random parent category (say, “Pasta & Pizza Sauces”). We fill roughly half the batch with products from its child categories — marinara, alfredo, pesto — so the batch naturally contains sibling pairs. We fill the other half with products from unrelated categories (laundry detergent, dog food) to provide hard negatives. Within each category slot, we sample multiple products, so same-leaf pairs (two marinara sauces from different brands) appear automatically. No explicit pair labeling is needed — the catalog structure does the work.
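The batch construction just described might look roughly like the following. It is a simplified sketch: the `CATALOG` dictionary, the `sample_batch` function, and the `per_leaf` parameter are all hypothetical, and the sketch allows the same product to be drawn twice if a leaf is picked more than once, which a real sampler would likely prevent.

```python
import random

# Hypothetical mini-catalog: parent category -> leaf category -> product names.
CATALOG = {
    "Pasta & Pizza Sauces": {
        "Marinara": ["marinara A", "marinara B", "marinara C"],
        "Alfredo": ["alfredo A", "alfredo B"],
        "Pesto": ["pesto A", "pesto B"],
    },
    "Laundry": {"Detergent": ["detergent A", "detergent B", "detergent C"]},
    "Pet Food": {"Dog Food": ["dog food A", "dog food B", "dog food C"]},
}


def sample_batch(catalog, batch_size, per_leaf, rng):
    """Return (product, parent, leaf) rows: half from one parent, half from others."""
    anchor_parent = rng.choice(sorted(catalog))
    other_parents = [p for p in sorted(catalog) if p != anchor_parent]

    def draw(parents, count):
        rows = []
        while len(rows) < count:
            parent = rng.choice(parents)
            leaf = rng.choice(sorted(catalog[parent]))
            products = catalog[parent][leaf]
            # Several products per leaf slot, so same-leaf pairs appear naturally.
            take = min(per_leaf, len(products), count - len(rows))
            rows += [(name, parent, leaf) for name in rng.sample(products, take)]
        return rows

    half = batch_size // 2
    return draw([anchor_parent], half) + draw(other_parents, batch_size - half)


batch = sample_batch(CATALOG, batch_size=8, per_leaf=2, rng=random.Random(0))
for name, parent, leaf in batch:
    print(f"{parent} > {leaf}: {name}")
```

The first half of the batch supplies same-leaf and sibling pairs, and the second half supplies the unrelated products, so every batch carries both positive and negative signal.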

## The Loss
    
    
    L_total = L_reconstruction + L_rq + λ · L_contrastive

The loss aligns embedding similarity with codebook index similarity across all four levels. Coarser levels (L1, L2) are weighted more heavily so broad groupings take priority. With λ = 0.01, the contrastive term is a gentle regularizer: strong enough to improve coherence, weak enough not to destabilize reconstruction.
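A quick arithmetic sketch shows how small the contrastive contribution is at λ = 0.01. The per-level weights and the loss values below are made-up numbers chosen only to illustrate the weighting; the post does not publish the actual weights or the exact form of the per-level term.

```python
# Hypothetical per-level weights: coarser levels (L1, L2) count more.
LEVEL_WEIGHTS = (0.4, 0.3, 0.2, 0.1)


def contrastive_term(per_level_losses, weights=LEVEL_WEIGHTS):
    """Combine per-level contrastive losses into one weighted value."""
    return sum(w * loss for w, loss in zip(weights, per_level_losses))


def total_loss(reconstruction, rq, contrastive, lam=0.01):
    """L_total = L_reconstruction + L_rq + lambda * L_contrastive."""
    return reconstruction + rq + lam * contrastive


contrastive = contrastive_term([2.0, 2.0, 2.0, 2.0])
print(round(contrastive, 6))                          # 2.0
print(round(total_loss(0.40, 0.10, contrastive), 6))  # 0.52
```

With these illustrative numbers the contrastive term adds 0.02 to a total of 0.52, which matches the description of a gentle regularizer rather than a dominant objective.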


**What’s next:** incorporating engagement-based signals (substitution patterns, co-purchase data) following PLUM’s approach.

## Two Flavors: Precision vs. Discovery

The two approaches differ starting at the input.

**ESCI (precision)** embeds raw product text (name, brand, description, size, some attributes and categories) through our search relevance model, which was trained on query-product matching. The result: embeddings tuned for “is this the same thing the customer asked for?” This produces tight clusters where every item is a direct substitute, like Whole Bean Coffee (0_8_55_72), where each product is a medium roast from a different brand, interchangeable for any customer who wants whole bean coffee. ESCI powers substitution, search, and reordering.

**ESCI+Gemma (discovery)** takes a different path. It first runs the product through Gemini Flash (~10x faster, ~5x cheaper than full-size models) to extract structured attributes (product type, key ingredients, dietary tags, format), stripping away marketing copy along with the metadata used for ESCI. It then embeds that cleaned representation with Gemma, an off-the-shelf embedding model. The goal is to test whether a general-purpose model, given cleaner inputs, can capture nuances that a domain-specific model misses. The result: broader clusters that capture lifestyle and usage patterns. ESCI+Gemma powers homepage feeds, cross-selling, and exploration.


Neither is universally better. The key is matching the right flavor to the right surface.

## How We Know It Works

We measure semantic ID quality directly rather than relying solely on downstream metrics.


Sushi Roll Semantic ID Cluster

**Similarity-depth correlation.** We measure the relationship between embedding similarity and the number of semantic ID levels two products share. Correlations of 0.69–0.84 confirm the semantic ID hierarchy captures meaningful structure. Among highly similar pairs (≥0.9 cosine similarity), 98–99% share Level 1, declining to 18–37% at Level 4. This is expected, since Level 4 distinguishes between very similar products.
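A minimal sketch of this measurement follows. It assumes Pearson correlation (the post does not say which correlation coefficient is used), and the four toy products with 2-D embeddings are invented for illustration; `0_8_55_70` in particular is a made-up ID.

```python
import math
from itertools import combinations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def shared_levels(id_a, id_b):
    depth = 0
    for a, b in zip(id_a.split("_"), id_b.split("_")):
        if a != b:
            break
        depth += 1
    return depth


def pearson(xs, ys):
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def share_rate(pairs, level, min_similarity=0.9):
    """Among pairs at or above the similarity threshold, the fraction sharing `level` codes."""
    depths = [depth for sim, depth in pairs if sim >= min_similarity]
    return sum(depth >= level for depth in depths) / len(depths)


# Toy data (assumption): (embedding, semantic ID) per product.
PRODUCTS = [
    ([1.0, 0.0], "6_19_32_63"),
    ([0.9, 0.1], "6_19_32_16"),
    ([0.0, 1.0], "0_8_55_72"),
    ([0.1, 0.9], "0_8_55_70"),
]
PAIRS = [
    (cosine(emb_a, emb_b), shared_levels(id_a, id_b))
    for (emb_a, id_a), (emb_b, id_b) in combinations(PRODUCTS, 2)
]

correlation = pearson([sim for sim, _ in PAIRS], [depth for _, depth in PAIRS])
print(f"similarity-depth correlation: {correlation:.2f}")
print(f"high-similarity pairs sharing Level 1: {share_rate(PAIRS, 1):.0%}")
print(f"high-similarity pairs sharing Level 4: {share_rate(PAIRS, 4):.0%}")
```

At catalog scale the pair set would be sampled rather than enumerated, since the number of pairs grows quadratically with the number of products.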

**LLM-based cluster evaluation.** Quantitative metrics tell us whether the hierarchy is structurally sound, but not whether the clusters make functional sense. To assess that, we prompt LLMs to look at each leaf group and score it on three dimensions: functional coherence (do these products serve similar purposes?), purchase likelihood (would a customer buy these together?), and customer journey relevance (do they fit the same shopping context?). This gives us a scalable proxy for human judgment across thousands of clusters. ESCI scores higher on substitutability; ESCI+Gemma excels at thematic coherence, matching their intended use cases.

**Taxonomy alignment.** We check whether products sharing a Level 1 code also share a top-level category. Most do, and the misalignments turned out to be more valuable than the alignments.

## Where It Breaks, and What That Tells Us

**When sparse text produces divergent codes.** Two Riesling wines (0_19_52_63 and 0_31_52_88) share the same category path and 0.86 cosine similarity, yet diverge at L2 due to sparse descriptions. A team-branded t-shirt (1_19_21_20) and generic team apparel (1_7_41_59) had 0.95 similarity but matched only at L1: one has a detailed description, the other just four words.

The pattern: sparse or inconsistent text leads to degraded embeddings, which lead to divergent codes. Products with rich descriptions and complete catalog metadata produce more stable codes. Products with minimal text give the embedding model little to work with, and the quantizer faithfully compresses that noise into divergent codes. Enriching product data for these sparse items is an ongoing effort.

**When the system is right and the category is wrong.** Sometimes a product’s semantic ID disagrees with its taxonomy label. A “Protein Bar” labeled under “Candy” clusters with other protein bars in “Sports Nutrition.” A “Sparkling Water” filed under “Soda” lands among other sparkling waters. In each case, the semantic ID placed the product where it functionally belongs. The error was in the taxonomy, not the code.

This turns semantic IDs into an automated catalog audit. Any product whose cluster assignment disagrees with its category label is a candidate for correction. We’re building this into a pipeline: automated flagging of code-vs-label mismatches, confidence scoring for how strongly a product fits its cluster versus its label, and prioritized review queues for human verification. What started as a recommendation primitive is becoming infrastructure for ongoing catalog health.
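One simple way to flag code-vs-label mismatches is a majority vote inside each leaf cluster. The sketch below is an assumption about how such a check could work, not a description of the pipeline being built: the function name, the `min_share` threshold, the use of majority share as a stand-in confidence score, and the example IDs and products are all hypothetical.

```python
from collections import Counter, defaultdict


def flag_mismatches(products, min_share=0.6):
    """Flag products whose category differs from their cluster's majority category.

    `products` is a list of (name, semantic_id, category) tuples. Each flagged row is
    (name, current_category, suggested_category, majority_share).
    """
    clusters = defaultdict(list)
    for name, semantic_id, category in products:
        clusters[semantic_id].append((name, category))

    flagged = []
    for members in clusters.values():
        counts = Counter(category for _, category in members)
        majority, majority_count = counts.most_common(1)[0]
        share = majority_count / len(members)
        if share < min_share:
            continue  # no clear consensus in this cluster, so make no suggestion
        for name, category in members:
            if category != majority:
                flagged.append((name, category, majority, share))
    # Highest-consensus clusters first, as a crude review priority.
    return sorted(flagged, key=lambda row: -row[3])


CATALOG_ROWS = [
    ("protein bar A", "3_4_5_6", "Sports Nutrition"),
    ("protein bar B", "3_4_5_6", "Sports Nutrition"),
    ("protein bar C", "3_4_5_6", "Sports Nutrition"),
    ("protein bar D", "3_4_5_6", "Candy"),
    ("cola A", "2_1_1_1", "Soda"),
    ("cola B", "2_1_1_1", "Soda"),
]

for row in flag_mismatches(CATALOG_ROWS):
    print(row)  # ('protein bar D', 'Candy', 'Sports Nutrition', 0.75)
```

Flagged rows are candidates for human review rather than automatic corrections, since a cluster can also be wrong when the underlying product text is sparse.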

## What Semantic IDs Unlock

Building this system taught us a few things that generalize beyond our specific implementation:

**The embedding is the decision.** The RQ-VAE compresses whatever structure the embedding space gives it. Choose your embedding based on the business problem.

**Catalog structure is a free supervision signal.** PLUM showed behavioral signals are powerful; we showed catalog structure gets you surprisingly far even before you have behavioral data.

**Standardize before you embed.** Lightweight attribute extraction is a high-ROI preprocessing step that reduces noise throughout the pipeline.

**Evaluate codes directly.** Downstream metrics can mask systematic quality problems. Intrinsic evaluation catches issues before they compound.

## From Vocabulary to Language

With ~2,000 codeword tokens representing the entire catalog, generative retrieval becomes possible: a model that _produces_ the semantic ID of the next relevant product, codeword by codeword, conditioned on the user’s context.
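Producing an ID codeword by codeword can be illustrated with a prefix trie that restricts each step to codes that extend a real product ID. This is a toy sketch: `toy_score` is a hypothetical stand-in for a trained sequence model conditioned on user context, and greedy selection is used for brevity where a production system would more likely use beam search.

```python
def build_trie(semantic_ids):
    """Nest dictionaries so each path from the root spells one valid semantic ID."""
    trie = {}
    for semantic_id in semantic_ids:
        node = trie
        for code in semantic_id.split("_"):
            node = node.setdefault(code, {})
    return trie


def greedy_decode(trie, score_fn):
    """Pick one codeword per level, only among codes that extend a valid ID."""
    prefix = []
    node = trie
    while node:
        best = max(sorted(node), key=lambda code: score_fn(prefix, code))
        prefix.append(best)
        node = node[best]
    return "_".join(prefix)


def toy_score(prefix, code):
    """Hypothetical scorer: prefers code 6 at level 1 and code 63 at level 4."""
    preferred = {0: "6", 3: "63"}
    return 1.0 if preferred.get(len(prefix)) == code else 0.0


VALID_IDS = ["6_19_32_4", "6_19_32_16", "6_19_32_63", "6_19_32_70", "0_8_55_72"]
trie = build_trie(VALID_IDS)
print(greedy_decode(trie, toy_score))  # 6_19_32_63
```

Constraining each step to the trie guarantees the output is an ID that exists in the catalog, which matters because most random combinations of codewords do not correspond to any product.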

We first proved this on product carousels, where the generative approach delivered **+34% add-to-carts** and surfaced products from **2.7x more emerging brands**. Tail categories saw the largest gains, precisely because semantic IDs gave those products a representation the old model couldn’t.

Those carousel results were the starting point. Semantic IDs now power product retrieval, replacement recommendations, and next-item prediction across Instacart. Looking ahead, we’re bringing them to product detail page recommendations, cart assistant suggestions, and ranking features, particularly to address cold start where they have the most leverage.

The broader lesson: semantic IDs started as a compression technique for making embeddings compatible with discrete systems. They became something more, a shared vocabulary that lets every model in our stack reason about product relationships in the same language. The more surfaces that speak this vocabulary, the more value each one gets from it.

## Acknowledgements and Final Notes

We would like to extend deep gratitude to our cross-functional partners, _Trace Levinson, Pradeep Karaturi, Shishir Prasad, Tristan Fletcher, Vinesh Gudla, Raochuan Fan, Xiao Xiao, Prakash Putta_, who provided critical ongoing design feedback and drove the system integrations that brought this research to production.

**References**

  * [TIGER](https://papers.neurips.cc/paper_files/paper/2023/file/20dcab0f14046a5c6b02b61da9f13229-Paper-Conference.pdf): Generative Retrieval for Recommendations (Google DeepMind)
  * [PLUM](https://arxiv.org/pdf/2510.07784): Pre-trained Language Models for Industrial-scale Generative Recommendations (YouTube)
  * [Mender](https://arxiv.org/pdf/2412.08604): Generative Recommendation with Mixed Semantic Enhancement
  * [BBQRec](https://arxiv.org/abs/2504.06636): Behavior-Bind Quantization for Multi-Modal Sequential Recommendation
  * How Instacart Uses Embeddings to Improve Search Relevance ([tech.instacart.com](http://tech.instacart.com))
  * Eugene Yan: [Semantic IDs](https://eugeneyan.com/writing/semantic-ids/)