INSTACART Tier 2

Instacart — Semantic IDs: Product Understanding at Scale¶

Summary¶

Instacart's catalog-ML team (Shrikar Archak, Karuna Ahuja, Soroush Sobhkhiz, Marko Avdalovic, Xiyu Wang, JiChao Zhang, Hao Yan, Chris Hartley) discloses how Instacart Semantic IDs (SIDs) are generated: an RQ-VAE trained with a contrastive regularization term that uses Instacart's catalog taxonomy as a graded supervision signal, fed by hierarchical batch sampling that mixes same-leaf, sibling-leaf, and unrelated-category products in each batch. The post is the deep-dive companion to the 2026-06-02 From Scoring to Spelling ads-retrieval ingest (sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart). That post canonicalized the SID consumer (the generative ads retriever); this post canonicalizes the SID generator (the RQ-VAE training methodology, intrinsic evaluation, and downstream uses).

Five load-bearing architectural disclosures:

Catalog taxonomy as graded supervision: rather than binary same/different contrastive labels, Instacart defines a gradient of relatedness from the taxonomy structure — same-leaf = strong positive, sibling-leaf (shared parent) = moderate positive, no shared ancestor = negative. Quote: "The signal isn't relative to any single product; it's defined by the structural distance between any pair in the taxonomy."
Hierarchical batch sampling: ~half of each batch comes from children of one randomly picked parent category (provides sibling pairs); the other half comes from unrelated categories (provides negatives); multiple products are sampled per category slot (provides same-leaf pairs). "No explicit pair labeling is needed — the catalog structure does the work."
Two flavors of codebooks: ESCI (precision) embeds raw product text through Instacart's in-house ESCI search-relevance model (Exact / Substitute / Complementary / Irrelevant), trained on query-product matching → tight clusters of direct substitutes (powers substitution, search, reordering). ESCI+Gemma (discovery) runs the product through Gemini Flash (~10× faster, ~5× cheaper than full-size models) to extract structured attributes (product type, key ingredients, dietary tags, format), strips marketing copy + ESCI-style metadata, then embeds with Gemma (off-the-shelf) → broader thematic clusters (powers homepage feeds, cross-selling, exploration). Same RQ-VAE, different embedding substrate. Quote: "Neither is universally better. The key is matching the right flavor to the right surface."
Intrinsic evaluation methodology: codes are evaluated directly, not just via downstream metrics. Three measurements: (a) similarity-depth correlation (Spearman 0.69–0.84 between embedding cosine similarity and number of shared SID levels; among ≥0.9-similarity pairs, 98–99% share L1, declining to 18–37% sharing L4); (b) LLM-based cluster scoring on three dimensions: functional coherence, purchase likelihood, and customer journey relevance; (c) taxonomy alignment: products sharing L1 should usually share a top-level category.
Code-vs-label mismatch as automated catalog audit: when SID disagrees with category label, the label is often wrong. Two examples: a "Protein Bar" labeled under Candy clusters with other protein bars in Sports Nutrition; a "Sparkling Water" labeled under Soda lands among other sparkling waters. Building a pipeline: automated mismatch flagging + cluster-fit confidence score + prioritized human-review queue. Quote: "What started as a recommendation primitive is becoming infrastructure for ongoing catalog health."
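The graded relatedness signal in the first disclosure can be sketched as a function over taxonomy paths. This is a minimal illustration, not Instacart's implementation: the numeric grades (1.0 / 0.5 / 0.0), the function name `relatedness`, and the example category paths are assumptions, and the post does not say how pairs that share only a more distant ancestor are graded (they are treated as negatives here).

```python
from typing import Tuple

# Hypothetical numeric grades; the post names the three tiers but gives no values.
STRONG_POSITIVE = 1.0    # same leaf category
MODERATE_POSITIVE = 0.5  # sibling leaves (shared parent)
NEGATIVE = 0.0           # no shared ancestor

Path = Tuple[str, ...]   # root-to-leaf taxonomy path of a product


def relatedness(path_a: Path, path_b: Path) -> float:
    """Grade a product pair purely from the structure of the taxonomy."""
    if path_a == path_b:
        return STRONG_POSITIVE
    # Sibling leaves: different leaf, identical non-empty parent path.
    if len(path_a) > 1 and path_a[:-1] == path_b[:-1]:
        return MODERATE_POSITIVE
    # Assumption: anything else (including pairs that share only a
    # grandparent) is treated as a negative in this sketch.
    return NEGATIVE


hard = ("Dairy", "Cheese", "Hard Cheese")
fresh = ("Dairy", "Cheese", "Fresh Cheese")
chips = ("Pantry", "Snacks", "Chips")

print(relatedness(hard, hard), relatedness(hard, fresh), relatedness(hard, chips))
```

Because the grade depends only on the two paths, any pair in a batch can be labeled on the fly without a pair-annotation step.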

The post also confirms (with verbatim citation) the production operational datum previously surfaced by the ads-retrieval post: +34% add-to-carts and 2.7× more emerging brands surfaced via generative retrieval over SIDs in the carousel proof-of-concept, with "Tail categories saw the largest gains, precisely because semantic IDs gave those products a representation the old model couldn't." The post discloses a previously-unstated cardinality — ~2,000 codeword tokens represent the entire catalog, the concrete vocabulary size that makes generative retrieval economical.

The architectural thesis: "semantic IDs started as a compression technique for making embeddings compatible with discrete systems. They became something more, a shared vocabulary that lets every model in our stack reason about product relationships in the same language." SIDs are not a recsys-only primitive: they already power product retrieval, replacement recommendations, and next-item prediction; the two codebook flavors cover substitution, search, and reordering (ESCI) and homepage feeds, cross-selling, and exploration (ESCI+Gemma); and product detail page recommendations, cart assistant suggestions, and ranking features are planned. Catalog audit is an emergent infrastructure use.

Key takeaways¶

Loss formula: L_total = L_reconstruction + L_rq + λ · L_contrastive with λ = 0.01. Verbatim quote: "With λ = 0.01, the contrastive term is a gentle regularizer: strong enough to improve coherence, weak enough not to destabilize reconstruction." Coarser levels (L1, L2) are weighted more heavily so broad groupings take priority. (See concepts/reconstruction-vs-semantic-loss-tradeoff.)
Vanilla RQ-VAE optimizes reconstruction fidelity, not product relationships. Without structural guidance, the quantizer produces fragmentation (substitute marinaras land in different branches) and error propagation (poorly-embedded products land among irrelevants). The catalog-structure contrastive term is the architectural fix. (See concepts/contrastive-regularization-with-catalog-structure.)
The hierarchical-codebook payoff is observable in production SIDs. Under prefix 6_19: 6_19_32 = Italian cheeses (Parmigiano, Pecorino, Mozzarella, Ricotta); 6_19_24 = specialty cheeses (Brie, Manchego, Halloumi); 6_19_12 = olives (Castelvetrano, Kalamata); 6_19_7 = tapenades; 6_19_9 = deli trays + dips; 6_19_14 = croutons. Verbatim quote: "No one wrote a rule connecting Pecorino Romano to Kalamata olives to olive tapenade. The model learned that these products inhabit the same culinary universe, spanning Dairy, Pantry, and Deli departments, by compressing their embeddings into codes that share a prefix."
Hierarchy zooms cleanly within a branch. Inside 6_19_32: 6_19_32_4 = fresh mozzarella; 6_19_32_16 = blue cheeses; 6_19_32_63 = hard Italian cheeses (Parmigiano, Pecorino, Asiago); 6_19_32_70 = ricotta salata. Substitution semantics emerge: "A customer out of Pecorino Romano might accept Parmigiano Reggiano (same L4, group 63) before reaching for Gorgonzola crumbles (different L4, 16, but same L3)."
Three failure modes the SID system addresses, named explicitly: cold start (new products with zero history get a code from day 1 via the fixed codebook); tail category coverage (recsys models skew toward popular staples; SIDs "give those products a representation the old model couldn't"); catalog quality at scale (rigid taxonomy can't flag mislabels — "the only signal is the label itself"; SIDs flag them via cluster-vs-label disagreement). (See concepts/tail-category-coverage, concepts/code-vs-label-mismatch-as-catalog-audit.)
Hierarchical sampling makes the contrastive loss work at scale. With random sampling over millions of products, "most batches would be entirely unrelated — the loss would have no positive signal to learn from." Pick a random parent → fill ~half with children of that parent → fill rest with unrelated categories → sample multiple products per slot. The catalog tree provides the pair-labels structurally rather than via expensive human labels. (See concepts/hierarchical-batch-sampling-for-contrastive-loss, patterns/contrastive-loss-via-taxonomy-tree.)
Two-flavor strategy splits precision from discovery at the embedding-substrate level. Same RQ-VAE skeleton + contrastive regularization, different upstream embedding = different cluster character. ESCI (search-relevance trained on query-product matching) → tight substitute clusters; ESCI+Gemma (Gemini-Flash-cleaned attributes → off-the-shelf Gemma embeddings) → broader lifestyle / usage clusters. "The key is matching the right flavor to the right surface." (See concepts/precision-vs-discovery-codebook-flavor, patterns/two-flavor-codebook-precision-vs-discovery.)
Lightweight LLM attribute extraction is high-ROI preprocessing. Quote: "Standardize before you embed. Lightweight attribute extraction is a high-ROI preprocessing step that reduces noise throughout the pipeline." Implementation: Gemini Flash extracts structured attributes (product type, ingredients, dietary tags, format) and strips marketing copy + sparse-metadata noise before the embedding model sees the text. (See patterns/llm-attribute-extraction-before-embedding.)
Intrinsic evaluation catches problems downstream metrics mask. Quote: "Evaluate codes directly. Downstream metrics can mask systematic quality problems. Intrinsic evaluation catches issues before they compound." Three intrinsic metrics in production: similarity-depth correlation (Spearman 0.69–0.84, ≥0.9-similarity pairs share L1 at 98–99%, share L4 at 18–37%); LLM cluster scoring on functional coherence + purchase likelihood + customer journey relevance — ESCI scores higher on substitutability, ESCI+Gemma higher on thematic coherence, matching their intended uses; taxonomy alignment — most shared-L1 products share a top-level category. (See concepts/similarity-depth-correlation, concepts/llm-based-cluster-evaluation, patterns/intrinsic-evaluation-of-discrete-codes.)
Sparse text → divergent codes is the dominant failure pattern. Two Riesling wines (0_19_52_63 vs 0_31_52_88) with 0.86 cosine similarity diverged at L2 due to sparse descriptions; a team-branded t-shirt (1_19_21_20) and generic team apparel (1_7_41_59) at 0.95 similarity matched only at L1 (one had a detailed description, the other only four words). Quote: "sparse or inconsistent text leads to degraded embeddings, which lead to divergent codes. Products with rich descriptions and complete catalog metadata produce more stable codes." Mitigation: enriching product data for sparse items is "an ongoing effort".
Code-vs-label mismatches are automated audit signal. When SIDs disagree with taxonomy labels, the label is often wrong. Examples: Protein Bar mis-filed under Candy clusters with Sports Nutrition; Sparkling Water mis-filed under Soda lands among other sparkling waters. Quote: "In each case, the semantic ID placed the product where it functionally belongs. The error was in the taxonomy, not the code." Building a pipeline: automated mismatch flagging, confidence scoring (cluster-fit vs label), prioritized human-review queues. (See patterns/semantic-code-as-catalog-audit.)
Production datum confirmed: Generative retrieval over SIDs delivered +34% add-to-carts in the product carousel proof-of-concept and surfaced products from 2.7× more emerging brands. "Tail categories saw the largest gains, precisely because semantic IDs gave those products a representation the old model couldn't." (Same datum surfaced in the From Scoring to Spelling post; this post adds the deeper framing that the tail-category lift is structurally attributable to the codebook coverage of cold/sparse products.)
Vocabulary cardinality disclosed: "With ~2,000 codeword tokens representing the entire catalog, generative retrieval becomes possible: a model that produces the semantic ID of the next relevant product, codeword by codeword, conditioned on the user's context." Concrete vocabulary-bottleneck-escape datum.
SIDs are spreading beyond ads retrieval — they're becoming Instacart's catalog-wide vocabulary. "Semantic IDs now power product retrieval, replacement recommendations, and next-item prediction across Instacart. Looking ahead, we're bringing them to product detail page recommendations, cart assistant suggestions, and ranking features, particularly to address cold start where they have the most leverage." Plus: catalog audit.
What's next (per the post): incorporating engagement-based signals (substitution patterns, co-purchase data) following PLUM's approach — i.e. extending beyond the catalog-structure-only contrastive signal to also leverage user-behavior signal.
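The hierarchical sampling recipe in the takeaways above (pick a parent, fill about half the slots from its children, fill the rest from unrelated categories, draw several products per slot) might look roughly like this. It is a sketch under stated assumptions: the two-level toy taxonomy, the function name `sample_batch`, the slot counts, and the rule that "unrelated" means "under a different parent" are all illustrative, not details from the post.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical toy taxonomy: parent category -> leaf category -> product ids.
Catalog = Dict[str, Dict[str, List[str]]]


def sample_batch(catalog: Catalog, n_slots: int = 4, per_slot: int = 2,
                 rng: Optional[random.Random] = None) -> List[Tuple[str, str, str]]:
    """Return (product_id, parent, leaf) triples for one contrastive batch."""
    rng = rng or random.Random()
    parent = rng.choice(sorted(catalog))  # pick one random parent category
    siblings = [(parent, leaf) for leaf in sorted(catalog[parent])]
    unrelated = [(p, leaf) for p in sorted(catalog) if p != parent
                 for leaf in sorted(catalog[p])]
    # About half of the category slots come from the chosen parent's children
    # (these supply sibling pairs) ...
    slots = rng.sample(siblings, min(n_slots // 2, len(siblings)))
    # ... and the rest come from leaves under other parents (negatives).
    slots += rng.sample(unrelated, min(n_slots - len(slots), len(unrelated)))
    batch = []
    for p, leaf in slots:
        products = catalog[p][leaf]
        # Several products per slot, so same-leaf pairs exist in the batch.
        for pid in rng.sample(products, min(per_slot, len(products))):
            batch.append((pid, p, leaf))
    return batch


toy_catalog: Catalog = {
    "Cheese": {"Hard": ["parmigiano", "pecorino", "asiago"],
               "Fresh": ["mozzarella", "burrata"],
               "Blue": ["gorgonzola", "stilton"]},
    "Snacks": {"Chips": ["kettle", "tortilla"], "Crackers": ["water", "rye"]},
    "Beverages": {"Soda": ["cola", "lemon-lime"],
                  "Sparkling Water": ["lime", "plain"]},
}

batch = sample_batch(toy_catalog, n_slots=4, per_slot=2, rng=random.Random(0))
for row in batch:
    print(row)
```

Every batch built this way contains same-leaf pairs, sibling pairs, and unrelated pairs, which is what keeps the contrastive term from seeing only negatives.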

Operational numbers¶

Metric	Value	Note
Codeword tokens for entire catalog	~2,000	Vocabulary size disclosure
Codewords per product	4	RQ-VAE depth `K=4` (e.g. `6_19_32_63`)
Contrastive loss weight λ	0.01	Gentle-regularizer setting
Similarity-depth correlation	0.69–0.84	Spearman; intrinsic metric
≥0.9 cosine pairs sharing L1	98–99%	Coarser levels carry more shared mass
≥0.9 cosine pairs sharing L4	18–37%	L4 distinguishes very-similar products
Gemini Flash speedup vs full-size models	~10× faster	Used for attribute extraction
Gemini Flash cost vs full-size models	~5× cheaper	Used for attribute extraction
Carousel A/B: add-to-carts uplift	+34%	SIDs + generative retrieval vs prior model
Carousel A/B: emerging-brand multiplier	2.7×	More emerging brands surfaced
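The similarity-depth correlation in the table is a Spearman rank correlation between pairwise cosine similarity and the number of SID levels a pair shares. The post does not describe how pairs were sampled, so the following is only a stdlib sketch of the statistic itself; the six (similarity, shared depth) pairs are invented toy values and the helper names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Invented toy pairs: cosine similarity and number of shared SID levels (0-4).
similarities = [0.95, 0.91, 0.80, 0.62, 0.40, 0.15]
depths = [4, 3, 3, 2, 1, 0]

rho = spearman(similarities, depths)
print(round(rho, 3))
```

Shared depth takes only five distinct values, so ties are common and the tie-averaged ranking matters.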

Architecture: training pipeline¶

                  ┌─────────────────────────────┐
                  │  Product features            │
                  │  (text, brand, attrs, size,  │
                  │   category path)             │
                  └──────────┬──────────────────┘
                             │
                ┌────────────┴───────────┐
                │                        │
                ▼                        ▼
     ┌──────────────────┐    ┌──────────────────┐
     │   ESCI flavor    │    │  ESCI+Gemma      │
     │   (precision)    │    │  flavor          │
     │                  │    │  (discovery)     │
     │ raw text →       │    │                  │
     │ in-house ESCI    │    │ Gemini Flash:    │
     │ search-relevance │    │   extract attrs  │
     │ model            │    │   (product type, │
     │ (query-product   │    │    ingredients,  │
     │  matching)       │    │    dietary,      │
     │                  │    │    format) +     │
     │                  │    │   strip marketing│
     │                  │    │ Gemma (off-shelf)│
     │                  │    │ embeds cleaned   │
     │                  │    │ representation   │
     └────────┬─────────┘    └────────┬─────────┘
              │                       │
              └───────────┬───────────┘
                          │
                          ▼
                ┌────────────────────┐
                │  Continuous        │
                │  embedding e ∈ R^d │
                └─────────┬──────────┘
                          │
                          ▼
              ┌─────────────────────────┐
              │  RQ-VAE training        │
              │                          │
              │  L_total =               │
              │    L_reconstruction      │
              │  + L_rq                  │
              │  + λ · L_contrastive     │
              │  (λ = 0.01)              │
              │                          │
              │  Hierarchical batch      │
              │  sampling: pick parent   │
              │  category → ~half batch  │
              │  from its children →     │
              │  rest from unrelated     │
              │  categories → multiple   │
              │  products per slot       │
              │                          │
              │  Contrastive signal:     │
              │  same-leaf = strong +    │
              │  sibling-leaf = mod +    │
              │  no shared ancestor = -  │
              │                          │
              │  Coarser levels (L1, L2) │
              │  weighted more heavily   │
              └─────────┬───────────────┘
                        │
                        ▼
            ┌────────────────────────┐
            │  4 hierarchical        │
            │  codebooks             │
            │  (~2,000 codeword      │
            │   tokens total)        │
            └─────────┬──────────────┘
                      │
                      ▼
        ┌──────────────────────────┐
        │  Per-product Semantic ID │
        │  e.g. 6_19_32_63          │
        │  (Hard Italian cheeses)   │
        └──────────────────────────┘
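The last stages of the pipeline above (continuous embedding, codebooks, per-product Semantic ID) can be illustrated with a toy residual quantizer. This sketch covers only the encoding step with fixed codebooks, which is also why a new product can receive a code on day one; it does not train anything. The codebook values, the two-level depth (the post uses four levels), and the function names are made up for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: List[Vector], v: Vector) -> int:
    """Index of the codeword with the smallest squared distance to v."""
    def sq_dist(c: Vector) -> float:
        return sum((ci - vi) ** 2 for ci, vi in zip(c, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def encode(embedding: Vector, codebooks: List[List[Vector]]) -> str:
    """Greedy residual quantization: each level quantizes what is left over."""
    residual = list(embedding)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword; the next level refines the remainder.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return "_".join(str(i) for i in codes)


# Hypothetical hand-written 2-D codebooks (real ones are learned by the RQ-VAE).
codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],               # level 1: coarse position
    [[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],  # level 2: refinement
]

print(encode([5.0, 4.1], codebooks))   # near (4, 4), remainder points right
print(encode([3.9, 5.0], codebooks))   # near (4, 4), remainder points up
print(encode([0.2, -0.1], codebooks))  # near the origin
```

The first two embeddings share their level-1 code and differ only at level 2, the same coarse-to-fine behavior the prefix examples in this page rely on.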

Architecture: downstream uses¶

                    Semantic IDs
        (~2,000 codeword tokens, 4 levels deep)
                       │
                       ▼
       ┌───────────────┼───────────────┐
       │               │               │
       ▼               ▼               ▼
  ┌─────────┐    ┌─────────┐    ┌────────────┐
  │ESCI SIDs│    │ESCI+Gemma│   │Both flavors│
  │precision│    │  SIDs    │   │(per surface│
  │         │    │ discovery│   │ choice)    │
  └────┬────┘    └────┬────┘    └─────┬──────┘
       │              │                │
       ▼              ▼                ▼
  ┌─────────────────────────────────────────┐
  │  Substitution / search / reordering      │  ← ESCI
  │  Homepage feeds / cross-selling /        │  ← ESCI+Gemma
  │    exploration                           │
  │  Generative ads retrieval                │  ← (paired post)
  │  Replacement recommendations              │
  │  Next-item prediction                    │
  │  PDP recommendations (planned)           │
  │  Cart assistant suggestions (planned)    │
  │  Ranking features (planned)              │
  │  ─────────────────────────────────────   │
  │  Catalog audit (emergent):               │
  │    code-vs-label mismatch flagging       │
  │    cluster-fit confidence scoring        │
  │    prioritized human-review queues       │
  └─────────────────────────────────────────┘
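The catalog-audit branch of the diagram (mismatch flagging, confidence scoring, prioritized review queue) can be sketched as follows. The post does not disclose its scoring algorithm, so this stand-in uses the majority label's share within an SID-prefix cluster as the confidence; the SIDs, product names, category labels, the threshold, and the function name `flag_mismatches` are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Iterable, List, NamedTuple


class Product(NamedTuple):
    name: str
    sid: str
    category: str  # taxonomy label as currently filed


class Flag(NamedTuple):
    name: str
    labeled: str
    suggested: str
    confidence: float


def flag_mismatches(products: Iterable[Product], prefix_levels: int = 3,
                    min_confidence: float = 0.6) -> List[Flag]:
    """Flag products whose label disagrees with their SID cluster's majority."""
    clusters = defaultdict(list)
    for p in products:
        prefix = "_".join(p.sid.split("_")[:prefix_levels])
        clusters[prefix].append(p)
    flags = []
    for members in clusters.values():
        counts = Counter(p.category for p in members)
        majority, n = counts.most_common(1)[0]
        confidence = n / len(members)  # stand-in score, not the post's method
        if confidence < min_confidence:
            continue  # cluster itself is too mixed to trust
        flags.extend(Flag(p.name, p.category, majority, confidence)
                     for p in members if p.category != majority)
    # Highest-confidence disagreements go to the front of the review queue.
    return sorted(flags, key=lambda f: f.confidence, reverse=True)


products = [
    Product("Whey Protein Bar", "3_5_2_10", "Sports Nutrition"),
    Product("Peanut Protein Bar", "3_5_2_11", "Sports Nutrition"),
    Product("Vegan Protein Bar", "3_5_2_12", "Sports Nutrition"),
    Product("Oat Protein Bar", "3_5_2_13", "Sports Nutrition"),
    Product("Protein Bar", "3_5_2_14", "Candy"),
    Product("Lime Sparkling Water", "2_8_1_3", "Water"),
    Product("Plain Sparkling Water", "2_8_1_4", "Water"),
    Product("Berry Sparkling Water", "2_8_1_5", "Water"),
    Product("Sparkling Water", "2_8_1_6", "Soda"),
]

flags = flag_mismatches(products)
for f in flags:
    print(f)
```

A flag here is only a suggestion for human review; the threshold trades review volume against the risk of missing mislabels.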

Two-flavor strategy: ESCI vs ESCI+Gemma¶

The two flavors share the same RQ-VAE quantizer, contrastive regularization, and catalog supervision; they differ only in which embedding the quantizer compresses. The post frames this as the load-bearing design lever:

"The embedding is the decision. The RQ-VAE compresses whatever structure the embedding space gives it. Choose your embedding based on the business problem."

Flavor	Upstream	Use cases	Cluster character
ESCI (precision)	Raw product text → in-house ESCI search-relevance model (trained on query-product matching)	substitution, search, reordering	tight substitute clusters; e.g. Whole Bean Coffee (`0_8_55_72`) where every item is a medium roast from a different brand, interchangeable
ESCI+Gemma (discovery)	Gemini Flash extracts structured attributes (product type, key ingredients, dietary tags, format) + strips marketing copy → Gemma (off-the-shelf) embeds cleaned representation	homepage feeds, cross-selling, exploration	broader thematic clusters; lifestyle / usage patterns

Quote on the LLM-attribute-extraction step: "It first runs the product through Gemini Flash (~10x faster, ~5x cheaper than full-size models) to extract structured attributes (product type, key ingredients, dietary tags, format), stripping away marketing copy along with the metadata used for ESCI. It then embeds that cleaned representation with Gemma, an off-the-shelf embedding model. The goal is to test whether a general-purpose model, given cleaner inputs, can capture nuances that a domain-specific model misses."

LLM-cluster-evaluation results (per the post): "ESCI scores higher on substitutability; ESCI+Gemma excels at thematic coherence, matching their intended use cases."
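For the precision flavor, one plausible way to use the codes for substitution is to rank candidates by how many leading SID levels they share with the out-of-stock product. The sketch below uses the cheese codes quoted earlier on this page; the ranking rule itself and the function names are assumptions rather than a described Instacart API, and the olive entry carries only its three-level prefix because that is all the post gives.

```python
from typing import Dict, List, Tuple


def shared_depth(sid_a: str, sid_b: str) -> int:
    """Number of leading SID levels two codes have in common."""
    depth = 0
    for a, b in zip(sid_a.split("_"), sid_b.split("_")):
        if a != b:
            break
        depth += 1
    return depth


def rank_substitutes(target_sid: str,
                     candidates: Dict[str, str]) -> List[Tuple[str, int]]:
    """Order candidates by shared prefix depth, deepest match first."""
    scored = [(name, shared_depth(target_sid, sid))
              for name, sid in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


pecorino_sid = "6_19_32_63"  # hard Italian cheeses
candidates = {
    "Gorgonzola crumbles": "6_19_32_16",  # blue cheeses: same L3, different L4
    "Kalamata olives": "6_19_12",         # only the L3 prefix is given in the post
    "Parmigiano Reggiano": "6_19_32_63",  # same L4 group as Pecorino
}

ranked = rank_substitutes(pecorino_sid, candidates)
for name, depth in ranked:
    print(depth, name)
```

This reproduces the ordering in the post's example: Parmigiano Reggiano before Gorgonzola crumbles, with olives further out in the same branch.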

Caveats¶

No engagement-signal contrastive loss yet — only catalog-tree supervision. The post explicitly identifies engagement-based contrastive signals (substitution patterns, co-purchase data, PLUM-style behavioral alignment) as future work.
Cluster-fit confidence-scoring algorithm not disclosed — the catalog-audit pipeline names automated mismatch flagging + confidence scoring + review queues but does not specify the scoring algorithm.
Codebook stability across re-training cadences not addressed — if a product's SID changes between codebook versions, downstream consumers can suffer version skew; the post does not disclose how Instacart manages this.
First-codeword-distribution skew is structurally important (since beam search begins at L1) but not explicitly characterized in the post.
Gemini Flash version not specified — "Gemini Flash" without version qualifier; could be 2.x or 3.x.
Gemma version not specified — "an off-the-shelf embedding model"; specific Gemma variant + size not disclosed.
Loss-weight schedule beyond λ = 0.01 not disclosed (do coarser-level weights w_L1 > w_L2 > w_L3 > w_L4 follow a fixed ratio? geometric? hand-tuned?).
No large-scale catalog-audit deployment numbers — the audit pipeline is described as "building" / "becoming infrastructure", suggesting it's not yet shipped at full scale; no precision / recall / coverage figures for taxonomy correction.
Other RQ-VAE hyperparameters (codebook commitment loss weight, EMA decay for codebook updates, dead-codeword handling) not disclosed.
Reconstruction-vs-semantic loss balance is asserted as gentle ("strong enough to improve coherence, weak enough not to destabilize reconstruction") but no ablation numbers.
Similarity-depth correlation methodology: post says Spearman 0.69–0.84 across some unspecified set of pairs — sample size, pair-construction methodology, and across-flavor comparison detail not disclosed.
LLM-cluster-evaluation methodology: prompt template, judge model, calibration method, inter-annotator-agreement-style validation against humans not disclosed.
Cold-start datum is from carousel A/B: the +34% add-to-carts + 2.7× emerging brands datum is the same one disclosed in the paired ads-retrieval post; this post adds the framing but no new numbers.
PLUM behavioral-alignment as inspiration disclosed: post acknowledges "Inspired by PLUM's behavioral alignment approach, we added a contrastive term to RQ-VAE training, using our catalog taxonomy as the supervision signal rather than engagement data (which isn't available for cold-start products)." — the catalog-tree variant is positioned as the cold-start-compatible alternative to PLUM's engagement-data-driven approach.

References cited¶

The post lists six references:

TIGER — Generative Retrieval for Recommendations (Google DeepMind, NeurIPS 2023) — RQ-VAE for semantic IDs.
PLUM — Pre-trained Language Models for Industrial-scale Generative Recommendations (YouTube) — behavior-aligned codebooks; the inspiration for adding a contrastive term.
Mender — Generative Recommendation with Mixed Semantic Enhancement — mixed semantic enhancement.
BBQRec — Behavior-Bind Quantization for Multi-Modal Sequential Recommendation — multi-modal signals informing quantization.
How Instacart Uses Embeddings to Improve Search Relevance (tech.instacart.com) — internal cite for ESCI.
Eugene Yan: Semantic IDs — survey.

Source¶

companies/instacart — Tier-2 source, tenth Instacart story on the wiki.
systems/instacart-semantic-ids — production substrate. Major extension this ingest.
systems/rq-vae — algorithmic substrate. Major extension this ingest.
systems/instacart-esci-model — created this ingest; the in-house search-relevance embedding flavor.
systems/gemma — off-the-shelf embedding model used in the discovery flavor.
systems/gemini — Gemini Flash for attribute extraction.
concepts/semantic-id / concepts/atomic-product-id-vs-semantic-id / concepts/vocabulary-bottleneck / concepts/cold-start / concepts/generative-retrieval — core supporting concepts.
concepts/contrastive-regularization-with-catalog-structure / concepts/hierarchical-batch-sampling-for-contrastive-loss / concepts/similarity-depth-correlation / concepts/llm-based-cluster-evaluation / concepts/code-vs-label-mismatch-as-catalog-audit / concepts/precision-vs-discovery-codebook-flavor / concepts/reconstruction-vs-semantic-loss-tradeoff / concepts/tail-category-coverage — concepts created this ingest.
patterns/rq-vae-codebook-as-product-vocabulary / patterns/contrastive-loss-via-taxonomy-tree / patterns/two-flavor-codebook-precision-vs-discovery / patterns/llm-attribute-extraction-before-embedding / patterns/intrinsic-evaluation-of-discrete-codes / patterns/semantic-code-as-catalog-audit — patterns created or extended this ingest.
sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — companion ads-retrieval post; the SID consumer.