SYSTEM Cited by 2 sources

Instacart Semantic IDs (SIDs)

Definition

Instacart Semantic IDs (SIDs) are Instacart's product-vocabulary substrate for generative ads retrieval: each catalog product is represented as a short sequence of codewords (typically 4 tokens) generated by an RQ-VAE trained on product features, where semantically similar products share prefixes.

Quote (Source: sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart):

"Instacart Semantic IDs, SIDs, replace atomic product IDs with short sequences of codewords generated by an RQ-VAE. A product's SID looks like 35_7_120_184: four tokens from learned codebooks at different granularity levels."

What SIDs replace

SIDs replace atomic product IDs as the vocabulary unit consumed by Instacart's recommendation models. Specifically, the generative ads retrieval model decodes into SIDs token-by-token instead of scoring atomic-product-ID logits across the full catalog. See concepts/atomic-product-id-vs-semantic-id for the canonical statement of this substrate trade-off.

Prefix-sharing semantic similarity

The post discloses three real example SIDs sharing structure:

SID | Product
35_7_119_493 | Organic Good Seed Thin Sliced
35_7_120_184 | Artisanal Italian Bread
35_7_120_185 | Classic Italian Bread

All three products share the first two codewords (35_7_…) — shared bread / bakery semantic neighbourhood. The latter two share the first three codewords (35_7_120_…) — same Italian-bread sub-category. The fourth codeword distinguishes the artisanal vs classic variants.
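The prefix relationships among the three disclosed SIDs can be checked mechanically. The sketch below is illustrative only: the helper name `shared_prefix_depth` is an assumption, and the only facts taken from the post are the three SID strings and the underscore-separated format.

```python
def shared_prefix_depth(sid_a: str, sid_b: str) -> int:
    """Count how many leading codewords two SIDs have in common."""
    depth = 0
    for a, b in zip(sid_a.split("_"), sid_b.split("_")):
        if a != b:
            break
        depth += 1
    return depth


# The three example SIDs disclosed in the post.
good_seed = "35_7_119_493"
artisanal = "35_7_120_184"
classic = "35_7_120_185"

print(shared_prefix_depth(good_seed, artisanal))  # 2
print(shared_prefix_depth(artisanal, classic))    # 3
```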

The implication for the generative retriever: when the model has generated 35_7_120_… as the first three codewords, beam search can only reach Italian-bread products in the final step, which enforces a hierarchical retrieval discipline that the prior flat-distribution scoring model lacked.
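One common way to keep token-by-token decoding on valid SIDs is a prefix index that lists which codewords may follow each partial SID. The post does not say whether Instacart's beam search works this way, so the sketch below is an assumption; `build_prefix_index` is a hypothetical helper and the catalog is just the three example SIDs.

```python
from collections import defaultdict


def build_prefix_index(sids):
    """Map each SID prefix (tuple of codewords) to the codewords that may follow it."""
    allowed = defaultdict(set)
    for sid in sids:
        tokens = tuple(sid.split("_"))
        for i, tok in enumerate(tokens):
            allowed[tokens[:i]].add(tok)
    return allowed


catalog_sids = ["35_7_119_493", "35_7_120_184", "35_7_120_185"]
allowed_next = build_prefix_index(catalog_sids)

# After generating 35_7, only the two bread sub-neighbourhoods remain reachable.
print(sorted(allowed_next[("35", "7")]))         # ['119', '120']
# After generating 35_7_120, only the two Italian-bread SIDs remain reachable.
print(sorted(allowed_next[("35", "7", "120")]))  # ['184', '185']
```

During beam search, a decoder could mask every logit whose token is not in `allowed_next[prefix]`, so each extra codeword narrows the candidate set.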

Three load-bearing properties

Per the Source:

1. Coverage to every catalog item, regardless of purchase history

"SIDs provide coverage to every item in the catalog, regardless of whether it has a historical purchase history. A new product entering the catalog is added to one of the existing SIDs and is visible to the model from day one."

Directly addresses recsys cold-start for new products: the codebook is fixed, every new product is encoded into existing codewords, and the generative retriever can produce that SID's prefix path on day 1 without any transaction history.
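To illustrate why a frozen codebook gives day-one coverage, here is a minimal residual-quantization sketch. It assumes the new product has already been mapped to a latent vector (a real RQ-VAE would do that with its trained encoder). The codebook values, the two-level depth, and the function names are all hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(c):
        return sum((x - y) ** 2 for x, y in zip(c, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def encode(vec, codebooks):
    """Residual quantization: pick a codeword per level, then quantize what is left."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return "_".join(str(t) for t in tokens)


# Toy 2-level, 2-D codebooks (hypothetical values, frozen after training).
codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],  # level 1: coarse
    [[-1.0, 0.0], [1.0, 0.0]],   # level 2: fine
]

new_product_latent = [11.2, 9.9]  # latent of a product with no purchase history
print(encode(new_product_latent, codebooks))  # 1_1
```

No transaction data enters the encoding, which is the property the quote relies on.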

2. Generalisation over memorisation

"The model learns to generalize sequences better based on semantic codewords instead of simply learning specific product co-occurrences."

The prior model's atomic-ID vocabulary tended to memorise "co-occurrences instead of learning generalized associations based on the user's intent" — favouring high-frequency staples (milk) over context-aligned tail products (an emerging brand's mustard for a barbecue cart). Codebook-shared prefixes force the model to generalise over the codeword space rather than overfit individual product IDs.

3. Embedding-parameter compression

"The embedding parameter space within the model is decreased by 125x."

The atomic-product-ID embedding table sized to the full catalog (the vocabulary bottleneck) is replaced by a much smaller embedding table sized to the codebook union. This is the engineering property that makes the generative serving stack viable.

Architecture (what we know)

The post discloses that SIDs are produced by an RQ-VAE and links a separate companion write-up — Semantic IDs: Product Understanding at Scale — for the full design space. Architectural details disclosed in this post:

  • 4 codewords per product (the example SID has 4 positions).
  • Codebooks at different granularity levels: "four tokens from learned codebooks at different granularity levels", implying a hierarchical codebook structure (coarse → fine) consistent with the TIGER paper's RQ-VAE design.
  • Per-product encoding deterministic — given product features, the SID is fixed (so prefixes can carry stable semantic meaning across model updates).
  • Catalog can be re-encoded without retraining the consuming model: new products map to existing codewords; existing products presumably keep their SIDs across codebook re-training cadences (otherwise the retrieval consumer would suffer version skew).

Training methodology (2026-06-02 deep-companion disclosure)

The 2026-06-02 Semantic IDs: Product Understanding at Scale post discloses the SID generation methodology. The RQ-VAE is trained with a contrastive regularization term that uses Instacart's catalog taxonomy as graded supervision.

Loss formula

L_total = L_reconstruction + L_rq + λ · L_contrastive

Where (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale):

  • L_reconstruction — autoencoder reconstruction term.
  • L_rq — RQ-VAE residual-quantization commitment loss.
  • L_contrastive — catalog-structure contrastive term (see concepts/contrastive-regularization-with-catalog-structure).
  • λ = 0.01: "a gentle regularizer: strong enough to improve coherence, weak enough not to destabilize reconstruction."

Coarser codebook levels (L1, L2) are weighted more heavily within the contrastive term, "so broad groupings take priority" — this is the architectural reason a generated SID's first codeword encodes a coarse semantic neighbourhood. (See concepts/reconstruction-vs-semantic-loss-tradeoff.)
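The way the three terms combine can be sketched with plain numbers. Only the formula and λ = 0.01 come from the source. The per-level weights below are hypothetical, since the post says only that L1 and L2 are weighted more heavily, and the function names are illustrative.

```python
LAMBDA = 0.01  # contrastive weight disclosed in the source


def weighted_contrastive(per_level_losses, level_weights):
    """Combine per-level contrastive losses; coarser levels get larger weights."""
    return sum(w * loss for w, loss in zip(level_weights, per_level_losses))


def total_loss(l_reconstruction, l_rq, l_contrastive, lam=LAMBDA):
    """L_total = L_reconstruction + L_rq + lambda * L_contrastive."""
    return l_reconstruction + l_rq + lam * l_contrastive


# Hypothetical weights for L1..L4 (assumption: the actual values are not disclosed).
level_weights = [4.0, 2.0, 1.0, 1.0]
per_level = [0.5, 0.5, 0.5, 0.5]

l_con = weighted_contrastive(per_level, level_weights)  # 2.0 + 1.0 + 0.5 + 0.5
print(total_loss(1.0, 0.25, l_con))  # about 1.29
```

With these toy values the contrastive term contributes 0.04 of a total near 1.29, which matches the "gentle regularizer" description.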

Catalog-tree contrastive supervision

Pair labels from taxonomy distance:

Pair relationship | Contrastive label
Same leaf category (two marinara sauces) | Strong positive
Sibling leaf, shared parent (marinara + alfredo, both under "Pasta & Pizza Sauces") | Moderate positive
No shared ancestor ("Pasta Sauce" vs "Office Supplies") | Negative

Quote: "The signal isn't relative to any single product; it's defined by the structural distance between any pair in the taxonomy." The choice is explicit: catalog-structure supervision works for cold-start products where engagement-data supervision (PLUM-style) isn't available.
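A minimal sketch of the grading rule in the table above, assuming each product carries its taxonomy path from top-level category down to leaf. The path representation, the "Pantry" top-level node, and the `pair_label` name are assumptions. Relationships the table does not cover (for example a shared grandparent only) are returned as "unspecified" rather than guessed.

```python
def pair_label(path_a, path_b):
    """Grade a product pair by where their taxonomy paths (root -> leaf) diverge."""
    if path_a == path_b:
        return "strong_positive"    # same leaf category
    if path_a[0] != path_b[0]:
        return "negative"           # no shared ancestor
    if path_a[:-1] == path_b[:-1]:
        return "moderate_positive"  # sibling leaves under one parent
    return "unspecified"            # not covered by the source's table


# Hypothetical taxonomy paths; only the category names in quotes come from the post.
marinara = ("Pantry", "Pasta & Pizza Sauces", "Marinara")
alfredo = ("Pantry", "Pasta & Pizza Sauces", "Alfredo")
pens = ("Office Supplies", "Pens")

print(pair_label(marinara, marinara))  # strong_positive
print(pair_label(marinara, alfredo))   # moderate_positive
print(pair_label(marinara, pens))      # negative
```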

Hierarchical batch sampling

The contrastive loss requires each batch to contain all three pair classes. Random sampling over millions of items would produce batches that are "entirely unrelated — the loss would have no positive signal to learn from." The fix (concepts/hierarchical-batch-sampling-for-contrastive-loss):

  1. Pick a random parent category (e.g. Pasta & Pizza Sauces).
  2. Fill ~half the batch with products from its child categories → sibling-pair signal.
  3. Fill the rest with products from unrelated categories → hard negatives.
  4. Sample multiple products per category slot → same-leaf-pair signal automatically.

Quote: "No explicit pair labeling is needed — the catalog structure does the work."
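The four sampling steps can be sketched as follows. This is a toy version under stated assumptions: a two-level taxonomy held in plain dicts, "unrelated" meaning any leaf under a different parent, and hypothetical product ids. The source describes the procedure but not its implementation.

```python
import random


def sample_batch(children_by_parent, products_by_leaf, batch_size, per_leaf=2, rng=random):
    """Build one batch: about half from one parent's children, the rest from other parents."""
    parent = rng.choice(sorted(children_by_parent))          # step 1
    related = children_by_parent[parent]
    unrelated = [leaf for p, leaves in children_by_parent.items()
                 if p != parent for leaf in leaves]

    def fill(leaves, n):
        picked = []
        while len(picked) < n:
            items = products_by_leaf[rng.choice(leaves)]
            # Several products per leaf slot yield same-leaf pairs (step 4).
            picked.extend(rng.sample(items, min(per_leaf, len(items))))
        return picked[:n]

    half = batch_size // 2
    # Steps 2 and 3: sibling-category half, then unrelated-category half.
    return parent, fill(related, half) + fill(unrelated, batch_size - half)


# Hypothetical toy catalog.
children_by_parent = {
    "Pasta & Pizza Sauces": ["Marinara", "Alfredo"],
    "Office Supplies": ["Pens", "Paper"],
}
products_by_leaf = {
    "Marinara": ["m1", "m2", "m3"],
    "Alfredo": ["a1", "a2"],
    "Pens": ["p1", "p2"],
    "Paper": ["pa1", "pa2"],
}

parent, batch = sample_batch(children_by_parent, products_by_leaf, batch_size=8)
print(parent, batch)
```

Pair labels never need to be stored: they can be derived from each sampled product's category when the loss is computed.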

Vocabulary cardinality

Disclosed cardinality: ~2,000 codeword tokens represent the entire catalog. Quote:

"With ~2,000 codeword tokens representing the entire catalog, generative retrieval becomes possible: a model that produces the semantic ID of the next relevant product, codeword by codeword, conditioned on the user's context."

This is the concrete vocabulary-size datum that makes generative retrieval economical — far smaller than the catalog-bounded vocabulary that atomic IDs would require.

Two flavors: precision vs discovery

Instacart runs two parallel codebooks sharing the same RQ-VAE + contrastive loss + catalog supervision but using different upstream embeddings. (See concepts/precision-vs-discovery-codebook-flavor + patterns/two-flavor-codebook-precision-vs-discovery.)

Flavor | Upstream embedding | Cluster character | Use cases
ESCI (precision) | Raw product text → in-house ESCI search-relevance model (trained on query-product matching: Exact / Substitute / Complementary / Irrelevant) | Tight substitute clusters; e.g. Whole Bean Coffee (0_8_55_72), where every item is a medium roast from a different brand | Substitution, search, reordering
ESCI+Gemma (discovery) | Gemini Flash (~10× faster, ~5× cheaper than full-size) extracts structured attributes (product type, key ingredients, dietary tags, format) and strips marketing copy → Gemma (off-the-shelf) embeds the cleaned representation | Broader thematic clusters that capture lifestyle and usage patterns | Homepage feeds, cross-selling, exploration

Quote: "Neither is universally better. The key is matching the right flavor to the right surface." The architectural insight: "The embedding is the decision. The RQ-VAE compresses whatever structure the embedding space gives it. Choose your embedding based on the business problem."

The LLM-attribute-extraction preprocessing step is what makes the off-the-shelf Gemma embedding competitive: "a general-purpose model, given cleaner inputs, can capture nuances that a domain-specific model misses."

Intrinsic evaluation suite

Three intrinsic metrics run in parallel, evaluating codes directly rather than relying solely on downstream metrics. (See patterns/intrinsic-evaluation-of-discrete-codes.)

Similarity-depth correlation (quantitative)

Spearman correlation between continuous embedding cosine-similarity and shared SID prefix depth. Production codebooks: 0.69–0.84. Stratified by similarity:

Cosine ≥0.9 pairs | Share L1 | Share L4
Share of pairs | 98–99% | 18–37%

Hierarchical decline is the expected shape — most similar pairs share coarse neighborhoods; only the very-most-similar share the finest-distinguishing level. (See concepts/similarity-depth-correlation.)
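For readers who want to reproduce the metric on their own data, here is a dependency-free sketch of Spearman correlation (Pearson correlation over tie-averaged ranks) applied to cosine similarity versus shared prefix depth. The four data points are made up for illustration and are not Instacart figures.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical product pairs: embedding cosine similarity and shared SID prefix depth.
cosine_sims = [0.95, 0.90, 0.60, 0.30]
prefix_depths = [3, 4, 1, 0]
print(spearman(cosine_sims, prefix_depths))  # approximately 0.8
```

Prefix depth takes only five values (0 to 4) for four-codeword SIDs, so ties are the norm on real data, which is why the ranks are tie-averaged.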

LLM-based cluster evaluation (qualitative)

LLM judges score each leaf cluster on three dimensions:

  • Functional coherence (substitute-axis)
  • Purchase likelihood (co-purchase axis)
  • Customer journey relevance (context axis)

Used to discriminate flavors: "ESCI scores higher on substitutability; ESCI+Gemma excels at thematic coherence, matching their intended use cases." (See concepts/llm-based-cluster-evaluation.)

Taxonomy alignment (structural)

Most products sharing L1 share a top-level category. Disagreements become catalog-audit signal (see Catalog-audit dual-use below).

Catalog-audit dual-use

A surprising downstream use: when SIDs disagree with taxonomy labels, the label is often wrong. Examples disclosed:

  • A Protein Bar filed under Candy clusters with other protein bars in Sports Nutrition.
  • A Sparkling Water filed under Soda lands among other sparkling waters.

Quote: "In each case, the semantic ID placed the product where it functionally belongs. The error was in the taxonomy, not the code."

In-progress catalog-audit pipeline:

  • Automated mismatch flagging.
  • Cluster-fit confidence scoring.
  • Prioritized human-review queues.

Quote: "What started as a recommendation primitive is becoming infrastructure for ongoing catalog health." (See concepts/code-vs-label-mismatch-as-catalog-audit + patterns/semantic-code-as-catalog-audit.)
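A simple version of automated mismatch flagging might group products by SID prefix and flag any product whose taxonomy label differs from its cluster's majority label. The source names the pipeline stages but not their logic, so the majority-vote rule, the record layout, the SIDs, and the product names below are all hypothetical.

```python
from collections import Counter, defaultdict


def flag_mismatches(products, prefix_len=1):
    """Flag products whose category differs from the majority of their SID-prefix cluster."""
    clusters = defaultdict(list)
    for p in products:
        prefix = "_".join(p["sid"].split("_")[:prefix_len])
        clusters[prefix].append(p)

    flagged = []
    for members in clusters.values():
        counts = Counter(m["category"] for m in members)
        majority, votes = counts.most_common(1)[0]
        for m in members:
            if m["category"] != majority:
                # The last field is the majority share, a crude cluster-fit confidence.
                flagged.append((m["name"], m["category"], majority, votes / len(members)))
    return flagged


# Hypothetical records echoing the protein-bar example.
products = [
    {"name": "Peanut Protein Bar", "sid": "12_3_40_1", "category": "Sports Nutrition"},
    {"name": "Whey Protein Bar", "sid": "12_3_40_2", "category": "Sports Nutrition"},
    {"name": "Vegan Protein Bar", "sid": "12_3_40_3", "category": "Sports Nutrition"},
    {"name": "Chewy Protein Bar", "sid": "12_3_40_4", "category": "Candy"},
    {"name": "Cola", "sid": "20_1_5_9", "category": "Soda"},
]

print(flag_mismatches(products, prefix_len=3))
# [('Chewy Protein Bar', 'Candy', 'Sports Nutrition', 0.75)]
```

Sorting flagged items by the confidence field would give a rough prioritized review queue.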

Failure modes

Two named divergent-code failure cases (Source: same):

Pair | Cosine sim | Shared prefix
Two Riesling wines (0_19_52_63 vs 0_31_52_88) | 0.86 | L1 only (mismatched at L2)
Team t-shirt vs generic team apparel (1_19_21_20 vs 1_7_41_59) | 0.95 | L1 only

Root cause for both: sparse text — one product had detailed descriptions, the other only four words. Quote: "sparse or inconsistent text leads to degraded embeddings, which lead to divergent codes. Products with rich descriptions and complete catalog metadata produce more stable codes." Mitigation: enriching product data for sparse items is "an ongoing effort".

Production datum

From the carousel proof-of-concept (Source: same):

  • +34% add-to-carts with generative retrieval over SIDs vs prior model.
  • 2.7× more emerging brands surfaced.
  • "Tail categories saw the largest gains, precisely because semantic IDs gave those products a representation the old model couldn't." (See concepts/tail-category-coverage.)

The companion ads-retrieval post quantifies category-conditional diversity lifts: +421% Alcohol, +396% Beverages, +229% Healthcare.

Spreading beyond ads retrieval

Per the 2026-06-02 deep-companion post, SIDs now power:

  • Product retrieval
  • Replacement recommendations
  • Next-item prediction

Roadmap:

  • Product detail page recommendations
  • Cart assistant suggestions
  • Ranking features
  • Catalog audit pipeline (above)

Quote: "Looking ahead, we're bringing them to product detail page recommendations, cart assistant suggestions, and ranking features, particularly to address cold start where they have the most leverage." The SID system is becoming Instacart's catalog-wide shared vocabulary across surfaces.

What SIDs don't do

  • They are not unique product identifiers — multiple products can share an SID (the post: "This essentially compresses the product vocabulary, as multiple very similar products are represented by a single SID"). The retailer-partitioned index (concepts/retailer-partitioned-index) maps a generated SID to the actual ad-eligible product candidates.
  • They are not embeddings — they are discrete codeword sequences that index into learned embeddings inside the consuming model.
  • They are not specific to ads retrieval — the post positions SIDs as a general product-understanding substrate, with the generative ads retrieval model being the first production consumer.
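Because one SID can stand for several near-identical products, a lookup step is needed between the generated SID and concrete candidates. The sketch below shows one plausible shape for such a lookup. The class, method names, retailer ids, and product ids are hypothetical, and the source does not describe the real index's structure.

```python
from collections import defaultdict


class RetailerPartitionedIndex:
    """Toy index: retailer -> SID -> product ids that share that SID."""

    def __init__(self):
        self._by_retailer = defaultdict(lambda: defaultdict(list))

    def add(self, retailer, sid, product_id):
        self._by_retailer[retailer][sid].append(product_id)

    def candidates(self, retailer, sid):
        """Return the products behind a generated SID for one retailer (may be empty)."""
        sids = self._by_retailer.get(retailer)
        return list(sids.get(sid, [])) if sids else []


index = RetailerPartitionedIndex()
index.add("retailer_a", "35_7_120_184", "prod_001")
index.add("retailer_a", "35_7_120_184", "prod_002")  # second product, same SID
index.add("retailer_b", "35_7_120_184", "prod_003")

print(index.candidates("retailer_a", "35_7_120_184"))  # ['prod_001', 'prod_002']
print(index.candidates("retailer_c", "35_7_120_184"))  # []
```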

Design space (deferred)

The Source explicitly identifies SID quality as the load-bearing lever for downstream improvement:

"The quality of the codebook is fundamental to everything downstream, impacting retrieval precision, brand diversity, and coherence. Future improvements include multi-resolution codebooks, co-occurrence contrastive regularization, and incorporating dietary constraints into the initial codebook level. A full design space is covered in our companion post."

These are explicit design-space directions:

  • Multi-resolution codebooks — codebooks at multiple granularity scales beyond the current hierarchical 4-codeword shape.
  • Co-occurrence contrastive regularisation — training-time loss to push together SIDs of co-purchased products and apart SIDs of unrelated products.
  • Dietary constraints in initial codebook level — encoding attributes like vegan / gluten-free / kosher into the first codeword position so beam search can be constrained at the earliest decoding step.

Caveats

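  • Per-position codebook size (number of codewords at each level) is not stated; the companion post discloses only a total of ~2,000 codeword tokens (see Vocabulary cardinality above).
  • RQ-VAE training data and encoder architecture are not disclosed in the ads-retrieval post; the loss function is covered by the companion post (see Training methodology above).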
  • Re-training cadence not disclosed; the post does not address how prefix-stability is maintained across codebook regeneration.
  • SID assignment for new products at runtime not addressed (online vs. periodic batch).
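  • The ads-retrieval post does not say whether SIDs are shared across surfaces (ads vs. organic recommendations vs. search). The prior CR model served "both ads and organic content", and the companion post lists product retrieval, replacement recommendations, and next-item prediction as current consumers (see Spreading beyond ads retrieval above).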

Seen in

  • sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale (deep-companion disclosure): SID generation methodology (RQ-VAE training with catalog-tree contrastive regularization + hierarchical batch sampling), two-flavor design (ESCI precision + ESCI+Gemma discovery), intrinsic evaluation suite (similarity-depth correlation + LLM cluster evaluation + taxonomy alignment), and catalog-audit dual-use via code-vs-label mismatch detection. Discloses ~2,000 codeword tokens for the entire catalog, λ=0.01 contrastive weight, Spearman 0.69–0.84 similarity-depth correlation, sparse-text divergent-code failure cases (Riesling, t-shirt examples), Protein-Bar / Sparkling-Water mislabel-detection cases, and forward roadmap including engagement-based contrastive signals.
  • sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — first canonical wiki disclosure (the consumer side); SIDs as the vocabulary substrate for the generative ads retriever.