CONCEPT Cited by 2 sources
Semantic ID¶
Definition¶
A Semantic ID is a discrete-token identifier for a recommendable item, encoded as a short sequence of codewords from a learned hierarchical codebook, where semantically similar items share codeword prefixes. Semantic IDs are the vocabulary substrate that makes generative retrieval economical and structurally sensible.
The canonical example (Source: sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart):
| SID | Product |
|---|---|
| 35_7_119_493 | Organic Good Seed Thin Sliced |
| 35_7_120_184 | Artisanal Italian Bread |
| 35_7_120_185 | Classic Italian Bread |
Shared 35_7_… prefix = bread / bakery semantic neighbourhood.
Shared 35_7_120_… prefix = Italian-bread sub-category.
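The prefix relationships above can be checked mechanically. The following is a minimal sketch, assuming SIDs are stored as underscore-separated strings as in the table; `shared_prefix_depth` is an illustrative helper, not a function from the source.

```python
def shared_prefix_depth(sid_a: str, sid_b: str, sep: str = "_") -> int:
    """Count how many leading codewords two Semantic IDs have in common."""
    depth = 0
    for a, b in zip(sid_a.split(sep), sid_b.split(sep)):
        if a != b:
            break
        depth += 1
    return depth


# SIDs copied from the table above.
seed_bread = "35_7_119_493"
artisanal = "35_7_120_184"
classic = "35_7_120_185"

print(shared_prefix_depth(seed_bread, artisanal))  # 2: same bread / bakery neighbourhood
print(shared_prefix_depth(artisanal, classic))     # 3: same Italian-bread sub-category
```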
Three load-bearing properties¶
A Semantic ID substrate has three properties that together justify the substrate change away from atomic item IDs:
- Coverage of every item, regardless of history. New items map to existing codewords from day 1. This addresses recsys cold-start for new products without requiring transaction history.
- Generalisation over memorisation. Models trained on Semantic IDs learn over the codeword space, not over individual product IDs — they generalise rather than overfit co-occurrence patterns.
- Embedding-parameter compression. The embedding table sized to the codebook union is much smaller than one sized to the catalog. Instacart reports a 125× reduction in embedding parameter space.
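To see where the compression comes from, compare the two embedding-table sizes. The numbers below (catalog size, number of levels, codewords per level, embedding width) are hypothetical and are not Instacart's configuration; only the shape of the calculation is the point.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """Number of parameters in an embedding table of shape (vocab_size, dim)."""
    return vocab_size * dim


# All values below are hypothetical, chosen only for illustration.
CATALOG_SIZE = 1_000_000     # one row per product when using atomic item IDs
LEVELS = 4                   # codewords per Semantic ID
CODEWORDS_PER_LEVEL = 256    # size of each level's codebook
EMBEDDING_DIM = 64

atomic = embedding_params(CATALOG_SIZE, EMBEDDING_DIM)
# With Semantic IDs the table needs one row per codeword across all levels.
semantic = embedding_params(LEVELS * CODEWORDS_PER_LEVEL, EMBEDDING_DIM)

ratio = atomic / semantic
print(f"atomic={atomic:,} semantic={semantic:,} ratio={ratio:.1f}x")
```

The ratio depends only on catalog size versus total codeword count, so it grows as the catalog grows while the codebook stays fixed.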
Required substrate: hierarchical codebook¶
Semantic IDs require a codebook structure that produces shared
prefixes for semantically similar items. The canonical algorithm is
RQ-VAE (Residual Quantized VAE), which trains
K codebooks where each captures the residual of the previous,
yielding a coarse-to-fine hierarchy: first codeword = coarse
semantic neighbourhood, last codeword = fine-grained distinguisher.
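The coarse-to-fine behaviour can be illustrated with the quantization step alone. This is a sketch under stated assumptions: the codebooks are hand-written toy values rather than learned ones, and the encoder, decoder, and training loop of a real RQ-VAE are omitted.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_index(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sq_dist(entry: Vector) -> float:
        return sum((v - e) ** 2 for v, e in zip(vec, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(vec: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize vec level by level; each level encodes what the previous one missed."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        codes.append(idx)
        # Subtract the chosen codeword so the next level sees only the leftover.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


# Toy 2-D codebooks (hypothetical): level 1 is coarse, level 2 is fine.
CODEBOOKS = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[-1.0, -1.0], [1.0, 1.0]],
]

item_a = [11.0, 10.5]
item_b = [9.2, 9.1]
item_c = [0.5, -0.2]

print(residual_quantize(item_a, CODEBOOKS))  # [1, 1]
print(residual_quantize(item_b, CODEBOOKS))  # [1, 0]
print(residual_quantize(item_c, CODEBOOKS))  # [0, 1]
```

Items `item_a` and `item_b` are close in the input space, so they share the first codeword and differ only at the second level.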
The hierarchical property is load-bearing for generative retrieval: when an autoregressive decoder generates the first codeword, it is choosing a coarse semantic neighbourhood; subsequent codewords narrow within that neighbourhood. At each step, beam search therefore explores a small set of semantically meaningful codewords rather than scoring the entire catalog.
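The decoding loop described above can be sketched as a small beam search over codeword prefixes. Everything here is an assumption made for illustration: `toy_decoder` is a fixed lookup table standing in for a trained autoregressive model, the SIDs are shortened to three levels, and restricting beams to prefixes of known SIDs is a common design choice rather than something the source states.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Prefix = Tuple[int, ...]


def constrained_beam_search(
    valid_sids: Iterable[Prefix],
    score_next: Callable[[Prefix], Dict[int, float]],
    beam_width: int,
    depth: int,
) -> List[Tuple[Prefix, float]]:
    """Keep the beam_width best codeword prefixes at each level."""
    # Every prefix of a known SID is an allowed partial sequence.
    valid_prefixes = {sid[:i] for sid in valid_sids for i in range(1, len(sid) + 1)}
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    for _ in range(depth):
        candidates = []
        for prefix, log_prob in beams:
            for token, token_log_prob in score_next(prefix).items():
                extended = prefix + (token,)
                if extended in valid_prefixes:
                    candidates.append((extended, log_prob + token_log_prob))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Hypothetical next-codeword log-probabilities, keyed by the prefix so far.
TOY_LOG_PROBS: Dict[Prefix, Dict[int, float]] = {
    (): {35: -0.4, 12: -1.2},
    (35,): {7: -0.3, 9: -1.5},
    (12,): {3: -0.1},
    (35, 7): {119: -0.9, 120: -0.5},
    (35, 9): {4: -0.2},
    (12, 3): {8: -0.1},
}


def toy_decoder(prefix: Prefix) -> Dict[int, float]:
    return TOY_LOG_PROBS.get(prefix, {})


VALID_SIDS = [(35, 7, 119), (35, 7, 120), (35, 9, 4), (12, 3, 8)]
results = constrained_beam_search(VALID_SIDS, toy_decoder, beam_width=2, depth=3)
for sid, score in results:
    print(sid, round(score, 2))
```

At every level the search compares only a handful of codewords, never the full list of items.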
Where Semantic IDs sit in the substrate-design space¶
Three discrete vocabulary substrates for recsys retrieval:
| Substrate | Vocabulary size | Prefix structure | Cold-start | Used for |
|---|---|---|---|---|
| Atomic item IDs | Catalog-bounded | None | Hard | Sequence-model scoring (BERT-like CR, GRU4Rec, SASRec) |
| Embeddings (continuous) | N/A — real vectors | N/A | Easy if learned from features | Two-tower / ANN retrieval |
| Semantic IDs | Codebook-bounded | Hierarchical | Easy | Generative retrieval (TIGER lineage) |
Semantic IDs are the bridge between atomic IDs (discrete enough to be the output of an autoregressive decoder, but with a vocabulary bottleneck) and embeddings (rich enough to encode similarity, but continuous, so they need ANN search rather than generation).
What Semantic IDs are NOT¶
- Not unique product identifiers. Multiple products can share an SID. The post-decode mapping layer (e.g. Instacart's retailer-partitioned index) resolves SIDs to specific product candidates.
- Not embeddings. SIDs are discrete codeword sequences that index into learned embeddings inside the consuming model.
- Not surface-specific. A well-designed SID system encodes the catalog once; consuming models on different surfaces (ads, organic, search, post-checkout) can all decode into the same SID space.
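Because an SID is not a unique product key, a lookup step has to follow decoding. The sketch below assumes a plain in-memory dictionary partitioned by retailer; the retailer names, product IDs, and function names are hypothetical and do not describe Instacart's actual index.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Index = Dict[str, Dict[str, List[str]]]


def build_index(rows: Iterable[Tuple[str, str, str]]) -> Index:
    """Group product IDs by retailer, then by SID."""
    index: Index = defaultdict(lambda: defaultdict(list))
    for retailer, sid, product_id in rows:
        index[retailer][sid].append(product_id)
    return index


def resolve(index: Index, retailer: str, decoded_sids: Iterable[str]) -> List[str]:
    """Expand decoded SIDs into the concrete products one retailer carries."""
    candidates: List[str] = []
    for sid in decoded_sids:
        candidates.extend(index.get(retailer, {}).get(sid, []))
    return candidates


# Hypothetical rows: (retailer, SID, product ID).
ROWS = [
    ("retailer_a", "35_7_120_184", "prod_1"),
    ("retailer_a", "35_7_120_184", "prod_2"),  # two products share one SID
    ("retailer_b", "35_7_120_184", "prod_3"),
    ("retailer_a", "35_7_119_493", "prod_4"),
]

index = build_index(ROWS)
print(resolve(index, "retailer_a", ["35_7_120_184"]))  # ['prod_1', 'prod_2']
print(resolve(index, "retailer_b", ["35_7_119_493"]))  # []
```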
Caveats¶
- Codebook stability across re-training is a non-trivial concern that is not yet well covered in published work. If product X's SID changes between codebook versions, downstream consumers can suffer version skew.
- The first-codeword distribution carries disproportionate weight: if the first codebook encodes coarse semantic neighbourhoods unevenly (e.g. one codeword covers half the catalog), beam-search exploration is structurally biased.
- Multi-level codebook design is a live research direction — e.g. multi-resolution codebooks, encoding constraints (dietary, brand) in the first codeword level, contrastive regularisation for co-purchased items.
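The first two caveats lend themselves to simple monitoring checks. The following is a minimal sketch with made-up data; the function names, SIDs, and product IDs are hypothetical, and any alerting threshold would be a separate decision.

```python
from collections import Counter
from typing import Dict, Iterable, List


def first_codeword_shares(sids: Iterable[str], sep: str = "_") -> Dict[str, float]:
    """Fraction of the catalog assigned to each first-level codeword."""
    counts = Counter(sid.split(sep)[0] for sid in sids)
    total = sum(counts.values())
    return {codeword: n / total for codeword, n in counts.most_common()}


def changed_sids(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Products present in both codebook versions whose SID differs."""
    return sorted(p for p in old.keys() & new.keys() if old[p] != new[p])


# Hypothetical catalog: three of four items fall under first codeword "35".
catalog_sids = ["35_7_1", "35_7_2", "35_9_1", "12_3_8"]
shares = first_codeword_shares(catalog_sids)
print(shares)

# Hypothetical SID assignments from two codebook versions.
version_1 = {"prod_x": "35_7_120_184", "prod_y": "35_7_119_493"}
version_2 = {"prod_x": "35_8_002_184", "prod_y": "35_7_119_493"}
skewed = changed_sids(version_1, version_2)
print(skewed)
```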
How they're trained: contrastive loss on catalog structure¶
Vanilla RQ-VAE optimizes only reconstruction fidelity; it has no notion of which products should end up near each other. Without structural guidance, it produces fragmented codes (substitutes split into different branches) and error propagation (products with sparse text land badly, and the codebook then compresses that bad placement).
The fix (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale): add a contrastive loss term that uses the catalog taxonomy as graded supervision. Pair labels:
| Pair relationship | Label |
|---|---|
| Same leaf | Strong positive |
| Sibling leaf, shared parent | Moderate positive |
| No shared ancestor | Negative |
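The graded labels in the table can be derived from taxonomy paths. This is a sketch under stated assumptions: each product carries a root-to-leaf path, the example paths are hypothetical, and pairs the table does not cover (for example a shared grandparent but different parents) return `None` rather than a guessed label.

```python
from typing import Optional, Tuple

Path = Tuple[str, ...]


def pair_label(path_a: Path, path_b: Path) -> Optional[str]:
    """Map two non-empty root-to-leaf taxonomy paths to a graded pair label."""
    if path_a == path_b:
        return "strong_positive"      # same leaf
    if len(path_a) > 1 and path_a[:-1] == path_b[:-1]:
        return "moderate_positive"    # sibling leaves under one parent
    if path_a[0] != path_b[0]:
        return "negative"             # no shared ancestor at all
    return None                       # relationship not covered by the table


# Hypothetical taxonomy paths.
italian = ("Bakery", "Bread", "Italian Bread")
sliced = ("Bakery", "Bread", "Sliced Bread")
croissant = ("Bakery", "Pastry", "Croissants")
sparkling = ("Beverages", "Water", "Sparkling Water")

print(pair_label(italian, italian))    # strong_positive
print(pair_label(italian, sliced))     # moderate_positive
print(pair_label(italian, sparkling))  # negative
print(pair_label(italian, croissant))  # None
```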
Loss formula: L_total = L_reconstruction + L_rq + λ · L_contrastive
with λ = 0.01 — "a gentle regularizer: strong enough to improve
coherence, weak enough not to destabilize reconstruction." Coarser
levels (L1, L2) are weighted more heavily within the contrastive term.
Companion sampler: hierarchical batch sampling (pick parent → ~half
batch from its children → rest from unrelated → multi-sample within
each slot). (See
concepts/contrastive-regularization-with-catalog-structure +
concepts/hierarchical-batch-sampling-for-contrastive-loss +
patterns/contrastive-loss-via-taxonomy-tree.)
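One possible reading of the sampler described above is sketched here. It assumes a two-level taxonomy in which a "slot" is a leaf category, "unrelated" means a leaf under a different parent, and products are drawn with replacement; the source does not spell out these details, and all category and product names are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def sample_batch(
    products_by_leaf: Dict[str, List[str]],
    parent_of: Dict[str, str],
    n_slots: int,
    per_slot: int,
    rng: random.Random,
) -> Tuple[str, List[str]]:
    """Pick a parent, fill about half the slots from its leaves, the rest from others."""
    parent = rng.choice(sorted(set(parent_of.values())))
    related = [leaf for leaf in products_by_leaf if parent_of[leaf] == parent]
    unrelated = [leaf for leaf in products_by_leaf if parent_of[leaf] != parent]
    n_related = n_slots // 2
    slots = [rng.choice(related) for _ in range(n_related)]
    slots += [rng.choice(unrelated) for _ in range(n_slots - n_related)]
    batch: List[str] = []
    for leaf in slots:
        # Several products per slot, so same-leaf positives appear in the batch.
        batch.extend(rng.choices(products_by_leaf[leaf], k=per_slot))
    return parent, batch


# Hypothetical taxonomy and products.
PARENT_OF = {
    "Italian Bread": "Bread",
    "Sliced Bread": "Bread",
    "Sparkling Water": "Water",
    "Still Water": "Water",
}
PRODUCTS_BY_LEAF = {
    "Italian Bread": ["artisanal_italian", "classic_italian"],
    "Sliced Bread": ["thin_sliced_seed", "white_sliced"],
    "Sparkling Water": ["lime_sparkling", "plain_sparkling"],
    "Still Water": ["spring_water", "purified_water"],
}

parent, batch = sample_batch(PRODUCTS_BY_LEAF, PARENT_OF, n_slots=4, per_slot=2,
                             rng=random.Random(0))
print(parent, batch)
```

The first half of the batch comes from the chosen parent's leaves and supplies the positive pairs; the second half supplies negatives.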
The catalog-tree-supervision choice is explicit cold-start logic: "using our catalog taxonomy as the supervision signal rather than engagement data (which isn't available for cold-start products)."
Two flavors via different upstream embeddings¶
The same RQ-VAE + contrastive loss + catalog supervision can be trained against different upstream embeddings, producing two distinct cluster characters:
- Precision flavor — domain-specific embedding (e.g. Instacart's ESCI search-relevance model) → tight substitute clusters → substitution / search / reordering.
- Discovery flavor — LLM-cleaned attributes (Gemini-Flash-extracted) + off-the-shelf embedding (Gemma) → broader thematic clusters → homepage feeds / cross-sell / exploration.
The architectural insight: "The embedding is the decision. The RQ-VAE compresses whatever structure the embedding space gives it. Choose your embedding based on the business problem." (See concepts/precision-vs-discovery-codebook-flavor + patterns/two-flavor-codebook-precision-vs-discovery + patterns/llm-attribute-extraction-before-embedding.)
Intrinsic evaluation¶
Three complementary metrics evaluate codebooks directly, not just via downstream task metrics:
| Metric | What it measures |
|---|---|
| Similarity-depth correlation | Hierarchy faithfulness; Spearman ρ between embedding similarity and shared-prefix depth |
| LLM-based cluster evaluation | Functional coherence + purchase likelihood + customer journey relevance |
| Taxonomy alignment | Whether shared-L1 products share a top-level category (and disagreements as audit signal) |
Quote: "Evaluate codes directly. Downstream metrics can mask systematic quality problems. Intrinsic evaluation catches issues before they compound." (See patterns/intrinsic-evaluation-of-discrete-codes.)
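The similarity-depth correlation can be computed with a rank correlation over item pairs. The sketch below implements Spearman's ρ in plain Python (average ranks for ties, then Pearson on the ranks) and assumes non-constant inputs; the pair data is made up, and a library routine such as `scipy.stats.spearmanr` would normally replace the hand-rolled version.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical item pairs: embedding cosine similarity and shared-prefix depth.
similarities = [0.95, 0.80, 0.60, 0.30, 0.10]
prefix_depths = [3, 3, 2, 1, 0]

rho = spearman(similarities, prefix_depths)
print(f"similarity-depth Spearman rho = {rho:.3f}")
```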
Catalog-audit dual-use¶
When SIDs disagree with taxonomy labels, the label is often wrong. Examples: a Protein Bar mis-filed under Candy clusters with Sports Nutrition; a Sparkling Water mis-filed under Soda clusters with sparkling waters. Quote: "What started as a recommendation primitive is becoming infrastructure for ongoing catalog health." (See concepts/code-vs-label-mismatch-as-catalog-audit + patterns/semantic-code-as-catalog-audit.)
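A simple way to surface such disagreements is to compare each product's label with the majority label of its SID cluster. This is a sketch assuming clusters are defined by the first codeword; the SIDs, product names, and `audit_candidates` helper are hypothetical, and flagged items are candidates for human review rather than confirmed errors.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def audit_candidates(
    products: Iterable[Tuple[str, str, str]],
    prefix_len: int = 1,
    sep: str = "_",
) -> List[Tuple[str, str, str]]:
    """Return (name, current_label, cluster_majority_label) for outlier products."""
    clusters: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for name, sid, label in products:
        prefix = sep.join(sid.split(sep)[:prefix_len])
        clusters[prefix].append((name, label))
    flagged = []
    for members in clusters.values():
        # On a tie, most_common keeps the label encountered first.
        majority, _ = Counter(label for _, label in members).most_common(1)[0]
        flagged.extend(
            (name, label, majority) for name, label in members if label != majority
        )
    return flagged


# Hypothetical rows: (product name, SID, taxonomy label).
PRODUCTS = [
    ("Whey Protein Powder", "41_2_7", "Sports Nutrition"),
    ("Energy Gel", "41_2_9", "Sports Nutrition"),
    ("Protein Bar", "41_3_1", "Candy"),
    ("Gummy Bears", "18_4_2", "Candy"),
    ("Chocolate Bar", "18_4_5", "Candy"),
]

flagged = audit_candidates(PRODUCTS)
print(flagged)  # [('Protein Bar', 'Candy', 'Sports Nutrition')]
```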
Seen in¶
- sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale — deep-companion disclosure: SID generation methodology with catalog-tree contrastive regularization, hierarchical batch sampling, two-flavor design (ESCI precision + ESCI+Gemma discovery), intrinsic evaluation suite (similarity-depth correlation 0.69–0.84, LLM cluster evaluation, taxonomy alignment), catalog-audit dual-use, and concrete vocabulary cardinality (~2,000 codeword tokens for the entire catalog at Instacart).
- sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — first canonical wiki disclosure: Instacart Semantic IDs (SIDs) as the production vocabulary substrate for the new generative ads retrieval model.
Related¶
- concepts/generative-retrieval — the paradigm Semantic IDs enable.
- concepts/atomic-product-id-vs-semantic-id — the substrate trade-off canonicalised.
- concepts/vocabulary-bottleneck — the failure mode SIDs solve.
- concepts/cold-start — recsys cold-start axis SIDs solve via codebook coverage.
- concepts/beam-search-retrieval — the inference primitive that consumes SIDs.
- systems/instacart-semantic-ids — production instance.
- systems/rq-vae — the algorithm that produces SIDs.
- systems/tiger-generative-retrieval — the reference paper.
- patterns/rq-vae-codebook-as-product-vocabulary — canonical pattern.