CONCEPT Cited by 2 sources

Semantic ID¶

Definition¶

A Semantic ID is a discrete-token identifier for a recommendable item, encoded as a short sequence of codewords from a learned hierarchical codebook, where semantically similar items share codeword prefixes. Semantic IDs are the vocabulary substrate that makes generative retrieval economical and structurally sensible.

The canonical example (Source: sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart):

SID	Product
`35_7_119_493`	Organic Good Seed Thin Sliced
`35_7_120_184`	Artisanal Italian Bread
`35_7_120_185`	Classic Italian Bread

Shared 35_7_… prefix = bread / bakery semantic neighbourhood. Shared 35_7_120_… prefix = Italian-bread sub-category.
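The prefix relationship in the table can be checked mechanically. The following is a minimal sketch, assuming SIDs are written as underscore-delimited integer codewords as shown above; the helper names `parse_sid` and `shared_prefix_depth` are illustrative and do not come from the source.

```python
def parse_sid(sid: str) -> tuple[int, ...]:
    """Split an underscore-delimited SID string into integer codewords."""
    return tuple(int(part) for part in sid.split("_"))


def shared_prefix_depth(a: str, b: str) -> int:
    """Count how many leading codewords two SIDs have in common."""
    depth = 0
    for x, y in zip(parse_sid(a), parse_sid(b)):
        if x != y:
            break
        depth += 1
    return depth


seed_bread = "35_7_119_493"  # Organic Good Seed Thin Sliced
artisanal = "35_7_120_184"   # Artisanal Italian Bread
classic = "35_7_120_185"     # Classic Italian Bread

print(shared_prefix_depth(seed_bread, artisanal))  # 2
print(shared_prefix_depth(artisanal, classic))     # 3
```

A deeper shared prefix corresponds to a narrower semantic neighbourhood: depth 2 here is the bread / bakery level, depth 3 the Italian-bread level.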

Three load-bearing properties¶

A Semantic ID substrate has three properties that together justify the substrate change away from atomic item IDs:

Coverage of every item, regardless of history. New items map to existing codewords from day one. This addresses recsys cold-start for new products without requiring transaction history.
Generalisation over memorisation. Models trained on Semantic IDs learn over the codeword space, not over individual product IDs — they generalise rather than overfit co-occurrence patterns.
Embedding-parameter compression. The embedding table sized to the codebook union is much smaller than one sized to the catalog. Instacart reports a 125× reduction in embedding parameter space.
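The compression property comes down to simple arithmetic: with the same embedding width, the ratio of table sizes equals catalog size divided by total codeword count. The sketch below uses made-up sizes (one million items, four levels of 256 codewords, 64-dimensional embeddings) purely for illustration; they are not Instacart's figures and do not reproduce the reported 125× number.

```python
# Hypothetical sizes, chosen only to illustrate the arithmetic.
catalog_items = 1_000_000    # one embedding row per atomic item ID
levels = 4                   # codewords per Semantic ID
codewords_per_level = 256    # size of each level's codebook
embedding_dim = 64           # same width for both tables

atomic_params = catalog_items * embedding_dim

# "Codebook union": one embedding row per codeword across all levels.
sid_vocab = levels * codewords_per_level
sid_params = sid_vocab * embedding_dim

ratio = atomic_params / sid_params
print(f"atomic table: {atomic_params:,} parameters")
print(f"SID table:    {sid_params:,} parameters")
print(f"reduction:    {ratio:.1f}x")
```

The embedding width cancels out, so the reduction depends only on how many items the catalog holds relative to the size of the codeword vocabulary.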

Required substrate: hierarchical codebook¶

Semantic IDs require a codebook structure that produces shared prefixes for semantically similar items. The canonical algorithm is RQ-VAE (Residual-Quantized VAE), which trains K codebooks in sequence, each quantizing the residual left by the previous level. The result is a coarse-to-fine hierarchy: the first codeword selects a coarse semantic neighbourhood, and the last codeword is a fine-grained distinguisher.
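To make the "each level captures the residual of the previous" idea concrete, here is a minimal sketch of the quantization step alone. It assumes the codebooks are already trained and uses tiny hand-written 2-D codebooks; the encoder, decoder, and training losses of a full RQ-VAE are omitted, and the function names are illustrative.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def residual_quantize(vec, codebooks):
    """Quantize vec level by level; each level encodes what the previous left over."""
    codes = []
    residual = list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword so the next level sees only the remainder.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


# Hand-written toy codebooks (assumed, not learned): coarse first, fine second.
codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],                            # level 1: coarse
    [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]],  # level 2: fine
]

codes, residual = residual_quantize([11.2, 9.1], codebooks)
print(codes)  # [1, 1]
```

The first code places the vector near (10, 10); the second corrects the remaining offset of roughly (1.2, -0.9). Two items that share the first code but differ in the second are close at the coarse level and separated only by the fine one.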

The hierarchical property is load-bearing for generative retrieval: when an autoregressive decoder generates the first codeword, it is choosing a coarse semantic neighbourhood; subsequent codewords narrow within that neighbourhood. Beam search at each step therefore explores semantically meaningful options within the chosen neighbourhood rather than searching across the entire catalog.
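The narrowing behaviour can be sketched with a toy beam search. This is an illustration under several assumptions: `toy_log_prob` is a hypothetical stand-in for the decoder's next-codeword log-probability, the fourth SID is invented to give the search a second neighbourhood, and restricting expansion to prefixes that exist in the catalog is a choice made for this sketch rather than something the source describes.

```python
import math

# Three SIDs from the example table plus one invented SID in another neighbourhood.
valid_sids = {
    (35, 7, 119, 493),
    (35, 7, 120, 184),
    (35, 7, 120, 185),
    (12, 3, 40, 7),  # hypothetical
}


def valid_next(prefix):
    """Codewords that extend `prefix` toward at least one catalog SID."""
    n = len(prefix)
    return sorted({sid[n] for sid in valid_sids if sid[:n] == prefix})


def toy_log_prob(prefix, codeword):
    """Hypothetical decoder score: favours the path to 35_7_120_185."""
    preferred = (35, 7, 120, 185)
    return math.log(0.7) if preferred[len(prefix)] == codeword else math.log(0.1)


def beam_search(beam_width, sid_length=4):
    """Keep the `beam_width` best partial SIDs after each codeword position."""
    beams = [((), 0.0)]
    for _ in range(sid_length):
        candidates = [
            (prefix + (cw,), score + toy_log_prob(prefix, cw))
            for prefix, score in beams
            for cw in valid_next(prefix)
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams


for sid, score in beam_search(beam_width=2):
    print("_".join(map(str, sid)), round(score, 3))
```

With a beam width of 2, the unrelated `12_…` branch is dropped by the third position, and both surviving beams end inside the `35_7_120_…` Italian-bread neighbourhood.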

Where Semantic IDs sit in the substrate-design space¶

Three discrete vocabulary substrates for recsys retrieval:

Substrate	Vocabulary size	Prefix structure	Cold-start	Used for
Atomic item IDs	Catalog-bounded	None	Hard	Sequence-model scoring (BERT-like CR, GRU4Rec, SASRec)
Embeddings (continuous)	N/A — real vectors	N/A	Easy if learned from features	Two-tower / ANN retrieval
Semantic IDs	Codebook-bounded	Hierarchical	Easy	Generative retrieval (TIGER lineage)

Semantic IDs bridge atomic IDs (discrete enough to be the output of an autoregressive decoder, but subject to a vocabulary bottleneck) and embeddings (rich enough to encode similarity, but continuous, so they need ANN search rather than generation).

What Semantic IDs are NOT¶

Not unique product identifiers. Multiple products can share an SID. The post-decode mapping layer (e.g. Instacart's retailer-partitioned index) resolves SIDs to specific product candidates.
Not embeddings. SIDs are discrete codeword sequences that index into learned embeddings inside the consuming model.
Not surface-specific. A well-designed SID system encodes the catalog once; consuming models on different surfaces (ads, organic, search, post-checkout) can all decode into the same SID space.
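Because several products can share one SID, a decoded SID has to be resolved to concrete candidates. The sketch below shows one plausible shape for such a mapping layer, keyed by retailer and SID. The source mentions a retailer-partitioned index but gives no implementation details, so the data structure, function names, retailer IDs, and product IDs here are all hypothetical.

```python
from collections import defaultdict

# (retailer_id, sid) -> list of product IDs carried by that retailer.
index = defaultdict(list)


def register(retailer_id, sid, product_id):
    """Add a product under its SID within one retailer's partition."""
    index[(retailer_id, sid)].append(product_id)


def resolve(retailer_id, decoded_sids):
    """Expand decoded SIDs into product candidates for one retailer."""
    candidates = []
    for sid in decoded_sids:
        # .get avoids creating empty entries for SIDs the retailer lacks.
        candidates.extend(index.get((retailer_id, sid), []))
    return candidates


# Hypothetical data: two products share one SID at retailer "r1".
register("r1", "35_7_120_185", "prod-001")
register("r1", "35_7_120_185", "prod-002")
register("r2", "35_7_120_185", "prod-900")

print(resolve("r1", ["35_7_120_185", "35_7_119_493"]))
```

The decoder only needs to emit valid SIDs; which physical products those correspond to is decided after decoding, per retailer.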

Caveats¶

Codebook stability across re-training is a non-trivial concern that is not yet well covered in published work. If product X's SID changes between codebook versions, downstream consumers can suffer version skew.
The first-codeword distribution carries disproportionate weight: if the first codebook encodes coarse semantic neighbourhoods unevenly (e.g. one codeword covers half the catalog), beam-search exploration is structurally biased.
Multi-level codebook design is a live research direction — e.g. multi-resolution codebooks, encoding constraints (dietary, brand) in the first codeword level, contrastive regularisation for co-purchased items.
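The first-codeword skew mentioned above is easy to measure. The following is a rough sketch that reports the largest share held by any single first codeword and the entropy of the first-codeword distribution; the function name and the toy SIDs are assumptions for illustration, not a metric taken from the source.

```python
import math
from collections import Counter


def first_codeword_stats(sids):
    """Return (largest share, entropy in bits) of the first-codeword distribution."""
    counts = Counter(sid[0] for sid in sids)
    total = sum(counts.values())
    shares = [count / total for count in counts.values()]
    entropy = -sum(p * math.log2(p) for p in shares)
    return max(shares), entropy


# Toy catalog: three of four items fall under first codeword 35.
toy_sids = [(35, 7), (35, 8), (35, 9), (12, 1)]
max_share, entropy = first_codeword_stats(toy_sids)
print(f"largest first-codeword share: {max_share:.2f}")
print(f"entropy: {entropy:.3f} bits")
```

A largest share near 1.0, or an entropy far below log2 of the codebook size, suggests that the first decoding step does little to narrow the search.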

How they're trained: contrastive loss on catalog structure¶

Vanilla RQ-VAE optimizes only reconstruction fidelity; it has no notion of which products should end up near each other. Without structural guidance, it produces fragmented codes (substitutes split into different branches) and error propagation (products with sparse text are embedded poorly, and the codebook then compresses that bad placement).

The fix (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale): add a contrastive loss term that uses the catalog taxonomy as graded supervision. Pair labels:

Pair relationship	Label
Same leaf	Strong positive
Sibling leaf, shared parent	Moderate positive
No shared ancestor	Negative

Loss formula: L_total = L_reconstruction + L_rq + λ · L_contrastive with λ = 0.01, described as "a gentle regularizer: strong enough to improve coherence, weak enough not to destabilize reconstruction." Within the contrastive term, coarser levels (L1, L2) are weighted more heavily. The companion sampler is hierarchical batch sampling: pick a parent category, draw roughly half the batch from its children, fill the rest from unrelated categories, and take multiple samples within each slot. (See concepts/contrastive-regularization-with-catalog-structure + concepts/hierarchical-batch-sampling-for-contrastive-loss + patterns/contrastive-loss-via-taxonomy-tree.)
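The pair-label table and the loss combination can be sketched as follows. This assumes each product carries a root-to-leaf taxonomy path and that "no shared ancestor" means the top-level categories differ. Pairs that share only a more distant ancestor are not covered by the table, so the sketch returns `None` for them. The function names and category strings are illustrative, and the per-level weighting inside the contrastive term is not modelled.

```python
def pair_label(path_a, path_b):
    """Graded label for two products given their root-to-leaf taxonomy paths."""
    if path_a == path_b:
        return "strong_positive"    # same leaf
    if path_a[0] != path_b[0]:
        return "negative"           # no shared ancestor
    if path_a[:-1] == path_b[:-1]:
        return "moderate_positive"  # sibling leaves under one parent
    return None                     # relationship not specified in the table


def total_loss(reconstruction, rq, contrastive, lam=0.01):
    """L_total = L_reconstruction + L_rq + lambda * L_contrastive."""
    return reconstruction + rq + lam * contrastive


italian = ("Bakery", "Bread", "Italian Bread")
sliced = ("Bakery", "Bread", "Sliced Bread")
soda = ("Beverages", "Soft Drinks", "Soda")

print(pair_label(italian, italian))  # strong_positive
print(pair_label(italian, sliced))   # moderate_positive
print(pair_label(italian, soda))     # negative
```

With λ = 0.01, a contrastive loss of 2.0 adds only 0.02 to the total, which matches the "gentle regularizer" framing.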

The catalog-tree-supervision choice is explicit cold-start logic: "using our catalog taxonomy as the supervision signal rather than engagement data (which isn't available for cold-start products)."

Two flavors via different upstream embeddings¶

The same RQ-VAE + contrastive loss + catalog supervision can be trained against different upstream embeddings, producing two distinct cluster characters:

Precision flavor — domain-specific embedding (e.g. Instacart's ESCI search-relevance model) → tight substitute clusters → substitution / search / reordering.
Discovery flavor — LLM-cleaned attributes (Gemini-Flash-extracted) + off-the-shelf embedding (Gemma) → broader thematic clusters → homepage feeds / cross-sell / exploration.

The architectural insight: "The embedding is the decision. The RQ-VAE compresses whatever structure the embedding space gives it. Choose your embedding based on the business problem." (See concepts/precision-vs-discovery-codebook-flavor + patterns/two-flavor-codebook-precision-vs-discovery + patterns/llm-attribute-extraction-before-embedding.)

Intrinsic evaluation¶

Three complementary metrics evaluate codebooks directly, not just via downstream task metrics:

Metric	What it measures
Similarity-depth correlation	Hierarchy faithfulness; Spearman ρ between embedding similarity and shared-prefix depth
LLM-based cluster evaluation	Functional coherence + purchase likelihood + customer journey relevance
Taxonomy alignment	Whether shared-L1 products share a top-level category (and disagreements as audit signal)

Quote: "Evaluate codes directly. Downstream metrics can mask systematic quality problems. Intrinsic evaluation catches issues before they compound." (See patterns/intrinsic-evaluation-of-discrete-codes.)
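The similarity-depth correlation can be sketched in plain Python: for every item pair, compute embedding cosine similarity and shared-prefix depth, then take Spearman's ρ (Pearson correlation of average ranks) between the two lists. The four toy items, their embeddings, and all helper names below are assumptions for illustration; the source does not describe how pairs are sampled or how the metric is implemented.

```python
import math
from itertools import combinations


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def prefix_depth(a, b):
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


# Toy items: (SID, embedding). Values are invented for illustration.
items = [
    ((1, 1, 1), [1.0, 0.0]),
    ((1, 1, 2), [0.9, 0.1]),
    ((1, 2, 3), [0.6, 0.4]),
    ((2, 1, 1), [0.0, 1.0]),
]

pairs = list(combinations(items, 2))
similarities = [cosine(a[1], b[1]) for a, b in pairs]
depths = [prefix_depth(a[0], b[0]) for a, b in pairs]
rho = spearman(similarities, depths)
print(f"similarity-depth Spearman rho: {rho:.3f}")
```

Average ranks matter here because prefix depth takes only a handful of distinct values, so ties are the norm rather than the exception.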

Catalog-audit dual-use¶

When SIDs disagree with taxonomy labels, the label is often wrong. Examples: a Protein Bar mis-filed under Candy clusters with Sports Nutrition; a Sparkling Water mis-filed under Soda clusters with sparkling waters. Quote: "What started as a recommendation primitive is becoming infrastructure for ongoing catalog health." (See concepts/code-vs-label-mismatch-as-catalog-audit + patterns/semantic-code-as-catalog-audit.)
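One way to surface such disagreements is to group products by first codeword, find the dominant taxonomy label in each group, and flag the products that carry a different label. The sketch below is an assumption about how an audit might be wired up, not the method described in the source; the `min_agreement` threshold, field names, and toy records are hypothetical.

```python
from collections import Counter, defaultdict


def flag_label_mismatches(items, min_agreement=0.6):
    """Names of items whose category differs from their L1 group's dominant one."""
    by_code = defaultdict(list)
    for item in items:
        by_code[item["sid"][0]].append(item)

    flagged = []
    for group in by_code.values():
        label, count = Counter(i["category"] for i in group).most_common(1)[0]
        # Skip groups with no clear majority: the code itself may be incoherent.
        if count / len(group) < min_agreement:
            continue
        flagged.extend(i["name"] for i in group if i["category"] != label)
    return flagged


# Toy records echoing the Protein Bar example; SIDs are invented.
catalog = [
    {"name": "Whey Powder", "sid": (8, 1), "category": "Sports Nutrition"},
    {"name": "Energy Gel", "sid": (8, 2), "category": "Sports Nutrition"},
    {"name": "Recovery Shake", "sid": (8, 3), "category": "Sports Nutrition"},
    {"name": "Protein Bar", "sid": (8, 4), "category": "Candy"},
    {"name": "Gummy Bears", "sid": (3, 1), "category": "Candy"},
    {"name": "Lollipop", "sid": (3, 2), "category": "Candy"},
]

print(flag_label_mismatches(catalog))  # ['Protein Bar']
```

Flagged items are candidates for human review rather than automatic relabelling, since the disagreement may also point to a poorly placed code.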

Seen in¶

sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale — deep-companion disclosure: SID generation methodology with catalog-tree contrastive regularization, hierarchical batch sampling, two-flavor design (ESCI precision + ESCI+Gemma discovery), intrinsic evaluation suite (similarity-depth correlation 0.69–0.84, LLM cluster evaluation, taxonomy alignment), catalog-audit dual-use, and concrete vocabulary cardinality (~2,000 codeword tokens for the entire catalog at Instacart).
sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — first canonical wiki disclosure: Instacart Semantic IDs (SIDs) as the production vocabulary substrate for the new generative ads retrieval model.

Related¶

concepts/generative-retrieval — the paradigm Semantic IDs enable.
concepts/atomic-product-id-vs-semantic-id — the substrate trade-off canonicalised.
concepts/vocabulary-bottleneck — the failure mode SIDs solve.
concepts/cold-start — recsys cold-start axis SIDs solve via codebook coverage.
concepts/beam-search-retrieval — the inference primitive that consumes SIDs.
systems/instacart-semantic-ids — production instance.
systems/rq-vae — the algorithm that produces SIDs.
systems/tiger-generative-retrieval — the reference paper.
patterns/rq-vae-codebook-as-product-vocabulary — canonical pattern.