
CONCEPT Cited by 2 sources

Semantic ID

Definition

A Semantic ID is a discrete-token identifier for a recommendable item, encoded as a short sequence of codewords from a learned hierarchical codebook, where semantically similar items share codeword prefixes. Semantic IDs are the vocabulary substrate that makes generative retrieval economical and structurally sensible.

The canonical example (Source: sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart):

SID | Product
35_7_119_493 | Organic Good Seed Thin Sliced
35_7_120_184 | Artisanal Italian Bread
35_7_120_185 | Classic Italian Bread

Shared 35_7_… prefix = bread / bakery semantic neighbourhood. Shared 35_7_120_… prefix = Italian-bread sub-category.
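The prefix relationship in the example can be checked mechanically. The following is a minimal sketch; the helper name shared_prefix_depth and the underscore-separated string format are assumptions taken from how the SIDs are printed above, not a documented API.

```python
def shared_prefix_depth(sid_a: str, sid_b: str, sep: str = "_") -> int:
    """Count how many leading codewords two Semantic IDs have in common."""
    depth = 0
    for code_a, code_b in zip(sid_a.split(sep), sid_b.split(sep)):
        if code_a != code_b:
            break
        depth += 1
    return depth


seeded = "35_7_119_493"     # Organic Good Seed Thin Sliced
artisanal = "35_7_120_184"  # Artisanal Italian Bread
classic = "35_7_120_185"    # Classic Italian Bread

print(shared_prefix_depth(seeded, artisanal))   # 2: same bread / bakery neighbourhood
print(shared_prefix_depth(artisanal, classic))  # 3: same Italian-bread sub-category
```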

Three load-bearing properties

A Semantic ID substrate has three properties that together justify the substrate change away from atomic item IDs:

  1. Coverage of every item, regardless of history. New items map to existing codewords from day one, which addresses the recsys cold-start problem for new products without requiring transaction history.
  2. Generalisation over memorisation. Models trained on Semantic IDs learn over the codeword space, not over individual product IDs, so they generalise rather than overfit to co-occurrence patterns.
  3. Embedding-parameter compression. An embedding table sized to the union of the codebooks is much smaller than one sized to the catalog. Instacart reports a 125× reduction in embedding parameter space.
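The compression claim in the third property is simple arithmetic: one embedding row per catalog item versus one row per codeword across all codebook levels. The sketch below uses made-up sizes purely for illustration; they are not Instacart's configuration and are not meant to reproduce the reported 125× figure.

```python
def embedding_params(num_rows: int, dim: int) -> int:
    """Parameters in a dense embedding table with num_rows rows of width dim."""
    return num_rows * dim


def compression_ratio(catalog_size: int, num_levels: int,
                      codebook_size: int, dim: int) -> float:
    """Ratio of an atomic-ID table to a table covering the codebook union."""
    atomic = embedding_params(catalog_size, dim)
    semantic = embedding_params(num_levels * codebook_size, dim)
    return atomic / semantic


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
ratio = compression_ratio(catalog_size=1_000_000, num_levels=4,
                          codebook_size=1_000, dim=64)
print(ratio)  # 250.0
```

Note that the embedding width cancels out, so the ratio depends only on the catalog size and the total number of codewords.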

Required substrate: hierarchical codebook

Semantic IDs require a codebook structure that produces shared prefixes for semantically similar items. The canonical algorithm is RQ-VAE (Residual-Quantized VAE), which trains K codebooks, each quantizing the residual left by the previous one. The result is a coarse-to-fine hierarchy: first codeword = coarse semantic neighbourhood, last codeword = fine-grained distinguisher.
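The residual step can be sketched in isolation. The code below covers only the quantization of an already-encoded vector against fixed codebooks; it does not train anything, and the function name, the toy 2-D codebooks, and the nearest-neighbour rule by Euclidean distance are illustrative assumptions.

```python
import math


def residual_quantize(vector, codebooks):
    """Greedy residual quantization: one codeword index per codebook level."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        # Pick the codeword closest to whatever is still unexplained.
        index = min(range(len(codebook)),
                    key=lambda i: math.dist(codebook[i], residual))
        codes.append(index)
        # Subtract it, so the next level only models the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes, residual


# Hypothetical two-level codebooks: level 0 is coarse, level 1 refines.
coarse = [[0.0, 0.0], [10.0, 10.0]]
fine = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

codes, leftover = residual_quantize([11.0, 10.1], [coarse, fine])
print(codes)  # [1, 1]
```

Two vectors that land on the same coarse codeword share the first element of their code sequence, which is where the shared-prefix property comes from.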

The hierarchical property is load-bearing for generative retrieval: when an autoregressive decoder generates the first codeword, it is choosing a coarse semantic neighbourhood; subsequent codewords narrow within that neighbourhood. At each step, beam search explores semantically meaningful options rather than searching across the entire catalog.
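A toy beam search makes the narrowing visible. In this sketch the decoder is replaced by a hypothetical scoring function (toy_score), and generation is restricted to prefixes that exist in the catalog via a trie; both choices are assumptions made for illustration, not details confirmed by the source.

```python
def build_trie(sids):
    """Nested dicts keyed by codeword, one level per SID position."""
    trie = {}
    for sid in sids:
        node = trie
        for code in sid:
            node = node.setdefault(code, {})
    return trie


def beam_search(trie, score_next, depth, beam_width):
    """Keep the beam_width best prefixes at each codeword position."""
    beams = [((), 0.0, trie)]  # (prefix, cumulative log-score, trie node)
    for _ in range(depth):
        candidates = []
        for prefix, score, node in beams:
            for code, child in node.items():
                candidates.append(
                    (prefix + (code,), score + score_next(prefix, code), child))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return [prefix for prefix, _, _ in beams]


def toy_score(prefix, code):
    # Stand-in for a decoder log-probability: prefers codewords near 120.
    return -abs(code - 120) / 100.0


catalog = [(35, 7, 119, 493), (35, 7, 120, 184),
           (35, 7, 120, 185), (12, 3, 40, 8)]
results = beam_search(build_trie(catalog), toy_score, depth=4, beam_width=2)
print(results)  # [(35, 7, 120, 184), (35, 7, 120, 185)]
```

Once the beam commits to the 35 and 7 codewords, every later expansion stays inside that neighbourhood, so the final candidates are close substitutes.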

Where Semantic IDs sit in the substrate-design space

Three discrete vocabulary substrates for recsys retrieval:

Substrate | Vocabulary size | Prefix structure | Cold-start | Used for
Atomic item IDs | Catalog-bounded | None | Hard | Sequence-model scoring (BERT-like CR, GRU4Rec, SASRec)
Embeddings (continuous) | N/A (real vectors) | N/A | Easy if learned from features | Two-tower / ANN retrieval
Semantic IDs | Codebook-bounded | Hierarchical | Easy | Generative retrieval (TIGER lineage)

Semantic IDs bridge atomic IDs (discrete enough to be the output of an autoregressive decoder, but with a vocabulary bottleneck) and embeddings (rich enough to encode similarity, but continuous, so they need ANN search rather than generation).

What Semantic IDs are NOT

  • Not unique product identifiers. Multiple products can share an SID. The post-decode mapping layer (e.g. Instacart's retailer-partitioned index) resolves SIDs to specific product candidates.
  • Not embeddings. SIDs are discrete codeword sequences that index into learned embeddings inside the consuming model.
  • Not surface-specific. A well-designed SID system encodes the catalog once; consuming models on different surfaces (ads, organic, search, post-checkout) can all decode into the same SID space.
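Because several products can share one SID, a decoded SID still has to be resolved to concrete candidates. The class below is a hypothetical sketch of such a post-decode mapping layer; the name SidIndex, its methods, and the retailer and product identifiers are all invented for illustration and do not describe Instacart's index.

```python
from collections import defaultdict


class SidIndex:
    """Hypothetical retailer-partitioned lookup from SID to product IDs."""

    def __init__(self):
        self._by_retailer = defaultdict(lambda: defaultdict(list))

    def add(self, retailer: str, sid: str, product_id: str) -> None:
        self._by_retailer[retailer][sid].append(product_id)

    def resolve(self, retailer: str, sid: str) -> list:
        """Return every product at this retailer that carries the SID."""
        return list(self._by_retailer.get(retailer, {}).get(sid, []))


index = SidIndex()
index.add("retailer_a", "35_7_120_184", "artisanal_italian_500g")
index.add("retailer_a", "35_7_120_184", "artisanal_italian_750g")
index.add("retailer_b", "35_7_120_184", "artisanal_italian_500g")

print(index.resolve("retailer_a", "35_7_120_184"))
# ['artisanal_italian_500g', 'artisanal_italian_750g']
print(index.resolve("retailer_c", "35_7_120_184"))  # []
```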

Caveats

  • Codebook stability across re-training is a non-trivial concern that is not yet well covered in published work. If product X's SID changes between codebook versions, downstream consumers can suffer version skew.
  • The first-codeword distribution carries disproportionate weight: if the first codebook encodes coarse semantic neighbourhoods unevenly (e.g. one codeword covers half the catalog), beam-search exploration is structurally biased.
  • Multi-level codebook design is a live research direction — e.g. multi-resolution codebooks, encoding constraints (dietary, brand) in the first codeword level, contrastive regularisation for co-purchased items.
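The first two caveats lend themselves to simple monitoring. The sketch below computes how concentrated the first-codeword distribution is and how many products changed SID between two codebook versions. The function names and the choice of statistics (top share, entropy, churn fraction) are assumptions; the source does not prescribe specific checks.

```python
import math
from collections import Counter


def first_codeword_stats(sids):
    """Summarise how evenly the first codebook level splits the catalog."""
    counts = Counter(sid[0] for sid in sids)
    total = sum(counts.values())
    top_code, top_count = counts.most_common(1)[0]
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {"top_code": top_code,
            "top_share": top_count / total,
            "entropy_bits": entropy}


def sid_churn(old_sids, new_sids):
    """Fraction of shared products whose SID differs between two versions."""
    shared = old_sids.keys() & new_sids.keys()
    if not shared:
        return 0.0
    changed = sum(1 for pid in shared if old_sids[pid] != new_sids[pid])
    return changed / len(shared)


# Hypothetical catalog: three of four items fall under first codeword 35.
stats = first_codeword_stats([(35, 7), (35, 8), (35, 9), (12, 3)])
print(stats["top_code"], stats["top_share"])  # 35 0.75

old = {"a": (35, 7), "b": (35, 8)}
new = {"a": (35, 7), "b": (12, 8)}
print(sid_churn(old, new))  # 0.5
```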

How they're trained: contrastive loss on catalog structure

Vanilla RQ-VAE optimizes only reconstruction fidelity; it has no notion of which products should end up near each other. Without structural guidance, it produces fragmented codes (substitutes split across different branches) and error propagation (products with sparse text land badly, and the codebook then compresses that bad placement).

The fix (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale): add a contrastive loss term that uses the catalog taxonomy as graded supervision. Pair labels:

Pair relationship | Label
Same leaf | Strong positive
Sibling leaf, shared parent | Moderate positive
No shared ancestor | Negative
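The label table can be expressed as a function over taxonomy paths. This sketch assumes each product carries a root-to-leaf tuple of category names, which is an illustrative representation. The table does not say how to treat pairs that share a more distant ancestor but not a parent, so the sketch returns "unspecified" for those rather than guessing.

```python
def pair_label(path_a: tuple, path_b: tuple) -> str:
    """Grade a product pair from their root-to-leaf taxonomy paths."""
    if path_a == path_b:
        return "strong_positive"      # same leaf
    same_parent = (len(path_a) == len(path_b) and len(path_a) > 1
                   and path_a[:-1] == path_b[:-1])
    if same_parent:
        return "moderate_positive"    # sibling leaves under one parent
    if path_a[0] != path_b[0]:
        return "negative"             # no shared ancestor
    return "unspecified"              # shared ancestor above the parent


italian = ("Bakery", "Bread", "Italian Bread")
sliced = ("Bakery", "Bread", "Sliced Bread")
soda = ("Beverages", "Soft Drinks", "Soda")
muffin = ("Bakery", "Pastry", "Muffins")

print(pair_label(italian, italian))  # strong_positive
print(pair_label(italian, sliced))   # moderate_positive
print(pair_label(italian, soda))     # negative
print(pair_label(italian, muffin))   # unspecified
```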

Loss formula: L_total = L_reconstruction + L_rq + λ · L_contrastive, with λ = 0.01: "a gentle regularizer: strong enough to improve coherence, weak enough not to destabilize reconstruction." Coarser levels (L1, L2) are weighted more heavily within the contrastive term. Companion sampler: hierarchical batch sampling (pick parent → ~half batch from its children → rest from unrelated → multi-sample within each slot). (See concepts/contrastive-regularization-with-catalog-structure + concepts/hierarchical-batch-sampling-for-contrastive-loss + patterns/contrastive-loss-via-taxonomy-tree.)
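One way to read the sampler description is sketched below. Several details are assumptions: the taxonomy is flattened to two levels (parent and leaf), "unrelated" is taken to mean leaves under any other parent, and "multi-sample within each slot" is interpreted as drawing per_leaf products from each chosen leaf. All names and data are hypothetical.

```python
import random


def sample_batch(leaves_by_parent, products_by_leaf, batch_size,
                 per_leaf=2, rng=None):
    """Return (parent, batch), where batch is a list of (leaf, product)."""
    rng = rng or random.Random()
    parent = rng.choice(sorted(leaves_by_parent))
    related = leaves_by_parent[parent]
    unrelated = [leaf for other, leaves in leaves_by_parent.items()
                 if other != parent for leaf in leaves]
    half = batch_size // 2
    batch = []
    # Roughly half the batch from the parent's children, the rest from elsewhere.
    for pool, quota in ((related, half), (unrelated, batch_size - half)):
        taken = 0
        while taken < quota:
            leaf = rng.choice(pool)
            k = min(per_leaf, quota - taken)
            # Several products per chosen leaf, so same-leaf positives exist.
            picks = rng.choices(products_by_leaf[leaf], k=k)
            batch.extend((leaf, product) for product in picks)
            taken += k
    return parent, batch


leaves_by_parent = {
    "Bread": ["Italian Bread", "Sliced Bread"],
    "Beverages": ["Sparkling Water", "Soda"],
}
products_by_leaf = {
    "Italian Bread": ["artisanal_italian", "classic_italian"],
    "Sliced Bread": ["good_seed_thin", "white_sandwich"],
    "Sparkling Water": ["lime_sparkling", "plain_sparkling"],
    "Soda": ["cola", "ginger_ale"],
}

parent, batch = sample_batch(leaves_by_parent, products_by_leaf,
                             batch_size=8, rng=random.Random(0))
print(parent, len(batch))
```

The sketch assumes at least two parents exist; with only one, the pool of unrelated leaves would be empty.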

The catalog-tree-supervision choice is explicit cold-start logic: "using our catalog taxonomy as the supervision signal rather than engagement data (which isn't available for cold-start products)."

Two flavors via different upstream embeddings

The same RQ-VAE + contrastive loss + catalog supervision can be trained against different upstream embeddings, producing two distinct cluster characters:

  • Precision flavor — domain-specific embedding (e.g. Instacart's ESCI search-relevance model) → tight substitute clusters → substitution / search / reordering.
  • Discovery flavor — LLM-cleaned attributes (Gemini-Flash-extracted) + off-the-shelf embedding (Gemma) → broader thematic clusters → homepage feeds / cross-sell / exploration.

The architectural insight: "The embedding is the decision. The RQ-VAE compresses whatever structure the embedding space gives it. Choose your embedding based on the business problem." (See concepts/precision-vs-discovery-codebook-flavor + patterns/two-flavor-codebook-precision-vs-discovery + patterns/llm-attribute-extraction-before-embedding.)

Intrinsic evaluation

Three complementary metrics evaluate codebooks directly, not just via downstream task metrics:

Metric | What it measures
Similarity-depth correlation | Hierarchy faithfulness; Spearman ρ between embedding similarity and shared-prefix depth
LLM-based cluster evaluation | Functional coherence + purchase likelihood + customer journey relevance
Taxonomy alignment | Whether shared-L1 products share a top-level category (and disagreements as audit signal)

Quote: "Evaluate codes directly. Downstream metrics can mask systematic quality problems. Intrinsic evaluation catches issues before they compound." (See patterns/intrinsic-evaluation-of-discrete-codes.)
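The similarity-depth correlation can be computed with a few lines of standard-library Python. The sketch below implements Spearman's ρ as the Pearson correlation of tie-averaged ranks, since shared-prefix depths produce many ties. The input pairs are invented illustrative values, not measurements.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical product pairs: embedding cosine similarity and shared-prefix depth.
similarities = [0.95, 0.80, 0.30, 0.10]
prefix_depths = [3, 2, 1, 0]
rho = spearman(similarities, prefix_depths)
print(rho)  # 1.0
```

A value near 1 means that pairs closer in embedding space also share longer SID prefixes, which is what a faithful hierarchy should show.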

Catalog-audit dual-use

When SIDs disagree with taxonomy labels, the label is often wrong. Examples: a Protein Bar mis-filed under Candy clusters with Sports Nutrition; a Sparkling Water mis-filed under Soda clusters with sparkling waters. Quote: "What started as a recommendation primitive is becoming infrastructure for ongoing catalog health." (See concepts/code-vs-label-mismatch-as-catalog-audit + patterns/semantic-code-as-catalog-audit.)
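A simple audit pass groups products by first codeword and flags those whose taxonomy label disagrees with a clear majority of their cluster. This is a sketch under stated assumptions: the grouping key (first codeword only), the thresholds, and the sample records are all hypothetical choices, and flagged items are candidates for review, not confirmed errors.

```python
from collections import Counter, defaultdict


def audit_candidates(products, min_cluster=3, min_agreement=0.6):
    """products: iterable of (product_id, sid_tuple, top_level_category).

    Returns (product_id, current_label, majority_label) for outliers.
    """
    clusters = defaultdict(list)
    for product_id, sid, category in products:
        clusters[sid[0]].append((product_id, category))

    flagged = []
    for members in clusters.values():
        if len(members) < min_cluster:
            continue  # too small to trust a majority vote
        majority, count = Counter(cat for _, cat in members).most_common(1)[0]
        if count / len(members) < min_agreement:
            continue  # no clear consensus in this cluster
        flagged.extend((pid, cat, majority)
                       for pid, cat in members if cat != majority)
    return flagged


# Hypothetical records echoing the protein-bar example above.
records = [
    ("protein_bar", (8, 1), "Candy"),
    ("whey_powder", (8, 2), "Sports Nutrition"),
    ("energy_gel", (8, 3), "Sports Nutrition"),
    ("creatine", (8, 4), "Sports Nutrition"),
    ("cola", (5, 1), "Soda"),
]
print(audit_candidates(records))
# [('protein_bar', 'Candy', 'Sports Nutrition')]
```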
