PATTERN Cited by 2 sources
RQ-VAE codebook as product vocabulary¶
Pattern¶
Replace atomic item IDs as the recsys vocabulary substrate with
short codeword sequences from an RQ-VAE-learned
hierarchical codebook. Each item is encoded as K codeword
indices (typically 4) where each successive codebook captures the
residual of the previous, producing a coarse-to-fine hierarchy
where semantically similar items share early codewords.
Quote (Source: sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart):
"Instacart Semantic IDs, SIDs, replace atomic product IDs with short sequences of codewords generated by an RQ-VAE. A product's SID looks like
35_7_120_184: four tokens from learned codebooks at different granularity levels."
Three structural pieces¶
1. RQ-VAE codebook training¶
Train an RQ-VAE on item features (not historical interactions).
The encoder produces a continuous embedding for each item; the
quantiser stage produces K codeword indices via successive
residual quantisation. The output codebook captures the
feature-space hierarchy of the catalog.
Why train on item features (not interactions)?
- New items have features but no interactions, so features-only training enables cold-start coverage.
- Feature-space hierarchy is more stable than interaction-derived hierarchy across catalog churn.
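The successive residual quantisation step described in this section can be sketched as follows. This is a minimal illustration of the quantiser stage only, assuming the codebooks are already trained; the toy codebooks, the 2-dimensional embedding, and the function names `nearest` and `residual_quantise` are hypothetical and do not come from the Instacart source.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Return the index of the codeword closest to v (squared Euclidean distance)."""
    def sqdist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def residual_quantise(embedding: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantise one embedding into len(codebooks) codeword indices.

    Each level quantises what the previous levels failed to explain,
    which is what makes early indices coarse and later indices fine.
    """
    residual = list(embedding)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen codeword; the next level sees only the remainder.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


# Hypothetical toy codebooks: level 0 is coarse, level 1 refines the residual.
CODEBOOKS = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[-1.0, 0.0], [1.0, 0.0]],
]

sid = residual_quantise([10.9, 10.1], CODEBOOKS)
print("_".join(map(str, sid)))  # 1_1
```

A real RQ-VAE also learns the encoder, decoder, and codebooks jointly; only the index-assignment loop is shown here.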
2. Encoding every item into the codebook¶
Run every item through the trained RQ-VAE encoder + quantiser to
produce its K-token Semantic ID. Store the (item_id, SID)
mapping. Fix it as the canonical vocabulary substrate.
Important properties:
- Coverage of every item, including new items added after codebook training (just encode their features through the trained model).
- Multiple items can share an SID — the codebook is a lossy compression. The post-decode mapping layer (e.g. concepts/retailer-partitioned-index) resolves shared SIDs to individual products.
- Prefix-sharing semantic similarity — Instacart's three-product
example:
35_7_119_493 (Organic Good Seed Thin Sliced) / 35_7_120_184 (Artisanal Italian Bread) / 35_7_120_185 (Classic Italian Bread) all share 35_7_… (bread / bakery); the latter two share 35_7_120_… (Italian-bread).
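The prefix-sharing and shared-SID properties above can be illustrated with a short sketch. The three SIDs are the ones quoted from the Instacart example; the item IDs (`item_001` and so on) and the helper names are hypothetical, introduced only for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def shared_prefix_len(sid_a: str, sid_b: str) -> int:
    """Count how many leading codewords two SIDs have in common."""
    depth = 0
    for a, b in zip(sid_a.split("_"), sid_b.split("_")):
        if a != b:
            break
        depth += 1
    return depth


def build_reverse_index(item_to_sid: Dict[str, str]) -> Dict[str, List[str]]:
    """Group item IDs by SID; several items may map to the same SID."""
    index: Dict[str, List[str]] = defaultdict(list)
    for item_id, sid in item_to_sid.items():
        index[sid].append(item_id)
    return dict(index)


GOOD_SEED = "35_7_119_493"   # Organic Good Seed Thin Sliced
ARTISANAL = "35_7_120_184"   # Artisanal Italian Bread
CLASSIC = "35_7_120_185"     # Classic Italian Bread

print(shared_prefix_len(GOOD_SEED, ARTISANAL))  # 2
print(shared_prefix_len(ARTISANAL, CLASSIC))    # 3

# Hypothetical item IDs; two of them collide on the same SID.
ITEM_TO_SID = {
    "item_001": ARTISANAL,
    "item_002": ARTISANAL,
    "item_003": CLASSIC,
}
REVERSE_INDEX = build_reverse_index(ITEM_TO_SID)
print(REVERSE_INDEX[ARTISANAL])  # ['item_001', 'item_002']
```

The reverse index is the simplest form of a post-decode mapping layer: a decoded SID resolves to a list of candidate items rather than a single product.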
3. Consuming the codebook in downstream models¶
Train recsys models with codewords as tokens, not item IDs as
tokens. The embedding table is sized to the codebook union (small),
not the catalog (huge). The model output space is K codewords per
prediction, not one atomic ID.
The Instacart 2026-06 source reports a 125× reduction in embedding parameter space from this substrate change alone.
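A rough sketch of why the embedding table shrinks: each codebook level gets its own block of rows in one shared table, so the row count is the sum of codebook sizes rather than the catalog size. The codebook sizes, catalog size, and embedding dimension below are hypothetical placeholders, not Instacart's figures, so the ratio printed here is not the 125× reported by the source.

```python
from typing import List, Sequence


def flat_token_ids(sid: str, codebook_sizes: Sequence[int]) -> List[int]:
    """Map a SID such as '35_7_120_184' to row indices in one shared embedding table.

    Level k is offset by the total size of levels 0..k-1, so the same codeword
    index at different levels gets a different embedding row.
    """
    codes = [int(part) for part in sid.split("_")]
    if len(codes) != len(codebook_sizes):
        raise ValueError("SID length must match the number of codebooks")
    rows, offset = [], 0
    for code, size in zip(codes, codebook_sizes):
        if not 0 <= code < size:
            raise ValueError(f"codeword {code} out of range for codebook of size {size}")
        rows.append(offset + code)
        offset += size
    return rows


def embedding_params(num_rows: int, dim: int) -> int:
    """Parameter count of an embedding table with num_rows rows of width dim."""
    return num_rows * dim


# All three values are assumptions chosen for illustration only.
CODEBOOK_SIZES = [256, 256, 256, 256]
CATALOG_SIZE = 1_000_000
EMBEDDING_DIM = 64

atomic_params = embedding_params(CATALOG_SIZE, EMBEDDING_DIM)
codeword_params = embedding_params(sum(CODEBOOK_SIZES), EMBEDDING_DIM)

print(flat_token_ids("35_7_120_184", CODEBOOK_SIZES))  # [35, 263, 632, 952]
print(atomic_params / codeword_params)
```

The compression ratio depends only on catalog size divided by total codeword count, since the embedding dimension cancels out.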
When this pattern composes¶
The pattern is the vocabulary-substrate ingredient in the broader generative-over-scoring-retrieval pattern. It composes with:
- Generative retrieval — autoregressive decoder generates SIDs token-by-token; the hierarchical-codebook prefix property makes beam search semantically meaningful at every step.
- Beam search retrieval — the inference primitive that consumes the codebook.
- Retailer-partitioned index — the post-decode mapping layer that resolves SIDs to candidate items.
It also has standalone value for systems that don't go all the way to generative retrieval — e.g. embedding-table compression in scoring retrieval models.
Why the hierarchy matters¶
The hierarchical-codebook property is not optional for the generative-retrieval consumer:
- A flat (non-hierarchical) codebook would still compress the vocabulary, but prefix sharing wouldn't be semantically meaningful. Beam search would explore arbitrary codeword combinations rather than narrowing within semantic neighbourhoods.
- The autoregressive prefix-conditioning property — "if the model begins generating a prefix for 'Produce,' the beam search remains confined to that semantic neighborhood" — depends entirely on the codebook hierarchy.
This is what differentiates RQ-VAE from plain VQ-VAE for recsys: the residual structure produces the hierarchy organically.
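The prefix-conditioning argument can be made concrete with a small beam search over a trie of valid SIDs. This is a sketch under stated assumptions: `toy_log_prob` is a hypothetical stand-in for a trained autoregressive decoder, the fourth SID is invented for contrast, and the source does not describe Instacart's decoding implementation.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Tuple

Sid = Tuple[int, ...]
Trie = Dict[int, "Trie"]


def build_trie(sids: Iterable[Sid]) -> Trie:
    """Index every valid SID so decoding can only follow existing prefixes."""
    root: Trie = {}
    for sid in sids:
        node = root
        for token in sid:
            node = node.setdefault(token, {})
    return root


def beam_search(
    trie: Trie,
    log_prob: Callable[[Sid, int], float],
    depth: int,
    beam_width: int,
) -> List[Tuple[Sid, float]]:
    """Expand the best beam_width prefixes one codebook level at a time."""
    beams = [(0.0, (), trie)]
    for _ in range(depth):
        candidates = []
        for score, prefix, node in beams:
            # Only children present in the trie are reachable from this prefix.
            for token, child in node.items():
                candidates.append((score + log_prob(prefix, token), prefix + (token,), child))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return [(prefix, score) for score, prefix, _ in beams]


def toy_log_prob(prefix: Sid, token: int) -> float:
    """Hypothetical decoder: strongly prefers the path 35 -> 7 -> 120 -> 184."""
    preferred = (35, 7, 120, 184)
    return -0.1 if token == preferred[len(prefix)] else -2.0


VALID_SIDS = [
    (35, 7, 119, 493),
    (35, 7, 120, 184),
    (35, 7, 120, 185),
    (12, 3, 5, 9),  # invented SID from an unrelated neighbourhood
]

results = beam_search(build_trie(VALID_SIDS), toy_log_prob, depth=4, beam_width=2)
print([prefix for prefix, _ in results])  # [(35, 7, 120, 184), (35, 7, 120, 185)]
```

With a hierarchical codebook the runner-up beam is the sibling under the same 35_7_120 prefix; with a flat codebook there would be no reason for the second-best candidate to be semantically close to the first.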
Design-space dimensions¶
The Instacart source explicitly identifies design-space directions not yet shipped:
- Multi-resolution codebooks — codebooks at multiple granularity scales beyond the current 4-codeword shape.
- Co-occurrence contrastive regularisation — training-time loss to push together SIDs of co-purchased items and apart SIDs of unrelated items. Bridges feature-space and interaction-space.
- Constraint-encoding in early codewords — encoding attributes (dietary: vegan/gluten-free; brand-tier; price-band) into the first codeword position so beam search can be constrained at the earliest decoding step.
Training-time extensions (2026-06-02 deep-companion)¶
The 2026-06-02 Semantic IDs: Product Understanding at Scale post adds three load-bearing training-time extensions to the bare RQ-VAE recipe (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale):
1. Catalog-tree contrastive regularization¶
Vanilla RQ-VAE optimizes only reconstruction fidelity, producing two failure modes: fragmentation (substitutes split into different branches) and error propagation (sparse-text products land badly + the codebook compresses that bad placement).
The fix: add a contrastive loss term using the catalog taxonomy as graded supervision, with λ = 0.01 and coarser codebook levels weighted more heavily.
(See patterns/contrastive-loss-via-taxonomy-tree +
concepts/contrastive-regularization-with-catalog-structure.)
The catalog-tree-supervision choice is the cold-start-compatible alternative to PLUM-style engagement-data supervision: the tree exists for new products, engagement data doesn't.
2. Hierarchical batch sampling¶
The contrastive loss requires positive signal in every batch. Random sampling over millions of items would produce unrelated batches with no positive signal. The fix: construct batches deliberately — pick a parent → ~half batch from its children → rest from unrelated → multi-sample within each slot. (See concepts/hierarchical-batch-sampling-for-contrastive-loss.)
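The batch-construction recipe above can be sketched as follows. This is one possible reading of the recipe, assuming a flat parent-to-children mapping; the catalog tree, item names, and function name are hypothetical, and the "multi-sample within each slot" step is left out because the source summary does not say what it entails.

```python
import random
from typing import Dict, List


def sample_hierarchical_batch(
    children_of: Dict[str, List[str]],
    batch_size: int,
    rng: random.Random,
) -> List[str]:
    """Build one batch: about half from a single parent, the rest from other parents."""
    # Sort the keys so the choice depends only on the RNG state, not dict order.
    parent = rng.choice(sorted(children_of))
    related_pool = children_of[parent]
    unrelated_pool = [
        item
        for other, items in children_of.items()
        if other != parent
        for item in items
    ]
    # Positives: items sharing the chosen parent (capped by what is available).
    related = rng.sample(related_pool, min(batch_size // 2, len(related_pool)))
    # Negatives: items from every other branch fill the remaining slots.
    unrelated = rng.sample(unrelated_pool, min(batch_size - len(related), len(unrelated_pool)))
    return related + unrelated


# Hypothetical three-branch catalog tree.
CATALOG_TREE = {
    "italian_bread": ["bread_a", "bread_b", "bread_c", "bread_d"],
    "sparkling_water": ["water_a", "water_b", "water_c", "water_d"],
    "dog_food": ["dog_a", "dog_b", "dog_c", "dog_d"],
}

batch = sample_hierarchical_batch(CATALOG_TREE, batch_size=6, rng=random.Random(0))
print(batch)
```

Every batch built this way contains at least one group of siblings, which is the positive signal that uniform random sampling over a large catalog would almost never provide.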
3. Two flavors via different upstream embeddings¶
Same RQ-VAE skeleton + contrastive loss + catalog supervision, different upstream embedding = different cluster character. This is the design axis canonicalised in patterns/two-flavor-codebook-precision-vs-discovery:
| Flavor | Upstream | Cluster character | Use cases |
|---|---|---|---|
| Precision (Instacart: ESCI) | Domain-specific search-relevance embedding | Tight substitute clusters | Substitution / search / reordering |
| Discovery (Instacart: ESCI+Gemma) | LLM-cleaned attributes + off-the-shelf embedding | Broader thematic clusters | Homepage / cross-sell / exploration |
The architectural insight: "The embedding is the decision. The RQ-VAE compresses whatever structure the embedding space gives it. Choose your embedding based on the business problem."
Production-disclosed properties¶
From Instacart's 2026-06 disclosures:
- ~2,000 codeword tokens represent the entire catalog (4 hierarchical codebooks).
- 125× embedding-parameter compression vs atomic-ID embedding table.
- Spearman 0.69–0.84 similarity-depth correlation in production codebooks (intrinsic-evaluation datum; see patterns/intrinsic-evaluation-of-discrete-codes).
- Carousel A/B: +34% add-to-carts, 2.7× more emerging brands surfaced.
- Tail-category lifts in the consumer ads retriever: +421% Alcohol / +396% Beverages / +229% Healthcare diversity.
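One way to read the similarity-depth datum is as a rank correlation between how many leading codewords two items share and how similar their upstream embeddings are. The sketch below computes that under this assumption; the source summary does not spell out the exact protocol, and the toy items, embeddings, and helper names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading codewords two SIDs have in common."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical items: (SID, upstream embedding).
ITEMS = [
    ((35, 7, 119, 493), [1.0, 0.2]),
    ((35, 7, 120, 184), [1.0, 0.0]),
    ((35, 7, 120, 185), [1.0, 0.05]),
    ((12, 3, 5, 9), [0.0, 1.0]),
]

depths, sims = [], []
for (sid_a, emb_a), (sid_b, emb_b) in combinations(ITEMS, 2):
    depths.append(shared_prefix_len(sid_a, sid_b))
    sims.append(cosine(emb_a, emb_b))

rho = spearman(depths, sims)
print(round(rho, 2))
```

A value near 1 means deeper shared prefixes reliably correspond to more similar items, which is the property the hierarchy is supposed to deliver.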
Emergent dual-use: catalog audit¶
When SIDs disagree with taxonomy labels, the label is often wrong — Instacart turned this disagreement into an automated catalog-audit pipeline. Quote: "What started as a recommendation primitive is becoming infrastructure for ongoing catalog health." (See patterns/semantic-code-as-catalog-audit + concepts/code-vs-label-mismatch-as-catalog-audit.)
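A minimal sketch of the code-vs-label mismatch idea: group items by SID prefix, take the majority taxonomy label in each group, and flag items whose label disagrees. The flagging rule, thresholds, and sample items are assumptions made for illustration; the source summary does not describe how Instacart's audit pipeline works internally.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Item = Tuple[str, Tuple[int, ...], str]  # (item_id, sid, taxonomy_label)


def find_label_mismatches(
    items: List[Item],
    prefix_len: int = 3,
    min_cluster: int = 3,
) -> List[Tuple[str, str, str]]:
    """Return (item_id, current_label, majority_label) for suspicious items."""
    clusters: Dict[Tuple[int, ...], List[Tuple[str, str]]] = defaultdict(list)
    for item_id, sid, label in items:
        clusters[sid[:prefix_len]].append((item_id, label))

    flagged = []
    for members in clusters.values():
        if len(members) < min_cluster:
            continue  # too few neighbours to trust a majority vote
        majority, count = Counter(label for _, label in members).most_common(1)[0]
        if count <= len(members) / 2:
            continue  # no clear majority, so nothing to compare against
        for item_id, label in members:
            if label != majority:
                flagged.append((item_id, label, majority))
    return flagged


# Hypothetical catalog rows; "c" sits in a bread cluster but is labelled Pet Food.
CATALOG = [
    ("a", (35, 7, 120, 184), "Bakery"),
    ("b", (35, 7, 120, 185), "Bakery"),
    ("c", (35, 7, 120, 186), "Pet Food"),
    ("d", (12, 3, 5, 9), "Beverages"),
]

mismatches = find_label_mismatches(CATALOG)
print(mismatches)  # [('c', 'Pet Food', 'Bakery')]
```

Flagged items are candidates for review rather than automatic relabelling, since the disagreement can also come from a poorly placed SID.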
Caveats¶
- Codebook stability across retraining is a non-trivial concern: if product X's SID changes between codebook versions, downstream consumers suffer version skew. Industry practice for managing this is not well published.
- First-codeword skew: if the first codebook concentrates semantic neighbourhoods unevenly (one codeword covers half the catalog), beam-search exploration is structurally biased.
- Item features must be rich for RQ-VAE to learn meaningful hierarchy. Items without good feature representations (sparse, abstract, low-cardinality) produce arbitrary codebooks.
- Codebook size is a hyperparameter trade-off — too small compresses too aggressively (loses item-level distinguishability); too large doesn't escape the vocabulary bottleneck.
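The first two caveats can be monitored with simple diagnostics: the fraction of items whose SID changed between two codebook versions, and the share of the catalog under each first codeword. The sketch below is illustrative only; the function names, item IDs, and SIDs are hypothetical, and what counts as an acceptable level of churn or skew is left open.

```python
from collections import Counter
from typing import Dict, Tuple

Sid = Tuple[int, ...]


def sid_churn(old: Dict[str, Sid], new: Dict[str, Sid]) -> float:
    """Fraction of items present in both versions whose SID changed."""
    shared = old.keys() & new.keys()
    if not shared:
        return 0.0
    changed = sum(1 for item_id in shared if old[item_id] != new[item_id])
    return changed / len(shared)


def first_codeword_share(sids: Dict[str, Sid]) -> Dict[int, float]:
    """Share of the catalog that falls under each first-level codeword."""
    counts = Counter(sid[0] for sid in sids.values())
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


# Hypothetical SID assignments from two codebook versions.
OLD_VERSION = {"a": (1, 2), "b": (1, 3), "c": (2, 1), "d": (1, 4)}
NEW_VERSION = {"a": (1, 2), "b": (2, 3), "c": (2, 1), "e": (5, 5)}

print(sid_churn(OLD_VERSION, NEW_VERSION))   # 1 of the 3 shared items changed
print(first_codeword_share(OLD_VERSION))     # {1: 0.75, 2: 0.25}
```

A first codeword holding 75% of the catalog, as in this toy data, is the kind of concentration that would bias beam-search exploration at the first decoding step.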
Seen in¶
- sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale — deep-companion training-methodology disclosure: catalog-tree contrastive regularization, hierarchical batch sampling, two-flavor design, intrinsic-evaluation methodology, ~2,000 codeword tokens for the entire catalog, +34% add-to-carts + 2.7× emerging brands, catalog-audit dual-use.
- sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — Instacart Semantic IDs as the production substrate behind generative ads retrieval (the consumer side).
- systems/tiger-generative-retrieval — TIGER paper (Google DeepMind, NeurIPS 2023) — academic origin of the pattern for recsys; production at Spotify (GLIDE/NEO), YouTube (PLUM).
Related¶
- concepts/semantic-id / concepts/atomic-product-id-vs-semantic-id / concepts/vocabulary-bottleneck — supporting concepts.
- systems/rq-vae — algorithmic substrate.
- systems/instacart-semantic-ids / systems/tiger-generative-retrieval / systems/instacart-generative-ads-retrieval — production instances.
- patterns/generative-over-scoring-retrieval — broader pattern this is the vocabulary-substrate ingredient of.