SYSTEM Cited by 2 sources

RQ-VAE

Definition

RQ-VAE (Residual Quantized Variational Autoencoder) is a generative-model architecture that learns a hierarchical codebook for discretising continuous embeddings into short sequences of codeword indices. Each input is encoded as K codeword indices, one from each of K learned codebooks, where each successive codebook quantizes the residual (the part of the embedding not yet captured by the previous codebooks).

input embedding e ∈ R^d
codebook_1: argmin_c ||e - c||²       → token_1
residual_1 = e - codebook_1[token_1]
codebook_2: argmin_c ||residual_1 - c||² → token_2
residual_2 = residual_1 - codebook_2[token_2]
... continue for K levels ...
output: (token_1, token_2, ..., token_K) — the "Semantic ID"
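The quantization steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the greedy residual lookup only: the function name residual_quantize and the hand-written toy codebooks are assumptions for the example, and a real RQ-VAE learns its codebooks jointly with an encoder and decoder.

```python
import numpy as np

def residual_quantize(e, codebooks):
    """Greedy residual quantization of a single embedding.

    e: array of shape (d,)
    codebooks: list of K arrays, each of shape (codebook_size, d)
    Returns the K token indices and the residual left after the last level.
    """
    residual = np.asarray(e, dtype=float)
    tokens = []
    for codebook in codebooks:
        # Nearest codeword to the current residual (squared L2 distance).
        distances = np.sum((codebook - residual) ** 2, axis=1)
        token = int(np.argmin(distances))
        tokens.append(token)
        # Subtract the chosen codeword; the next level quantizes what is left.
        residual = residual - codebook[token]
    return tokens, residual

# Toy codebooks written by hand for illustration (not learned).
codebooks = [
    np.array([[0.0, 0.0], [4.0, 4.0]]),              # coarse level
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),  # finer level
]
tokens, residual = residual_quantize(np.array([4.9, 4.1]), codebooks)
semantic_id = "_".join(str(t) for t in tokens)
```

Each level only ever sees what the earlier levels failed to explain, which is why the first token ends up carrying the coarsest information.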

Where it differs from plain Vector Quantization (VQ-VAE): plain VQ-VAE uses a single codebook, so each embedding maps to a single token; RQ-VAE stacks K codebooks where each captures the residual of the previous. K codebooks of C codewords each can address up to C^K distinct sequences while learning only K·C codewords, far fewer than a single flat codebook of equivalent representational capacity would need. Critically for generative recsys, products with similar embeddings share early tokens, giving the prefix-sharing semantic similarity property that Semantic IDs depend on.

Why it shows up on the wiki

Disclosed as the algorithm behind Instacart Semantic IDs (Source: sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart):

"Instacart Semantic IDs, SIDs, replace atomic product IDs with short sequences of codewords generated by an RQ-VAE. A product's SID looks like 35_7_120_184: four tokens from learned codebooks at different granularity levels."

RQ-VAE is the load-bearing piece of the TIGER paper that makes generative retrieval over the catalog vocabulary economical.

Hierarchical-codebook property — the key recsys win

The post discloses that the four codebooks are "at different granularity levels". The recsys consequence: when the generative-retrieval decoder produces the first codeword, it is choosing a coarse semantic neighbourhood (e.g. bakery); the second narrows it (e.g. bread); the third narrows further (e.g. Italian bread); the fourth distinguishes individual SKUs. Beam search at each step is therefore choosing among semantically meaningful options rather than across the full catalog vocabulary.

The 2026-06-02 source's three-product prefix example demonstrates this in production:

SID Product
35_7_119_493 Organic Good Seed Thin Sliced
35_7_120_184 Artisanal Italian Bread
35_7_120_185 Classic Italian Bread

Shared 35_7_… prefix = bread / bakery semantic neighbourhood. Shared 35_7_120_… prefix = Italian-bread sub-category.
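The prefix relationships in the table can be checked mechanically. The sketch below uses only the three SIDs quoted above; the helper names shared_prefix_length and with_prefix are hypothetical and not part of any Instacart API.

```python
def shared_prefix_length(sid_a, sid_b):
    """Number of leading codeword tokens two Semantic IDs have in common."""
    count = 0
    for token_a, token_b in zip(sid_a.split("_"), sid_b.split("_")):
        if token_a != token_b:
            break
        count += 1
    return count

def with_prefix(products, prefix):
    """Product names whose SID starts with the given token prefix."""
    tokens = prefix.split("_")
    # Compare whole tokens so that "35_7_12" does not match "35_7_120".
    return [name for sid, name in products.items()
            if sid.split("_")[:len(tokens)] == tokens]

products = {
    "35_7_119_493": "Organic Good Seed Thin Sliced",
    "35_7_120_184": "Artisanal Italian Bread",
    "35_7_120_185": "Classic Italian Bread",
}

bread_depth = shared_prefix_length("35_7_119_493", "35_7_120_184")    # 2
italian_depth = shared_prefix_length("35_7_120_184", "35_7_120_185")  # 3
italian_breads = with_prefix(products, "35_7_120")
```

A longer shared prefix means the two products sit in a narrower semantic neighbourhood.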

Why this is a non-trivial alternative to embeddings

Plain item embeddings (the conventional recsys vocabulary substrate) are continuous and require Approximate Nearest Neighbour search (concepts/ann-index) to retrieve. Item-as-discrete-token substrates (the GRU4Rec / SASRec lineage) require a vocabulary the size of the catalog, hitting the vocabulary bottleneck.

RQ-VAE-derived Semantic IDs thread the needle: discrete (so they can be the output of an autoregressive decoder) but with shared substructure (so the embedding parameter space scales with the codebook size, not the catalog size — Instacart reports a 125× reduction in embedding parameter space).

Caveats

  • This is a stub page capturing RQ-VAE as the algorithmic substrate behind SIDs. Original-paper-level architectural detail (encoder/decoder shape, training objective, codebook update scheme, dead-codeword handling) is not reproduced here; future ingest of the Instacart companion Semantic IDs: Product Understanding at Scale post would deepen this.
  • The TIGER paper (Rajput et al., NeurIPS 2023) is the load-bearing reference for the recsys application of RQ-VAE.

Instacart's training methodology (2026-06-02 deep-companion)

The 2026-06-02 Semantic IDs: Product Understanding at Scale post discloses Instacart's RQ-VAE training methodology with two key extensions over vanilla RQ-VAE:

Catalog-tree contrastive regularization

Vanilla RQ-VAE optimizes only reconstruction fidelity. Without structural guidance, two failure modes appear (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale):

  • Fragmentation: "two marinara sauces that any customer would consider substitutes end up in different branches".
  • Error propagation: "a product with product details, category and descriptions gets embedded poorly and placed among irrelevant items."

The fix: add a contrastive loss term using the catalog taxonomy as graded supervision (see concepts/contrastive-regularization-with-catalog-structure + patterns/contrastive-loss-via-taxonomy-tree). Loss formula:

L_total = L_reconstruction + L_rq + λ · L_contrastive

with λ = 0.01, described as "a gentle regularizer: strong enough to improve coherence, weak enough not to destabilize reconstruction". Coarser codebook levels (L1, L2) are weighted more heavily within the contrastive term so that broad groupings take priority. (See concepts/reconstruction-vs-semantic-loss-tradeoff.)
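How the terms combine can be sketched as plain arithmetic. Only λ = 0.01 and the statement that coarser levels are weighted more heavily come from the source. The per-level weights (4, 2, 1, 1), the loss values, and the function name total_loss are illustrative assumptions.

```python
def total_loss(l_reconstruction, l_rq, contrastive_per_level, level_weights, lam=0.01):
    """L_total = L_reconstruction + L_rq + lam * L_contrastive.

    L_contrastive is taken here to be a weighted sum of per-level contrastive
    losses. The weighting scheme is an assumption, not the disclosed one.
    """
    l_contrastive = sum(w * l for w, l in zip(level_weights, contrastive_per_level))
    return l_reconstruction + l_rq + lam * l_contrastive

# Hypothetical numbers: levels L1 and L2 get larger weights than L3 and L4.
level_weights = (4.0, 2.0, 1.0, 1.0)
contrastive_per_level = (0.5, 0.5, 0.5, 0.5)
loss = total_loss(1.0, 0.2, contrastive_per_level, level_weights)
```

With these numbers the contrastive term contributes 0.04 to a total of about 1.24, which illustrates what a "gentle" regularizer looks like next to the reconstruction term.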

Hierarchical batch sampling

The contrastive loss requires each batch to contain same-leaf, sibling-leaf, and unrelated pairs. Random sampling over millions of items would produce no positive signal. The fix: deliberate batch construction — pick a parent category → fill ~half batch with its children → fill rest with unrelated categories → multi-sample within each category slot. (See concepts/hierarchical-batch-sampling-for-contrastive-loss.)
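The batch-construction recipe above can be sketched as follows. The source describes the steps only in prose, so this is one possible reading: the function build_batch, its parameters, and the toy taxonomy are all hypothetical.

```python
import random

def build_batch(taxonomy, items_by_leaf, batch_size, samples_per_leaf=2, seed=None):
    """taxonomy: parent -> list of leaf categories.
    items_by_leaf: leaf -> list of item ids.
    Returns the chosen parent and a list of (item, leaf, parent) tuples.
    """
    rng = random.Random(seed)
    parent_of = {leaf: p for p, leaves in taxonomy.items() for leaf in leaves}
    anchor_parent = rng.choice(sorted(taxonomy))
    batch = []

    def fill(leaves, limit):
        leaves = list(leaves)
        rng.shuffle(leaves)
        for leaf in leaves:
            room = limit - len(batch)
            if room <= 0:
                break
            items = items_by_leaf[leaf]
            # Multi-sample within a leaf so same-leaf positive pairs exist.
            k = min(samples_per_leaf, len(items), room)
            for item in rng.sample(items, k):
                batch.append((item, leaf, parent_of[leaf]))

    # About half the batch: leaves under the chosen parent (same-leaf and sibling pairs).
    fill(taxonomy[anchor_parent], batch_size // 2)
    # Remainder: leaves under other parents (unrelated pairs).
    others = [leaf for p, leaves in taxonomy.items() if p != anchor_parent for leaf in leaves]
    fill(others, batch_size)
    return anchor_parent, batch

taxonomy = {
    "pasta sauce": ["marinara", "alfredo"],
    "cheese": ["parmesan", "brie"],
    "snacks": ["chips", "crackers"],
}
items_by_leaf = {leaf: [f"{leaf}-{i}" for i in range(3)]
                 for leaves in taxonomy.values() for leaf in leaves}
anchor_parent, batch = build_batch(taxonomy, items_by_leaf, batch_size=8, seed=0)
```

In this toy run, half of the eight slots come from the two leaves under the chosen parent and the other half from unrelated leaves, so every batch contains all three pair types.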

Two flavors via different upstream embeddings

Instacart trains the same RQ-VAE + contrastive loss + catalog supervision against two different upstream embedding substrates (see patterns/two-flavor-codebook-precision-vs-discovery):

  • ESCI (precision) — domain-specific search-relevance embedding → tight substitute clusters.
  • ESCI+Gemma (discovery) — Gemini-Flash-cleaned attributes → off-the-shelf Gemma embedding → broader thematic clusters.

Architectural insight: "The embedding is the decision. The RQ-VAE compresses whatever structure the embedding space gives it. Choose your embedding based on the business problem."

Disclosed cardinality

~2,000 codeword tokens represent Instacart's entire catalog across 4 hierarchical codebooks. This is the concrete vocabulary size that escapes the catalog-bounded vocabulary of atomic product IDs — small enough to make autoregressive generative decoding economical.
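A rough capacity calculation shows why so few tokens suffice. The source gives only the total (~2,000 tokens) and the number of codebooks (4). The even split of 500 codewords per level below is an assumption made purely for the arithmetic.

```python
import math

# Assumption: ~2,000 tokens split evenly across 4 levels. The real split is not disclosed.
codebook_sizes = [500, 500, 500, 500]

vocabulary_size = sum(codebook_sizes)         # tokens the decoder must be able to emit
addressable_ids = math.prod(codebook_sizes)   # distinct 4-token sequences

print(vocabulary_size)   # 2000
print(addressable_ids)   # 62500000000
```

Under that assumption the decoder vocabulary grows with the sum of the codebook sizes, while the number of distinct Semantic IDs grows with their product.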

Cluster character (production examples)

Under SID prefix 6_19_:

  • 6_19_32 — Italian cheeses (Parmigiano, Pecorino, Mozzarella, Ricotta).
  • 6_19_24 — Specialty cheeses (Brie, Manchego, Halloumi, Goat cheese).
  • 6_19_12 — Olives (Castelvetrano, Kalamata, olive medleys).
  • 6_19_7 — Tapenades (olive tapenade, spreads).
  • 6_19_9 — Deli trays and dips.
  • 6_19_14 — Croutons.

Quote: "No one wrote a rule connecting Pecorino Romano to Kalamata olives to olive tapenade. The model learned that these products inhabit the same culinary universe… by compressing their embeddings into codes that share a prefix."

Seen in
