CONCEPT Cited by 1 source
Contrastive regularization with catalog structure¶
Definition¶
Contrastive regularization with catalog structure is the technique of adding a contrastive-loss term to a representation-learning objective (e.g. an RQ-VAE codebook learner) that uses a product-catalog taxonomy — not user-engagement data — as the graded supervision signal for what counts as similar and dissimilar.
A contrastive loss pulls similar items closer in the learned representation and pushes dissimilar items apart. The choice of similarity signal is the load-bearing design lever: classical contrastive learning uses engagement-derived pairs (co-purchased, co-viewed, co-clicked), whereas catalog-structure contrastive learning uses taxonomy distance:
| Pair relationship in taxonomy | Contrastive label |
|---|---|
| Same leaf category (two marinara sauces) | Strong positive |
| Sibling leaf, shared parent (marinara + alfredo, both under "Pasta & Pizza Sauces") | Moderate positive |
| No shared ancestor ("Pasta Sauce" vs "Office Supplies") | Negative |
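The three rows of the table can be sketched as a function over root-to-leaf category paths. This is a minimal illustration, not Instacart's implementation: the path representation, the function names, the label strings, and the "Pantry" and "Pens" categories are all assumptions, and relationships the table does not cover (for example, a shared grandparent only) are deliberately left unlabelled rather than guessed.

```python
from typing import Optional, Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading path components two category paths have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pair_label(a: Sequence[str], b: Sequence[str]) -> Optional[str]:
    """Map two root-to-leaf category paths to a contrastive label.

    Label names are hypothetical; only the three cases in the table are handled.
    """
    shared = shared_prefix_len(a, b)
    if shared == 0:
        return "negative"            # no shared ancestor
    if shared == len(a) == len(b):
        return "strong_positive"     # same leaf category
    if len(a) == len(b) and shared == len(a) - 1:
        return "moderate_positive"   # sibling leaves under one parent
    return None                      # not specified by the table


# Hypothetical paths; "Pantry" and "Pens" are made up for the example.
marinara = ("Pantry", "Pasta & Pizza Sauces", "Marinara")
alfredo = ("Pantry", "Pasta & Pizza Sauces", "Alfredo")
office = ("Office Supplies", "Pens")
```

Because the label depends only on the two paths, it can be computed for any pair of products, including ones with no engagement history.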
Quote (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale):
"Rather than binary same/different labels, we define relatedness along a gradient based on where two products sit in the catalog tree. … The signal isn't relative to any single product; it's defined by the structural distance between any pair in the taxonomy."
Why catalog structure (not engagement data)¶
The dominant published approach to contrastive learning for recommender systems uses engagement data: co-purchase pairs, sequential session pairs, and collaborative-filtering-derived similarity. YouTube's PLUM is the canonical recent example of behavior-aligned codebook training.
Catalog-structure supervision is the cold-start-compatible alternative: the catalog tree exists for every product on day 1, including products with zero engagement history. Quote:
"using our catalog taxonomy as the supervision signal rather than engagement data (which isn't available for cold-start products)."
The post is explicit that this is a deliberate trade — engagement data carries information the taxonomy doesn't (user-revealed substitutability, complementarity), and Instacart names extending beyond catalog-structure-only contrastive signal to also leverage behavioral signal as future work.
What it fixes (in RQ-VAE training)¶
A vanilla RQ-VAE optimizes only reconstruction fidelity — the quantizer's job is to compress the embedding such that decoding reproduces it. It has no notion of which products should end up near each other in the codebook. Without structural guidance, the quantizer produces two distinct failure modes (Source: same):
- Fragmentation: "two marinara sauces that any customer would consider substitutes end up in different branches". The codebook does not respect substitution semantics.
- Error propagation: "a product with product details, category and descriptions gets embedded poorly and placed among irrelevant items". Sparse-text products are embedded poorly, and the codebook then faithfully compresses that bad placement.
The contrastive term is the architectural fix: it pulls catalog-related items together in the codebook space and pushes unrelated items apart, biasing the quantizer toward business-meaningful clusters even when the upstream embedding alone wouldn't.
Loss formula¶
The Instacart RQ-VAE training objective is:
$$L_{\text{total}} = L_{\text{reconstruction}} + L_{\text{rq}} + \lambda \cdot L_{\text{contrastive}}$$
where:
- `L_reconstruction`: the autoencoder reconstruction term.
- `L_rq`: the RQ-VAE residual-quantization commitment loss.
- `L_contrastive`: the catalog-structure contrastive term defined over taxonomy-distance pair labels.
- `λ = 0.01`: "a gentle regularizer: strong enough to improve coherence, weak enough not to destabilize reconstruction."
The contrastive term aligns embedding similarity with codebook-index similarity across all four codebook levels, with coarser levels (L1, L2) weighted more heavily so broad groupings take priority. This is the architectural reason a generated SID's first codeword encodes a coarse semantic neighborhood and successive codewords narrow within it. (See concepts/reconstruction-vs-semantic-loss-tradeoff for the balance.)
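As a rough arithmetic sketch of how these pieces combine: the total objective adds `λ` times the contrastive term to the reconstruction and RQ losses, and the contrastive term itself is a weighted sum over the four codebook levels. The per-level weights below are hypothetical (the source says only that coarser levels are weighted more heavily, without publishing numbers), and the per-level loss values are assumed to be computed elsewhere.

```python
from typing import Sequence

LAMBDA = 0.01  # value reported in the source

# Hypothetical weights for codebook levels 1..4, coarse to fine.
# Only the ordering (coarser = heavier) comes from the source.
LEVEL_WEIGHTS = (0.4, 0.3, 0.2, 0.1)


def contrastive_term(per_level_losses: Sequence[float]) -> float:
    """Weighted sum of per-level contrastive losses, coarse levels first."""
    if len(per_level_losses) != len(LEVEL_WEIGHTS):
        raise ValueError("expected one loss value per codebook level")
    return sum(w * loss for w, loss in zip(LEVEL_WEIGHTS, per_level_losses))


def total_loss(
    l_reconstruction: float,
    l_rq: float,
    per_level_losses: Sequence[float],
    lam: float = LAMBDA,
) -> float:
    """L_total = L_reconstruction + L_rq + lam * L_contrastive."""
    return l_reconstruction + l_rq + lam * contrastive_term(per_level_losses)
```

With these assumed weights, a disagreement at the first level costs four times as much as the same disagreement at the fourth, and at `λ = 0.01` the whole contrastive term remains a small perturbation on top of reconstruction.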
Why it requires hierarchical batch sampling¶
A contrastive loss is mute on a batch of unrelated items: there are no positives to pull together. With random sampling over millions of catalog items, "most batches would be entirely unrelated — the loss would have no positive signal to learn from." The companion technique hierarchical batch sampling constructs each batch deliberately — pick a parent category → fill ~half from its children → fill rest from unrelated categories → sample multiple products per slot — so each batch naturally contains same-leaf, sibling-leaf, and unrelated pairs.
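The four batch-construction steps can be sketched as follows. This is an illustrative toy under stated assumptions, not the production sampler: the two-level `parent -> leaf -> product ids` dictionary, the function name `sample_batch`, the slot counts, and the round-robin choice of child leaves are all hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple

# Assumed toy shape: parent category -> leaf category -> product ids.
# Every parent is assumed to have at least one non-empty leaf.
Taxonomy = Dict[str, Dict[str, List[str]]]
Row = Tuple[str, str, str]  # (product_id, parent, leaf)


def sample_batch(
    taxonomy: Taxonomy,
    num_slots: int = 8,
    products_per_slot: int = 2,
    rng: Optional[random.Random] = None,
) -> Tuple[str, List[Row]]:
    rng = rng or random.Random()
    parents = sorted(taxonomy)
    if len(parents) < 2:
        raise ValueError("need at least two parent categories")

    # 1. Pick the anchor parent category.
    anchor = rng.choice(parents)

    # 2. Fill about half the slots from the anchor's children, cycling
    #    through them so sibling leaves appear together when possible.
    children = sorted(taxonomy[anchor])
    rng.shuffle(children)
    n_related = num_slots // 2
    slots = [(anchor, children[i % len(children)]) for i in range(n_related)]

    # 3. Fill the remaining slots from leaves under other parents.
    unrelated = [(p, leaf) for p in parents if p != anchor
                 for leaf in sorted(taxonomy[p])]
    slots += [rng.choice(unrelated) for _ in range(num_slots - n_related)]

    # 4. Sample several products per slot, skipping duplicates.
    batch: List[Row] = []
    seen = set()
    for parent, leaf in slots:
        products = taxonomy[parent][leaf]
        k = min(products_per_slot, len(products))
        for pid in rng.sample(products, k):
            if pid not in seen:
                seen.add(pid)
                batch.append((pid, parent, leaf))
    return anchor, batch
```

Sampling more than one product per slot is what yields same-leaf pairs, cycling through the anchor's children yields sibling-leaf pairs, and the remaining slots supply the unrelated negatives.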
Generalization beyond Instacart¶
The pattern generalizes wherever a structured taxonomy of items exists prior to interaction data:
- E-commerce catalogs with category trees (Instacart canonical instance).
- Media catalogs with genre/sub-genre hierarchies (films, TV, music).
- Library systems with subject classifications (LCC, Dewey).
- Knowledge graphs with class hierarchies.
- Job listings with role/level taxonomies.
The substrate-agnostic insight: taxonomy is a free, dense, day-1 supervision signal for representation learning where engagement data is sparse, biased, or absent.
Caveats¶
- Taxonomy quality is an upper bound. If the catalog tree itself has mislabels, the contrastive signal carries the noise into the codebook. Instacart's own post-hoc finding — that SIDs surface catalog mislabels (Protein Bar in Candy, Sparkling Water in Soda) — illustrates that the taxonomy is imperfect, even though it's load-bearing for SID training. (See concepts/code-vs-label-mismatch-as-catalog-audit.)
- Same-leaf == positive is a simplification. Two products in the same leaf may be functionally distinct (different brands, sizes, formats); the loss treats them as identical positives. Whether intra-leaf differentiation should also be supervised is open.
- Engagement-signal blindness. Catalog-structure contrastive learning misses user-revealed substitutability — two products that are taxonomically distant but functionally substitutable (bread crumbs vs panko) get pushed apart. Behavioral-signal augmentation (PLUM-style) is the extension direction.
- Hyperparameter `λ` requires tuning. Too small → no contrastive effect; too large → reconstruction destabilizes. Instacart settled on `λ = 0.01`; that value is specific to their RQ-VAE and embedding shape.
Seen in¶
- sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale: first canonical wiki disclosure of catalog-tree contrastive regularization for Instacart Semantic IDs RQ-VAE training, with loss formula `L_total = L_reconstruction + L_rq + λ · L_contrastive` at `λ = 0.01`, and explicit positioning as the cold-start-compatible alternative to PLUM's engagement-data approach.
Related¶
- concepts/hierarchical-batch-sampling-for-contrastive-loss: the companion technique; without it the contrastive loss has no positive signal in random batches.
- concepts/reconstruction-vs-semantic-loss-tradeoff: the balance the `λ` term controls.
- concepts/semantic-id: the substrate this technique improves.
- concepts/cold-start: the axis this technique addresses (works for products with zero engagement history).
- systems/rq-vae: the algorithm extended.
- systems/instacart-semantic-ids: production instance.
- patterns/contrastive-loss-via-taxonomy-tree: the canonical pattern.
- patterns/rq-vae-codebook-as-product-vocabulary: the broader pattern this technique is a training-time ingredient of.