PATTERN Cited by 1 source
Contrastive loss via taxonomy tree
Pattern
Use a product / item taxonomy tree as the graded supervision signal for a contrastive loss term during representation learning, in place of (or alongside) engagement-derived pair labels. The tree provides three pair classes structurally:
| Pair relationship | Contrastive label |
|---|---|
| Same leaf category | Strong positive — pull together tightly |
| Sibling leaf, shared parent | Moderate positive — pull together moderately |
| No shared ancestor | Negative — push apart |
The pattern composes with deliberate batch construction (hierarchical batch sampling) so each batch contains all three pair classes, ensuring meaningful gradient signal.
Why use a tree (not engagement data)
The classical contrastive-recsys recipe uses engagement-derived pairs: co-purchased items, co-viewed items, sequential session items. YouTube's PLUM is the canonical recent example — codebooks aligned to user behavior patterns.
Taxonomy-tree supervision is the cold-start-compatible alternative: the tree exists on day 1 for every item, including items with zero engagement history. The Instacart post is explicit (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale):
"Inspired by PLUM's behavioral alignment approach, we added a contrastive term to RQ-VAE training, using our catalog taxonomy as the supervision signal rather than engagement data (which isn't available for cold-start products)."
The trade-off:
| Supervision source | Strengths | Weaknesses |
|---|---|---|
| Taxonomy tree | Day-1 coverage; cold-start compatible; deterministic; no labeling cost | Misses user-revealed substitutability; encodes catalog-team biases; depends on tree quality |
| Engagement data | Captures user-revealed similarity; cross-category bridges; behavior-aligned | Sparse for new items; popularity-biased; expensive to label / curate |
| Both (PLUM extended with taxonomy, or taxonomy extended with engagement — Instacart's stated future direction) | Complementary; cold-start + behavior-aware | Hyperparameter complexity; potential signal conflicts |
The training-time recipe
Three structural pieces:
1. Define pair labels from tree-distance
For each pair (a, b) in a training batch, compute the tree-distance between the leaf categories of a and b, measured as the number of levels climbed from a leaf to reach the lowest category both items share:
- Distance 0 (same leaf) → strong positive.
- Distance 1 (sibling leaves, shared parent) → moderate positive.
- Distance > 1 (no shared parent, up to and including pairs with no shared ancestor at any depth) → negative.
Optionally extend the grading with intermediate distances (cousin leaves, second-cousin leaves); Instacart's recipe is the three-class version.
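The labelling rule above can be sketched in a few lines of Python. This is a minimal illustration, not Instacart's implementation: it assumes each product carries a category path from the top-level category down to its leaf, that all products attach to leaves at the same depth, and that distance counts the levels climbed from a leaf to the lowest shared category. The names `PairLabel`, `tree_distance`, and `pair_label` are hypothetical.

```python
from enum import Enum
from typing import Sequence


class PairLabel(Enum):
    STRONG_POSITIVE = "strong_positive"      # same leaf
    MODERATE_POSITIVE = "moderate_positive"  # sibling leaves, shared parent
    NEGATIVE = "negative"                    # everything further apart


def tree_distance(path_a: Sequence[str], path_b: Sequence[str]) -> int:
    """Levels climbed from the leaf to the lowest category both paths share.

    Paths run root-to-leaf, e.g. ("Snacks", "Chips", "Tortilla Chips").
    If the paths share nothing, the result equals the full path depth.
    """
    shared = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        shared += 1
    return max(len(path_a), len(path_b)) - shared


def pair_label(path_a: Sequence[str], path_b: Sequence[str]) -> PairLabel:
    """Three-class labelling: distance 0, distance 1, everything else."""
    distance = tree_distance(path_a, path_b)
    if distance == 0:
        return PairLabel.STRONG_POSITIVE
    if distance == 1:
        return PairLabel.MODERATE_POSITIVE
    return PairLabel.NEGATIVE
```

Under this rule, cousin leaves (shared grandparent only) land in the negative class. A graded variant would instead map distances 2 and 3 to progressively weaker positives.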
2. Construct batches deliberately
Use hierarchical batch sampling: pick a parent category → fill ~half the batch from its children → fill the rest from unrelated categories → sample multiple products per category slot. This guarantees all three pair classes appear in every batch.
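One way to sketch this sampling procedure in Python is shown below. It is a simplified illustration under several assumptions that are not in the source: the taxonomy is given as a two-level map from parent category to leaf categories, different parents are treated as unrelated, every leaf holds at least `per_leaf` products, and the function name `sample_hierarchical_batch` and its parameters are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple


def sample_hierarchical_batch(
    children_by_parent: Dict[str, List[str]],
    products_by_leaf: Dict[str, List[str]],
    batch_size: int = 16,
    per_leaf: int = 2,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, str]]:
    """Return (product_id, leaf_id) tuples for one batch."""
    if per_leaf < 2 or batch_size // 2 < 2 * per_leaf:
        # Need >= 2 products per leaf (same-leaf pairs) and room for
        # >= 2 sibling leaves in the related half (sibling pairs).
        raise ValueError("batch too small for all three pair classes")
    rng = rng or random.Random()

    # Step 1: pick an anchor parent with at least two child leaves.
    eligible = [p for p, kids in children_by_parent.items() if len(kids) >= 2]
    anchor = rng.choice(eligible)

    def draw(leaves: List[str], n_items: int) -> List[Tuple[str, str]]:
        """Take up to per_leaf products from shuffled leaves until n_items."""
        picked: List[Tuple[str, str]] = []
        pool = list(leaves)
        rng.shuffle(pool)
        for leaf in pool:
            if len(picked) >= n_items:
                break
            items = products_by_leaf[leaf]
            k = min(per_leaf, len(items), n_items - len(picked))
            picked.extend((pid, leaf) for pid in rng.sample(items, k))
        return picked

    # Step 2: fill roughly half the batch from the anchor's children.
    related = draw(children_by_parent[anchor], batch_size // 2)

    # Step 3: fill the remainder from leaves under other parents.
    other_leaves = [
        leaf
        for parent, kids in children_by_parent.items()
        if parent != anchor
        for leaf in kids
    ]
    unrelated = draw(other_leaves, batch_size - len(related))
    return related + unrelated
```

The batch can come out smaller than `batch_size` when the taxonomy has too few leaves to fill it. In a deeper tree, the "unrelated" pool would also need to exclude leaves that share a higher-level ancestor with the anchor.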
3. Add the contrastive term to the primary objective
Add λ · L_contrastive to the existing reconstruction / quantization
loss with a small λ weight. Instacart uses λ = 0.01 —
"strong enough to improve coherence, weak enough not to destabilize
reconstruction" — with coarser codebook levels (L1, L2) weighted
more heavily so broad groupings take priority. (See
concepts/reconstruction-vs-semantic-loss-tradeoff.)
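The weighting can be illustrated with a small pure-Python sketch. Only the overall form L_total = L_reconstruction + L_rq + λ · L_contrastive and the value λ = 0.01 come from the source. Everything else here is an assumption made for illustration: the margin-based pairwise term, the margin values, the per-level weights in `LEVEL_WEIGHTS`, and all function names.

```python
import math
from typing import Dict, Sequence, Tuple

LAMBDA = 0.01                         # value reported in the source
LEVEL_WEIGHTS = (1.0, 1.0, 0.5, 0.5)  # hypothetical: coarse levels weigh more

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pair_term(u: Vector, v: Vector, label: str,
              pos_margin: float = 0.5, neg_margin: float = 1.0) -> float:
    """Assumed margin-based term for one pair.

    strong   -> penalize any distance
    moderate -> penalize only distance beyond pos_margin
    negative -> penalize only distance below neg_margin
    """
    d = euclidean(u, v)
    if label == "strong":
        return d ** 2
    if label == "moderate":
        return max(0.0, d - pos_margin) ** 2
    return max(0.0, neg_margin - d) ** 2


def contrastive_loss(
    level_embeddings: Sequence[Sequence[Vector]],
    labels: Dict[Tuple[int, int], str],
    level_weights: Sequence[float] = LEVEL_WEIGHTS,
) -> float:
    """Weighted mean over codebook levels of the mean pairwise term.

    level_embeddings[l][i] is item i's representation at level l;
    labels maps an index pair (i, j) to "strong", "moderate" or "negative"
    and must contain at least one pair.
    """
    weighted, used = 0.0, 0.0
    for w, embs in zip(level_weights, level_embeddings):
        terms = [pair_term(embs[i], embs[j], lab)
                 for (i, j), lab in labels.items()]
        weighted += w * sum(terms) / len(terms)
        used += w
    return weighted / used


def total_loss(l_recon: float, l_rq: float, l_contrastive: float,
               lam: float = LAMBDA) -> float:
    """L_total = L_reconstruction + L_rq + lambda * L_contrastive."""
    return l_recon + l_rq + lam * l_contrastive
```

With λ = 0.01, a contrastive loss of 2.0 adds only 0.02 to the total, which matches the stated intent of nudging coherence without overwhelming reconstruction.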
Where the pattern fits
The pattern is the training-time supervision ingredient of RQ-VAE codebook as product vocabulary. The broader vocabulary-substrate pattern says "replace atomic IDs with codeword sequences"; this pattern says "and use the taxonomy as supervision when training the codebook so codes respect business-meaningful similarity".
It also generalizes beyond RQ-VAE: any contrastive-learning representation (CLIP-style, two-tower retrieval, classifier embeddings) can incorporate taxonomy supervision when:
- A taxonomy exists prior to interaction data.
- The representation is meant to support retrieval / similarity tasks (not just reconstruction).
- Cold-start coverage matters.
Generalization beyond grocery
Domains where taxonomy-tree contrastive supervision can replace or augment engagement-derived pairs:
| Domain | Tree | Use case |
|---|---|---|
| E-commerce / grocery | Product category tree | Recsys, search, substitution (Instacart canonical) |
| Media | Genre / sub-genre / mood hierarchies | Content recommendation |
| Library systems | LCC, Dewey | Document retrieval |
| Industry data | NAICS, GICS | Company embedding for search |
| Knowledge graphs | Class hierarchies (DBpedia, Wikidata types) | Entity embeddings |
| Bug tracking | Bug taxonomy / component tree | Bug-similarity search |
| Medical | ICD codes, MeSH | Diagnosis embedding |
Caveats
- Taxonomy quality is an upper bound on signal quality. Mislabeled items in the tree (Instacart's own Protein-Bar-in-Candy and Sparkling-Water-in-Soda examples) carry noise into the contrastive signal. The codebook ends up partially encoding taxonomy bugs along with semantics.
- Same-leaf == positive is a simplification. Two items in the same leaf may differ in price, brand, or use-case; the loss treats them as identical positives.
- Tree depth and balance matter. Shallow trees give only same-leaf vs different-leaf signal; unbalanced trees produce inconsistent sibling-pair semantics.
- Engagement-blind by design. This pattern alone cannot capture user-revealed cross-category substitutes (e.g. bread-crumbs vs panko if they're in different leaves). Engagement-signal augmentation is the explicit Instacart next-step.
- λ weight is workload-specific. Calibrating to a new domain requires hyperparameter tuning; the 0.01 value is a domain-specific data point, not a universal default.
- No publicly disclosed ablation against baselines. Instacart asserts the pattern works but doesn't publish reconstruction-only vs reconstruction+contrastive ablation numbers.
Seen in
- sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale
— canonical wiki instance: catalog-tree contrastive
regularization for Instacart's RQ-VAE codebook training, with
loss formula L_total = L_reconstruction + L_rq + λ · L_contrastive at λ = 0.01, three-class pair labelling (same-leaf strong + / sibling-leaf mod + / no-shared-ancestor −), and hierarchical batch sampling. Explicitly framed as the cold-start-compatible alternative to PLUM's engagement-data approach.
Related
- concepts/contrastive-regularization-with-catalog-structure — the concept this pattern instantiates.
- concepts/hierarchical-batch-sampling-for-contrastive-loss — the companion technique.
- concepts/reconstruction-vs-semantic-loss-tradeoff — the multi-objective balance.
- concepts/semantic-id / concepts/cold-start — supporting concepts.
- systems/rq-vae / systems/instacart-semantic-ids — algorithmic substrate + production instance.
- patterns/rq-vae-codebook-as-product-vocabulary — the broader vocabulary-substrate pattern this is the training-time ingredient of.