PATTERN Cited by 1 source

Contrastive loss via taxonomy tree¶

Pattern¶

Use a product / item taxonomy tree as the graded supervision signal for a contrastive loss term during representation learning, in place of (or alongside) engagement-derived pair labels. The tree provides three pair classes structurally:

Pair relationship	Contrastive label
Same leaf category	Strong positive — pull together tightly
Sibling leaf, shared parent	Moderate positive — pull together moderately
No shared ancestor	Negative — push apart

The pattern composes with deliberate batch construction (hierarchical batch sampling) so each batch contains all three pair classes and every class contributes gradient signal.

Why use a tree (not engagement data)¶

The classical contrastive-recsys recipe uses engagement-derived pairs: co-purchased items, co-viewed items, sequential session items. YouTube's PLUM is the canonical recent example — codebooks aligned to user behavior patterns.

Taxonomy-tree supervision is the cold-start-compatible alternative: the tree exists on day 1 for every item, including items with zero engagement history. The Instacart post is explicit (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale):

"Inspired by PLUM's behavioral alignment approach, we added a contrastive term to RQ-VAE training, using our catalog taxonomy as the supervision signal rather than engagement data (which isn't available for cold-start products)."

The trade-off:

Supervision source	Strengths	Weaknesses
Taxonomy tree	Day-1 coverage; cold-start compatible; deterministic; no labeling cost	Misses user-revealed substitutability; encodes catalog-team biases; depends on tree quality
Engagement data	Captures user-revealed similarity; cross-category bridges; behavior-aligned	Sparse for new items; popularity-biased; expensive to label / curate
Both (PLUM extended with taxonomy, or taxonomy extended with engagement — Instacart's stated future direction)	Complementary; cold-start + behavior-aware	Hyperparameter complexity; potential signal conflicts

The training-time recipe¶

Three structural pieces:

1. Define pair labels from tree-distance¶

For each pair (a, b) in a training batch, compute the tree-distance between the leaf categories of a and b, measured here as the number of levels from the leaf up to the lowest common ancestor:

Distance 0 (same leaf) → strong positive.
Distance 1 (sibling leaves, shared parent) → moderate positive.
Distance > 1 (no shared parent, up to no shared ancestor at any depth) → negative.

Optionally, grade the labels further with intermediate distances (cousin leaves, second-cousin leaves); Instacart's recipe is the three-class version.
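The labelling rule above can be sketched as a small function. This is a minimal illustration, not Instacart's code: it assumes each item carries a hypothetical category path, a tuple running from the top-level category down to the leaf, with no implicit global root shared by all paths. The names `PairLabel`, `levels_to_common_ancestor`, and `pair_label` are invented for this sketch. Cousin leaves (distance 2) are treated as negatives here, following the "Distance > 1" rule; the source does not say how Instacart handles them.

```python
from enum import Enum


class PairLabel(Enum):
    STRONG_POSITIVE = "strong_positive"
    MODERATE_POSITIVE = "moderate_positive"
    NEGATIVE = "negative"


def levels_to_common_ancestor(path_a, path_b):
    """Levels to climb from the leaf to the lowest shared node.

    Paths are tuples such as ("Snacks", "Chips", "Tortilla Chips").
    Returns None when the two paths share no node at all.
    """
    shared = 0
    for node_a, node_b in zip(path_a, path_b):
        if node_a != node_b:
            break
        shared += 1
    if shared == 0:
        return None
    # For paths of unequal depth, measure from the deeper leaf.
    return max(len(path_a), len(path_b)) - shared


def pair_label(path_a, path_b):
    """Map a pair of category paths to one of the three contrastive classes."""
    distance = levels_to_common_ancestor(path_a, path_b)
    if distance == 0:
        return PairLabel.STRONG_POSITIVE      # same leaf
    if distance == 1:
        return PairLabel.MODERATE_POSITIVE    # sibling leaves, shared parent
    return PairLabel.NEGATIVE                 # anything further apart, or unrelated
```

Extending the recipe to more classes would mean adding further branches on `distance` (for example, a weak positive at distance 2) rather than changing the distance computation.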

2. Construct batches deliberately¶

Use hierarchical batch sampling: pick a parent category → fill ~half the batch from its children → fill the rest from unrelated categories → sample multiple products per category slot. This guarantees all three pair classes appear in every batch.
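A rough sketch of the sampling steps above, under simplifying assumptions: the taxonomy is flattened to two hypothetical dictionaries (`children_by_parent` mapping a parent category to its leaf categories, and `products_by_leaf` mapping a leaf to its product IDs), and leaves under any other parent are treated as "unrelated". The function name, parameters, and default sizes are illustrative and do not come from the Instacart post.

```python
import random


def sample_hierarchical_batch(children_by_parent, products_by_leaf,
                              batch_size=32, products_per_slot=4, rng=None):
    """Return (parent, batch) where batch is a list of (product_id, leaf) tuples."""
    rng = rng or random.Random()
    slots = batch_size // products_per_slot   # number of category slots
    related_slots = slots // 2                # roughly half go to one parent's children

    # 1. Pick a parent category.
    parent = rng.choice(sorted(children_by_parent))

    # 2. Fill about half the slots with that parent's leaf categories.
    siblings = children_by_parent[parent]
    related = rng.sample(siblings, min(related_slots, len(siblings)))

    # 3. Fill the remaining slots with leaves from other parents.
    unrelated_pool = [leaf
                      for other, leaves in sorted(children_by_parent.items())
                      if other != parent
                      for leaf in leaves]
    unrelated = rng.sample(unrelated_pool,
                           min(slots - len(related), len(unrelated_pool)))

    # 4. Sample several products per category slot.
    batch = []
    for leaf in related + unrelated:
        items = products_by_leaf[leaf]
        for product in rng.sample(items, min(products_per_slot, len(items))):
            batch.append((product, leaf))
    return parent, batch
```

In this sketch, all three pair classes appear only if the chosen parent has at least two leaves, each sampled leaf has at least two products, and at least one other parent exists. A production sampler would presumably filter or resample parents that fail those conditions.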

3. Add the contrastive term to the primary objective¶

Add λ · L_contrastive to the existing reconstruction / quantization loss with a small λ weight. Instacart uses λ = 0.01 — "strong enough to improve coherence, weak enough not to destabilize reconstruction" — with coarser codebook levels (L1, L2) weighted more heavily so broad groupings take priority. (See concepts/reconstruction-vs-semantic-loss-tradeoff.)
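The source gives the combined objective as L_total = L_reconstruction + L_rq + λ · L_contrastive but does not publish the exact form of the contrastive term or the per-level weights. The sketch below fills those gaps with assumptions, all of them hypothetical: a margin-based pair loss (pull same-leaf pairs to zero distance, pull sibling pairs to within `moderate_margin`, push negatives beyond `negative_margin`), one embedding per item per codebook level, and a caller-supplied `level_weights` list that puts more weight on coarser levels.

```python
import math


def pair_loss(emb_a, emb_b, label, moderate_margin=0.5, negative_margin=1.0):
    """Graded margin loss for one pair; margins are illustrative values."""
    d = math.dist(emb_a, emb_b)
    if label == "strong_positive":
        return d ** 2                                # pull together tightly
    if label == "moderate_positive":
        return max(0.0, d - moderate_margin) ** 2    # pull to within a loose margin
    return max(0.0, negative_margin - d) ** 2        # push apart up to the margin


def contrastive_loss(level_embeddings, labelled_pairs, level_weights):
    """Weighted sum over codebook levels of the mean pair loss.

    level_embeddings[k][i] is item i's embedding at codebook level k;
    labelled_pairs is a list of (i, j, label) tuples.
    """
    total = 0.0
    for embeddings, weight in zip(level_embeddings, level_weights):
        level_sum = sum(pair_loss(embeddings[i], embeddings[j], label)
                        for i, j, label in labelled_pairs)
        total += weight * level_sum / len(labelled_pairs)
    return total


def total_loss(l_reconstruction, l_rq, l_contrastive, lam=0.01):
    """L_total = L_reconstruction + L_rq + lam * L_contrastive."""
    return l_reconstruction + l_rq + lam * l_contrastive
```

With `lam=0.01`, the contrastive term has to be about two orders of magnitude larger than the reconstruction loss before it dominates the total, which matches the intent of keeping it a regularizer rather than the main objective.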

Where the pattern fits¶

The pattern is the training-time supervision ingredient of RQ-VAE codebook as product vocabulary. The broader vocabulary-substrate pattern says "replace atomic IDs with codeword sequences"; this pattern says "and use the taxonomy as supervision when training the codebook so codes respect business-meaningful similarity".

It also generalizes beyond RQ-VAE: any contrastive-learning representation (CLIP-style, two-tower retrieval, classifier embeddings) can incorporate taxonomy supervision when:

A taxonomy exists prior to interaction data.
The representation is meant to support retrieval / similarity tasks (not just reconstruction).
Cold-start coverage matters.

Generalization beyond grocery¶

Domains where taxonomy-tree contrastive supervision can replace or augment engagement-derived pairs:

Domain	Tree	Use case
E-commerce / grocery	Product category tree	Recsys, search, substitution (Instacart canonical)
Media	Genre / sub-genre / mood hierarchies	Content recommendation
Library systems	LCC, Dewey	Document retrieval
Industry data	NAICS, GICS	Company embedding for search
Knowledge graphs	Class hierarchies (DBpedia, Wikidata types)	Entity embeddings
Bug tracking	Bug taxonomy / component tree	Bug-similarity search
Medical	ICD codes, MeSH	Diagnosis embedding

Caveats¶

Taxonomy quality is an upper bound. Mislabeled items in the tree (Instacart's own Protein-Bar-in-Candy and Sparkling-Water-in-Soda examples) carry noise into the contrastive signal. The codebook ends up partially encoding taxonomy bugs along with semantics.
Same-leaf == positive is a simplification. Two items in the same leaf may differ in price, brand, or use-case; the loss treats them as identical positives.
Tree depth and balance matter. Shallow trees give only same-leaf vs different-leaf signal; unbalanced trees produce inconsistent sibling-pair semantics.
Engagement-blind by design. This pattern alone cannot capture user-revealed cross-category substitutes (e.g. bread-crumbs vs panko if they're in different leaves). Engagement-signal augmentation is the explicit Instacart next-step.
λ weight is workload-specific. Calibrating to a new domain requires hyperparameter tuning; the 0.01 value is a domain-specific data point, not a universal default.
No publicly disclosed ablation against baselines. Instacart asserts the pattern works but doesn't publish reconstruction-only vs reconstruction+contrastive ablation numbers.

Seen in¶

sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale — canonical wiki instance: catalog-tree contrastive regularization for Instacart's RQ-VAE codebook training, with loss formula L_total = L_reconstruction + L_rq + λ · L_contrastive at λ = 0.01, three-class pair labelling (same-leaf strong + / sibling-leaf mod + / no-shared-ancestor −), and hierarchical batch sampling. Explicitly framed as the cold-start-compatible alternative to PLUM's engagement-data approach.

concepts/contrastive-regularization-with-catalog-structure — the concept this pattern instantiates.
concepts/hierarchical-batch-sampling-for-contrastive-loss — the companion technique.
concepts/reconstruction-vs-semantic-loss-tradeoff — the multi-objective balance.
concepts/semantic-id / concepts/cold-start — supporting concepts.
systems/rq-vae / systems/instacart-semantic-ids — algorithmic substrate + production instance.
patterns/rq-vae-codebook-as-product-vocabulary — the broader vocabulary-substrate pattern this is the training-time ingredient of.