PATTERN Cited by 1 source

Contrastive loss via taxonomy tree

Pattern

Use a product / item taxonomy tree as the graded supervision signal for a contrastive loss term during representation learning, in place of (or alongside) engagement-derived pair labels. The tree provides three pair classes structurally:

| Pair relationship | Contrastive label |
| --- | --- |
| Same leaf category | Strong positive: pull together tightly |
| Sibling leaf, shared parent | Moderate positive: pull together moderately |
| No shared ancestor | Negative: push apart |

The pattern composes with deliberate batch construction (hierarchical batch sampling) so each batch contains all three pair classes, ensuring meaningful gradient signal.
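One way to turn the three pair classes into a per-pair loss is a margin-based contrastive loss with a looser target for moderate positives. The source does not disclose the functional form Instacart uses, so the sketch below is only an assumption: the function name `graded_pair_loss`, the Euclidean distance, and the constants `MODERATE_RADIUS` and `NEGATIVE_MARGIN` are all hypothetical.

```python
import math
from typing import Sequence

# Hypothetical hyperparameters, chosen only for illustration.
MODERATE_RADIUS = 0.5   # sibling-leaf pairs are pulled only until this close
NEGATIVE_MARGIN = 1.0   # unrelated pairs are pushed until at least this far apart


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def graded_pair_loss(u: Sequence[float], v: Sequence[float], label: str) -> float:
    """Margin-style loss for one pair, graded by its taxonomy label (assumed form)."""
    d = euclidean(u, v)
    if label == "strong_positive":
        # Same leaf: any remaining distance is penalized.
        return d ** 2
    if label == "moderate_positive":
        # Sibling leaves: penalized only while farther apart than the loose radius.
        return max(0.0, d - MODERATE_RADIUS) ** 2
    if label == "negative":
        # Unrelated: penalized only while closer than the margin.
        return max(0.0, NEGATIVE_MARGIN - d) ** 2
    raise ValueError(f"unknown label: {label}")
```

With these settings a sibling pair sitting exactly at the loose radius contributes no loss, while a same-leaf pair at the same distance is still pulled closer. That is the "tightly" versus "moderately" distinction in the table.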

Why use a tree (not engagement data)

The classical contrastive-recsys recipe uses engagement-derived pairs: co-purchased items, co-viewed items, sequential session items. YouTube's PLUM is the canonical recent example — codebooks aligned to user behavior patterns.

Taxonomy-tree supervision is the cold-start-compatible alternative: the tree exists on day 1 for every item, including items with zero engagement history. The Instacart post is explicit (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale):

"Inspired by PLUM's behavioral alignment approach, we added a contrastive term to RQ-VAE training, using our catalog taxonomy as the supervision signal rather than engagement data (which isn't available for cold-start products)."

The trade-off:

| Supervision source | Strengths | Weaknesses |
| --- | --- | --- |
| Taxonomy tree | Day-1 coverage; cold-start compatible; deterministic; no labeling cost | Misses user-revealed substitutability; encodes catalog-team biases; depends on tree quality |
| Engagement data | Captures user-revealed similarity; cross-category bridges; behavior-aligned | Sparse for new items; popularity-biased; expensive to label / curate |
| Both (PLUM extended with taxonomy, or taxonomy extended with engagement; the latter is Instacart's stated future direction) | Complementary; cold-start + behavior-aware | Hyperparameter complexity; potential signal conflicts |

The training-time recipe

Three structural pieces:

1. Define pair labels from tree-distance

For each pair (a, b) in a training batch, compute the tree-distance between the leaf categories of a and b, counted here as the number of levels you climb from a leaf to reach the lowest category that contains both:

  • Distance 0 (same leaf) → strong positive.
  • Distance 1 (sibling leaves, shared parent) → moderate positive.
  • Distance > 1 (no shared ancestor within the depth being checked) → negative.

Optionally extend the grading with intermediate classes for larger distances (cousin leaves, second-cousin leaves). Instacart's recipe is the three-class version.
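The labelling rule above can be sketched in a few lines. This is a minimal illustration, assuming each product's leaf category is stored as its path from the root (a hypothetical representation; the source does not say how Instacart encodes the tree) and that all leaves sit at the same depth.

```python
from typing import Sequence

# Hypothetical representation: a leaf category is its path from the root,
# e.g. ("Beverages", "Soda", "Cola").
CategoryPath = Sequence[str]

STRONG_POSITIVE = "strong_positive"
MODERATE_POSITIVE = "moderate_positive"
NEGATIVE = "negative"


def levels_to_common_ancestor(a: CategoryPath, b: CategoryPath) -> int:
    """Levels climbed from the leaf to the lowest category containing both paths.

    0 means same leaf, 1 means sibling leaves under one parent.
    """
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return max(len(a), len(b)) - shared


def pair_label(a: CategoryPath, b: CategoryPath) -> str:
    """Three-class label: distance 0, distance 1, everything else."""
    distance = levels_to_common_ancestor(a, b)
    if distance == 0:
        return STRONG_POSITIVE
    if distance == 1:
        return MODERATE_POSITIVE
    return NEGATIVE
```

For example, `pair_label(("Beverages", "Soda", "Cola"), ("Beverages", "Soda", "Lemon-Lime"))` returns `"moderate_positive"`. Note that under this rule cousin leaves (distance 2) are labelled negative; adding intermediate classes would mean adding branches for larger distances.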

2. Construct batches deliberately

Use hierarchical batch sampling: pick a parent category → fill ~half the batch from its children → fill the rest from unrelated categories → sample multiple products per category slot. This guarantees all three pair classes appear in every batch.
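The sampling steps can be sketched as follows. This is an illustrative sketch only, not Instacart's implementation: the function name, the two dictionaries `children_of` (parent category to leaf categories) and `products_in` (leaf category to product ids), and the `per_leaf` parameter are hypothetical, and "unrelated" is simplified to "any leaf under a different parent".

```python
import random
from typing import Dict, List, Optional, Tuple


def sample_hierarchical_batch(
    children_of: Dict[str, List[str]],
    products_in: Dict[str, List[str]],
    batch_size: int,
    per_leaf: int = 2,
    rng: Optional[random.Random] = None,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (anchor parent, [(product_id, leaf_category), ...])."""
    rng = rng or random.Random()
    n_slots = batch_size // per_leaf   # category slots in the batch
    n_related = n_slots // 2           # about half the slots come from one parent

    # 1. Pick a parent with at least two children so sibling pairs can exist.
    parent = rng.choice([p for p, kids in children_of.items() if len(kids) >= 2])

    # 2. Fill about half the slots from that parent's children.
    kids = children_of[parent]
    related = rng.sample(kids, min(n_related, len(kids)))

    # 3. Fill the remaining slots from leaves under other parents.
    pool = [leaf for p, ks in children_of.items() if p != parent for leaf in ks]
    unrelated = rng.sample(pool, min(n_slots - len(related), len(pool)))

    # 4. Sample several products per slot so same-leaf pairs appear too.
    batch: List[Tuple[str, str]] = []
    for leaf in related + unrelated:
        items = products_in[leaf]
        for pid in rng.sample(items, min(per_leaf, len(items))):
            batch.append((pid, leaf))
    return parent, batch
```

All three pair classes appear only if `per_leaf` is at least 2, at least two sibling leaves are drawn, and at least one unrelated leaf is available; a real sampler would need to enforce or check those conditions.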

3. Add the contrastive term to the primary objective

Add λ · L_contrastive to the existing reconstruction / quantization loss with a small weight λ. Instacart uses λ = 0.01, "strong enough to improve coherence, weak enough not to destabilize reconstruction", with coarser codebook levels (L1, L2) weighted more heavily so broad groupings take priority. (See concepts/reconstruction-vs-semantic-loss-tradeoff.)
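As a rough sketch of how the terms combine, the function below computes L_total = L_reconstruction + L_rq + λ · L_contrastive, with the contrastive term formed as a weighted sum over codebook levels. The per-level weighting scheme and the example weights are assumptions: the source says only that coarser levels are weighted more heavily, not by how much.

```python
from typing import Sequence


def total_loss(
    l_reconstruction: float,
    l_rq: float,
    contrastive_per_level: Sequence[float],
    level_weights: Sequence[float],
    lam: float = 0.01,
) -> float:
    """Combine the primary losses with a level-weighted contrastive term.

    contrastive_per_level[i] is the contrastive loss measured at codebook
    level i (index 0 = coarsest). level_weights are hypothetical values.
    """
    if len(contrastive_per_level) != len(level_weights):
        raise ValueError("need exactly one weight per codebook level")
    l_contrastive = sum(
        w * loss for w, loss in zip(level_weights, contrastive_per_level)
    )
    return l_reconstruction + l_rq + lam * l_contrastive


# Example with made-up loss values and decreasing weights for finer levels.
example = total_loss(1.0, 0.5, [2.0, 1.0, 1.0], [1.0, 0.5, 0.25])
```

With λ = 0.01 the contrastive term shifts the total only slightly in this example, which matches the stated intent that it acts as a regularizer and does not compete with reconstruction.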

Where the pattern fits

The pattern is the catalog-supervision-time ingredient of RQ-VAE codebook as product vocabulary. The broader vocabulary-substrate pattern says "replace atomic IDs with codeword sequences"; this pattern says "and use the taxonomy as supervision when training the codebook so codes respect business-meaningful similarity".

It also generalizes beyond RQ-VAE: any contrastive-learning representation (CLIP-style, two-tower retrieval, classifier embeddings) can incorporate taxonomy supervision when:

  • A taxonomy exists prior to interaction data.
  • The representation is meant to support retrieval / similarity tasks (not just reconstruction).
  • Cold-start coverage matters.

Generalization beyond grocery

Domains where taxonomy-tree contrastive supervision can replace or augment engagement-derived pairs:

| Domain | Tree | Use case |
| --- | --- | --- |
| E-commerce / grocery | Product category tree | Recsys, search, substitution (Instacart canonical) |
| Media | Genre / sub-genre / mood hierarchies | Content recommendation |
| Library systems | LCC, Dewey | Document retrieval |
| Industry data | NAICS, GICS | Company embedding for search |
| Knowledge graphs | Class hierarchies (DBpedia, Wikidata types) | Entity embeddings |
| Bug tracking | Bug taxonomy / component tree | Bug-similarity search |
| Medical | ICD codes, MeSH | Diagnosis embedding |

Caveats

  • Taxonomy quality is an upper bound. Mislabeled items in the tree (Instacart's own Protein-Bar-in-Candy and Sparkling-Water-in-Soda examples) carry noise into the contrastive signal. The codebook ends up partially encoding taxonomy bugs along with semantics.
  • Same-leaf == positive is a simplification. Two items in the same leaf may differ in price, brand, or use-case; the loss treats them as identical positives.
  • Tree depth and balance matter. Shallow trees give only same-leaf vs different-leaf signal; unbalanced trees produce inconsistent sibling-pair semantics.
  • Engagement-blind by design. This pattern alone cannot capture user-revealed cross-category substitutes (e.g. bread-crumbs vs panko if they're in different leaves). Engagement-signal augmentation is the explicit Instacart next-step.
  • λ weight is workload-specific. Calibrating to a new domain requires hyperparameter tuning; the 0.01 value is a domain-specific data point, not a universal default.
  • No publicly disclosed ablation against baselines. Instacart asserts the pattern works but doesn't publish reconstruction-only vs reconstruction+contrastive ablation numbers.

Seen in

  • sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale — canonical wiki instance: catalog-tree contrastive regularization for Instacart's RQ-VAE codebook training, with loss formula L_total = L_reconstruction + L_rq + λ · L_contrastive at λ = 0.01, three-class pair labelling (same-leaf strong + / sibling-leaf mod + / no-shared-ancestor −), and hierarchical batch sampling. Explicitly framed as the cold-start-compatible alternative to PLUM's engagement-data approach.