CONCEPT Cited by 1 source

Hierarchical batch sampling for contrastive loss

Definition

Hierarchical batch sampling is the technique of constructing training batches deliberately from a taxonomy structure so that a catalog-tree contrastive loss always has positive signal to learn from. Random sampling over millions of catalog items would produce batches that are "entirely unrelated — the loss would have no positive signal to learn from." The hierarchical sampler fixes this by pre-committing each batch's structure.

The recipe

For each training batch (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale):

  1. Pick a random parent category from the taxonomy (e.g. Pasta & Pizza Sauces).
  2. Fill ~half the batch with products from its child categories (e.g. marinara, alfredo, pesto). This guarantees the batch contains sibling-leaf pairs (moderate positives in the contrastive labelling).
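  3. Fill the other half with products from unrelated categories (e.g. laundry detergent, dog food). This provides the negatives: pairs with no shared ancestor. Because these categories are far from the chosen parent, they are easy negatives rather than hard ones.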
  4. Within each category slot, sample multiple products. This guarantees the batch contains same-leaf pairs automatically (e.g. two marinara sauces from different brands → strong positives).
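The four steps above can be sketched as a small sampler. This is a minimal illustration under stated assumptions, not Instacart's implementation: the function name `sample_batch`, the `per_leaf` parameter, the toy catalog, and the two-level taxonomy (parent → leaf) are all hypothetical. In a two-level tree, "a leaf under a different parent" is the same thing as "no shared ancestor"; a deeper tree would need a stricter check.

```python
import random
from collections import defaultdict

# Toy two-level taxonomy (hypothetical): leaf category -> parent category.
PARENT_OF = {
    "marinara": "Pasta & Pizza Sauces",
    "alfredo": "Pasta & Pizza Sauces",
    "pesto": "Pasta & Pizza Sauces",
    "laundry_detergent": "Household",
    "dish_soap": "Household",
    "dog_food": "Pet",
    "cat_food": "Pet",
}
# Four made-up product ids per leaf.
PRODUCTS_BY_LEAF = {leaf: [f"{leaf}-{i}" for i in range(4)] for leaf in PARENT_OF}


def sample_batch(products_by_leaf, parent_of, batch_size=12, per_leaf=2, rng=None):
    """Return (parent, batch) where batch is a list of (product_id, leaf) tuples."""
    rng = rng or random.Random()
    leaves_by_parent = defaultdict(list)
    for leaf, parent in parent_of.items():
        leaves_by_parent[parent].append(leaf)

    # Step 1: pick a random parent that has at least two child leaves,
    # so that sibling-leaf pairs are possible.
    eligible = [p for p, leaves in leaves_by_parent.items() if len(leaves) >= 2]
    parent = rng.choice(eligible)

    def fill(leaves, quota):
        # Step 4: take up to `per_leaf` distinct products from each leaf slot.
        picked = []
        leaves = list(leaves)
        rng.shuffle(leaves)
        for leaf in leaves:
            if len(picked) >= quota:
                break
            k = min(per_leaf, len(products_by_leaf[leaf]), quota - len(picked))
            picked += [(pid, leaf) for pid in rng.sample(products_by_leaf[leaf], k)]
        return picked

    # Step 2: about half the batch from the chosen parent's children.
    related = fill(leaves_by_parent[parent], batch_size // 2)
    # Step 3: the rest from leaves under any other parent.
    other_leaves = [leaf for leaf, p in parent_of.items() if p != parent]
    unrelated = fill(other_leaves, batch_size - len(related))
    return parent, related + unrelated


parent, batch = sample_batch(PRODUCTS_BY_LEAF, PARENT_OF, rng=random.Random(0))
```

If the catalog is too small to fill a quota, this sketch simply returns a shorter batch; a production sampler would need an explicit policy for that case.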

Quote: "No explicit pair labeling is needed — the catalog structure does the work." The taxonomy provides the pair labels structurally rather than requiring expensive human-labelled positive/negative pairs.
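To illustrate how the catalog structure can stand in for explicit pair labels, the sketch below derives a label for every pair in a batch from leaf and parent membership alone. The function name `pair_class`, the class names, and the toy taxonomy are assumptions for illustration; the post does not publish its labelling code.

```python
from collections import Counter
from itertools import combinations

# Hypothetical two-level taxonomy: leaf category -> parent category.
PARENT_OF = {
    "marinara": "Pasta & Pizza Sauces",
    "alfredo": "Pasta & Pizza Sauces",
    "pesto": "Pasta & Pizza Sauces",
    "laundry_detergent": "Household",
    "dog_food": "Pet",
}


def pair_class(leaf_a, leaf_b, parent_of):
    """Label a pair of products using only their positions in the taxonomy."""
    if leaf_a == leaf_b:
        return "same_leaf"      # strong positive
    if parent_of[leaf_a] == parent_of[leaf_b]:
        return "sibling_leaf"   # moderate positive
    return "negative"           # no shared parent in this two-level tree


# Leaf category of each product in a small batch (two marinara products).
batch_leaves = ["marinara", "marinara", "alfredo", "pesto",
                "laundry_detergent", "dog_food"]

counts = Counter(
    pair_class(a, b, PARENT_OF) for a, b in combinations(batch_leaves, 2)
)
# counts == {"same_leaf": 1, "sibling_leaf": 5, "negative": 9}
```

Six products yield 15 pairs, and every one of them receives a label without any human annotation.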

Why it works

A contrastive loss compares pairs of items within a batch. If a batch contains only unrelated products, every pair is a negative: the loss can push items apart but has nothing to pull together, so it never learns which products belong near each other. The hierarchical sampler ensures every batch contains all three pair classes the loss distinguishes:

| Pair class | Source in batch | Loss action |
|---|---|---|
| Same leaf (strong positive) | Multi-product sampling within each category slot | Pull tightly together |
| Sibling leaf, shared parent (moderate positive) | The "children of one parent" half of the batch | Pull moderately together |
| No shared ancestor (negative) | The "unrelated categories" half of the batch | Push apart |
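One way to turn the three loss actions into a number is a graded-target pairwise loss: each pair class gets a target cosine similarity, and the loss is the mean squared gap between actual and target similarity. This is an illustrative stand-in, not the loss the post uses; the target values (1.0, 0.5, 0.0), the function names, and the toy embeddings are all assumptions.

```python
import math
from itertools import combinations

# Assumed target similarity per pair class (illustrative values only).
TARGET = {"same_leaf": 1.0, "sibling_leaf": 0.5, "negative": 0.0}

# Hypothetical two-level taxonomy: leaf category -> parent category.
PARENT_OF = {
    "marinara": "Pasta & Pizza Sauces",
    "alfredo": "Pasta & Pizza Sauces",
    "dog_food": "Pet",
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def graded_pair_loss(embeddings, leaves, parent_of):
    """Mean squared gap between pairwise cosine similarity and its class target."""
    total, n_pairs = 0.0, 0
    for i, j in combinations(range(len(leaves)), 2):
        if leaves[i] == leaves[j]:
            cls = "same_leaf"
        elif parent_of[leaves[i]] == parent_of[leaves[j]]:
            cls = "sibling_leaf"
        else:
            cls = "negative"
        total += (cosine(embeddings[i], embeddings[j]) - TARGET[cls]) ** 2
        n_pairs += 1
    return total / n_pairs


leaves = ["marinara", "marinara", "alfredo", "dog_food"]

# Embeddings that already respect the taxonomy: marinara items coincide,
# alfredo sits at 60 degrees (cosine 0.5), dog food is orthogonal.
well_placed = [[1, 0, 0], [1, 0, 0], [0.5, math.sqrt(3) / 2, 0], [0, 0, 1]]
# Collapsed embeddings: every product maps to the same point.
collapsed = [[1, 0, 0]] * 4

loss_good = graded_pair_loss(well_placed, leaves, PARENT_OF)
loss_bad = graded_pair_loss(collapsed, leaves, PARENT_OF)
```

The well-placed embeddings score close to zero, while the collapsed ones are penalised on every sibling and negative pair. If the batch held only negatives, only the third branch would ever fire, which is the failure the sampler is designed to avoid.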

The construction does not depend on engagement data. It needs only the taxonomy structure, which is available for every product on day 1, including cold-start products with no purchase history.

Generalisation: any structured taxonomy works

The pattern generalises to any domain where a hierarchical category tree exists:

  • E-commerce / grocery catalogs (Instacart canonical instance).
  • Media catalogs (Netflix-style genre hierarchies).
  • Library / academic taxonomies (LCC, Dewey).
  • Industry codes (NAICS, GICS).
  • Knowledge graphs with class hierarchies.

The deeper the tree, the more graded levels of similarity are available. A flat taxonomy (leaf categories with no parent level above them) still gives the same-leaf vs different-leaf signal but loses the sibling-leaf gradient.
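To make the link between tree depth and granularity concrete, the sketch below grades a pair by how much of the root-to-leaf path the two products share. The grading rule (shared depth divided by path length), the function names, and the "Pantry" and "Canned Goods" levels are assumptions for illustration; the source describes only three pair classes.

```python
def shared_depth(path_a, path_b):
    """Number of leading taxonomy levels two root-to-leaf paths have in common."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth


def similarity_grade(path_a, path_b):
    """Shared depth as a fraction of the longer path: 1.0 means same leaf."""
    return shared_depth(path_a, path_b) / max(len(path_a), len(path_b))


# Hypothetical three-level paths: department -> parent category -> leaf.
marinara = ("Pantry", "Pasta & Pizza Sauces", "Marinara")
pesto = ("Pantry", "Pasta & Pizza Sauces", "Pesto")
canned_tomatoes = ("Pantry", "Canned Goods", "Canned Tomatoes")
dog_food = ("Pet", "Dog", "Dry Dog Food")

grades = [
    similarity_grade(marinara, marinara),         # 3/3: same leaf
    similarity_grade(marinara, pesto),            # 2/3: sibling leaves
    similarity_grade(marinara, canned_tomatoes),  # 1/3: same department only
    similarity_grade(marinara, dog_food),         # 0/3: nothing shared
]
```

A three-level tree yields four distinct grades here, and each additional level adds one more; with a single level, only the same-leaf and different-leaf grades remain.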

Caveats

  • Skewed category sizes can bias the sampler. Very large leaf categories (e.g. Frozen Vegetables with thousands of items) can dominate the same-leaf-pair pool, while tiny leaves (sparse new categories) contribute few pairs or none. Reweighting by category size is a common adjustment; the Instacart post does not say whether one is applied.
  • Tree topology matters. A balanced taxonomy gives clean sibling-leaf signal; an unbalanced one (one parent has 50 children, another has 2) makes sibling-pair semantics inconsistent.
  • The "unrelated categories" definition is implicit. The post defines negatives as "products with no shared ancestor"; in practice the negative half of the batch is sampled from arbitrary other top-level categories. It is an open question whether very distant negatives (laundry detergent vs marinara sauce) carry the same gradient signal as closer ones (canned tomatoes vs marinara sauce, where canned tomatoes sit outside "Pasta & Pizza Sauces" but are ingredient-related).
  • Mixing-ratio hyperparameter. Instacart's recipe uses ~half positive-eligible (children of one parent) and ~half negative (unrelated). The ratio's optimal value is workload-dependent.
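  • The parent-selection distribution affects tail coverage. Choosing parents uniformly gives every category the same number of batches regardless of size, so each product in a large category is seen less often; choosing parents in proportion to product count does the opposite and undersamples rare categories. Which way to lean is a separate design decision.
  • Within-category multi-sampling depth (how many products per category slot) controls same-leaf-pair density at the cost of batch diversity. The post does not disclose the value used.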
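One common way to handle the size-skew caveats above is to pick parents with probability proportional to product count raised to an exponent between 0 and 1. The sketch below is an assumption-labelled illustration: the post does not describe any reweighting, and the function name `parent_weights`, the exponent `alpha`, the category names other than Pasta & Pizza Sauces, and all counts are hypothetical.

```python
import random


def parent_weights(sizes, alpha):
    """Selection probability per parent, proportional to size ** alpha.

    alpha = 0 gives uniform selection over parents;
    alpha = 1 gives selection proportional to product count;
    values in between interpolate.
    """
    raw = {parent: n ** alpha for parent, n in sizes.items()}
    total = sum(raw.values())
    return {parent: w / total for parent, w in raw.items()}


# Hypothetical product counts per parent category.
sizes = {"Frozen Foods": 900, "Pasta & Pizza Sauces": 90, "New Arrivals": 10}

uniform = parent_weights(sizes, alpha=0.0)       # each parent: 1/3
proportional = parent_weights(sizes, alpha=1.0)  # 0.90, 0.09, 0.01
smoothed = parent_weights(sizes, alpha=0.5)      # between the two extremes

# Draw one parent for the next batch using the smoothed distribution.
rng = random.Random(0)
parent = rng.choices(list(smoothed), weights=list(smoothed.values()), k=1)[0]
```

With `alpha=0.5` the smallest category is selected more often than under proportional sampling but less often than under uniform sampling, which trades tail coverage against repeated exposure to the same few products.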

Seen in

  • sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale — first canonical wiki disclosure: hierarchical batch sampling for Instacart's RQ-VAE catalog-tree contrastive training. Pick parent → ~half batch from its children → rest from unrelated categories → multi-sample within each category slot. "No explicit pair labeling is needed — the catalog structure does the work."
Last updated · 542 distilled / 1,571 read