CONCEPT Cited by 1 source
Hierarchical batch sampling for contrastive loss¶
Definition¶
Hierarchical batch sampling is the technique of constructing training batches deliberately from a taxonomy structure so that a catalog-tree contrastive loss always has positive signal to learn from. Random sampling over millions of catalog items would produce batches that are "entirely unrelated — the loss would have no positive signal to learn from." The hierarchical sampler fixes this by pre-committing each batch's structure.
The recipe¶
For each training batch (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale):
- Pick a random parent category from the taxonomy (e.g. Pasta & Pizza Sauces).
- Fill ~half the batch with products from its child categories (e.g. marinara, alfredo, pesto). This guarantees the batch contains sibling-leaf pairs (moderate positives in the contrastive labelling).
- Fill the other half with products from unrelated categories (e.g. laundry detergent, dog food). This provides the negatives: pairs with no shared ancestor, which the loss pushes apart.
- Within each category slot, sample multiple products. This guarantees the batch contains same-leaf pairs automatically (e.g. two marinara sauces from different brands → strong positives).
Quote: "No explicit pair labeling is needed — the catalog structure does the work." The taxonomy provides the pair labels structurally rather than requiring expensive human-labelled positive/negative pairs.
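The recipe above can be sketched in a few lines of Python. This is a minimal illustration, not Instacart's implementation: the function name `sample_hierarchical_batch`, the `children` / `products` dictionaries, the `per_slot` parameter, and the toy catalog are all assumptions made for the example.

```python
import random

# Toy taxonomy (parent -> leaf categories) and catalog (leaf -> product ids).
# All names and ids are illustrative, not Instacart's data.
CHILDREN = {
    "Pasta & Pizza Sauces": ["marinara", "alfredo", "pesto"],
    "Laundry": ["laundry detergent", "fabric softener"],
    "Pet Food": ["dog food", "cat food"],
}
PRODUCTS = {
    leaf: [f"{leaf}-{i}" for i in range(5)]
    for leaves in CHILDREN.values()
    for leaf in leaves
}


def sample_hierarchical_batch(children, products, batch_size=8, per_slot=2, rng=None):
    """Return a list of (product_id, leaf, parent) tuples for one batch.

    Hypothetical helper: half the batch comes from children of one random
    parent, the other half from leaves under other parents, and each
    category slot contributes `per_slot` products.
    """
    rng = rng or random.Random()
    slots_per_half = (batch_size // 2) // per_slot

    def fill(parent, leaf):
        # Sample several products from one leaf -> same-leaf pairs.
        items = products[leaf]
        picked = rng.sample(items, min(per_slot, len(items)))
        return [(pid, leaf, parent) for pid in picked]

    # Step 1: pick a random parent category.
    anchor_parent = rng.choice(sorted(children))
    batch = []

    # Step 2: related half, distinct child leaves of the chosen parent.
    leaves = children[anchor_parent]
    for leaf in rng.sample(leaves, min(slots_per_half, len(leaves))):
        batch += fill(anchor_parent, leaf)

    # Step 3: unrelated half, distinct leaves under any other parent.
    other_leaves = [
        (parent, leaf)
        for parent in sorted(children)
        if parent != anchor_parent
        for leaf in children[parent]
    ]
    for parent, leaf in rng.sample(other_leaves, min(slots_per_half, len(other_leaves))):
        batch += fill(parent, leaf)

    return batch


batch = sample_hierarchical_batch(CHILDREN, PRODUCTS, rng=random.Random(0))
for product_id, leaf, parent in batch:
    print(f"{parent:22s} {leaf:18s} {product_id}")
```

The first tuple in the returned batch always belongs to the chosen parent. Two leaves in the unrelated half may happen to share a parent with each other; the taxonomy still labels that pair correctly as siblings, so no special handling is needed in this sketch.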
Why it works¶
A contrastive loss compares pairs of items within a batch. If a batch contains only unrelated products, every pair is a negative: the loss can push items apart but has nothing to pull together, so it learns no notion of similarity. The hierarchical sampler ensures every batch contains all three pair classes the loss needs:
| Pair class | Source in batch | Loss action |
|---|---|---|
| Same leaf (strong positive) | Multi-product sampling within each category slot | Pull tightly together |
| Sibling leaf, shared parent (moderate positive) | The "children of one parent" half of the batch | Pull moderately together |
| No shared ancestor (negative) | The "unrelated categories" half of the batch | Push apart |
The construction is agnostic to engagement data — it depends only on the taxonomy structure, which is available for every product on day 1, including cold-start products with no purchase history.
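The three pair classes in the table can be derived from taxonomy paths alone, which is what makes explicit pair labelling unnecessary. The sketch below assumes each batch item carries a `(parent, leaf)` tuple; the function name `pair_class` and the class labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_class(a, b):
    """Classify a pair of (parent, leaf) paths into one of three classes."""
    parent_a, leaf_a = a
    parent_b, leaf_b = b
    if parent_a == parent_b and leaf_a == leaf_b:
        return "same_leaf"      # strong positive
    if parent_a == parent_b:
        return "sibling_leaf"   # moderate positive
    return "negative"           # no shared ancestor


# A batch of 8 items: two products from each of four leaves.
paths = (
    [("Pasta & Pizza Sauces", "marinara")] * 2
    + [("Pasta & Pizza Sauces", "alfredo")] * 2
    + [("Laundry", "laundry detergent")] * 2
    + [("Pet Food", "dog food")] * 2
)

# Count how many pairs of each class the batch yields.
counts = Counter(pair_class(a, b) for a, b in combinations(paths, 2))
print(counts)
```

Nothing in this labelling touches engagement data, so it applies unchanged to a product added to the catalog today.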
Generalisation: any structured taxonomy works¶
The pattern generalises to any domain where a hierarchical category tree exists:
- E-commerce / grocery catalogs (Instacart canonical instance).
- Media catalogs (Netflix-style genre hierarchies).
- Library / academic taxonomies (LCC, Dewey).
- Industry codes (NAICS, GICS).
- Knowledge graphs with class hierarchies.
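The deeper the tree, the more granularity available. A flat taxonomy (leaf labels only, with no parent level above them) still gives the same-leaf vs different-leaf signal but loses the sibling-leaf gradient, which needs at least one shared parent level above the leaves.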
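One way to see how tree depth translates into granularity is to measure how many leading taxonomy levels two products share. This is an illustrative sketch only: the source describes three pair classes, and the graded `shared_depth` measure, the three-level paths, and the "Pantry" / "Canned Goods" / "Household" / "Pets" levels below are assumptions made for the example.

```python
def shared_depth(path_a, path_b):
    """Number of leading taxonomy levels two root-to-leaf paths have in common."""
    depth = 0
    for level_a, level_b in zip(path_a, path_b):
        if level_a != level_b:
            break
        depth += 1
    return depth


# Hypothetical three-level paths: (department, category, leaf).
marinara = ("Pantry", "Pasta & Pizza Sauces", "marinara")
pesto = ("Pantry", "Pasta & Pizza Sauces", "pesto")
canned_tomatoes = ("Pantry", "Canned Goods", "canned tomatoes")
dog_food = ("Pets", "Pet Food", "dog food")

for other in (marinara, pesto, canned_tomatoes, dog_food):
    print(other[-1], shared_depth(marinara, other))
```

With three levels, a pair can share zero, one, two, or three levels, so a product such as canned tomatoes can be distinguished from a fully unrelated one such as dog food. With fewer levels, those two cases collapse into one.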
Caveats¶
- Skewed category sizes can bias the sampler. Very large leaf categories (e.g. Frozen Vegetables with thousands of items) will dominate the same-leaf-pair pool; tiny leaves (sparse new categories) contribute few or no same-leaf pairs. Reweighting by category size is a common adjustment; the Instacart post does not say whether one is applied.
- Tree topology matters. A balanced taxonomy gives clean sibling-leaf signal; an unbalanced one (one parent has 50 children, another has 2) makes sibling-pair semantics inconsistent.
- The "unrelated categories" definition is implicit. The post defines negatives as "products with no shared ancestor"; in practice the negatives half of the batch is sampled from arbitrary other top-level categories. Whether very-distant negatives (laundry detergent vs marinara sauce) and slightly-less-distant negatives (canned tomatoes vs marinara sauce, where canned tomatoes are not under "Pasta & Pizza Sauces" but ingredient-related) carry the same gradient signal is open.
- Mixing-ratio hyperparameter. Instacart's recipe uses ~half positive-eligible (children of one parent) and ~half negative (unrelated). The ratio's optimal value is workload-dependent.
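- The parent-selection distribution affects tail coverage. Picking parents uniformly gives every category the same exposure regardless of size, so each product in a large category is seen less often; picking parents in proportion to product count does the opposite and undersamples rare categories. Which distribution to use, and whether to oversample rare categories for tail coverage, is a separate design decision.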
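- Within-category multi-sampling depth (how many products per category slot) controls same-leaf-pair density but trades off against batch diversity: at a fixed batch size, more products per slot means fewer distinct categories in the batch. Not disclosed in the post.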
Seen in¶
- sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale — first canonical wiki disclosure: hierarchical batch sampling for Instacart's RQ-VAE catalog-tree contrastive training. Pick parent → ~half batch from its children → rest from unrelated categories → multi-sample within each category slot. "No explicit pair labeling is needed — the catalog structure does the work."
Related¶
- concepts/contrastive-regularization-with-catalog-structure — the loss this sampler enables.
- concepts/semantic-id — the codebook this trains.
- systems/rq-vae — the algorithm extended.
- systems/instacart-semantic-ids — production instance.
- patterns/contrastive-loss-via-taxonomy-tree — the broader pattern.
- patterns/rq-vae-codebook-as-product-vocabulary — broader pattern this is an ingredient of.