
In-batch negative false-negative

Definition

In contrastive two-tower retrieval training, the in-batch negative sampling trick treats the other candidates in the same training batch as negative examples for each anchor user-positive pair, which avoids the cost of sampling random negatives from a huge catalog.

A false negative is an "in-batch negative" item that is actually a positive for the anchor user (the user did engage with it, it's just listed as a different row's positive). Training the model to push the anchor user's embedding away from that item actively degrades retrieval quality because the model learns to avoid items the user actually engaged with.

With IID-sampled batches, the false-negative rate is near zero: users engage with a tiny fraction of the total item corpus, so the probability that a random in-batch item is also a positive for the anchor is negligible.

With request-sorted batches (see concepts/iid-disruption-from-request-sorted-data), batches concentrate around fewer users and "each user may have dozens or hundreds of engagements grouped together." The false-negative rate jumps from ~0% to ~30% on Pinterest workloads, depending on the number of unique users per batch (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication).

The mechanism

Standard InfoNCE loss with logit correction (Yi et al., 2019) uses the similarity function s(x, y) (dot product between user embedding x and item embedding y):

L_i = -log[ exp(s(x_i, y_i) - log p_y_i)
           / sum_k exp(s(x_i, y_k) - log p_y_k) ]

where (x_i, y_i) is the anchor user-positive pair and {y_k} are candidates in batch B.
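The corrected loss above can be sketched numerically. This is a minimal numpy illustration, not Pinterest's implementation; the function name and shapes are assumptions:

```python
import numpy as np

def infonce_loss(user_emb, item_emb, log_item_prob):
    """In-batch InfoNCE with logit correction (sketch).

    user_emb:      (B, d) anchor user embeddings x_i
    item_emb:      (B, d) item embeddings y_k; row i is anchor i's positive
    log_item_prob: (B,)   log p_y_k, the sampling probability of each item
    """
    # s(x_i, y_k) for every anchor/candidate pair, corrected by -log p_y_k
    logits = user_emb @ item_emb.T - log_item_prob[None, :]          # (B, B)
    # log-softmax over candidates; the diagonal holds each anchor's positive
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Every in-batch row serves as a candidate y_k for every anchor x_i; the diagonal entries are the positives and everything off-diagonal is (implicitly) a negative.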

Under IID:

batch = [ (user_A, item_1 engaged),
          (user_B, item_2 engaged),
          (user_C, item_3 engaged),
          (user_D, item_4 engaged) ]

For anchor (user_A, item_1):
  positive = item_1
  negatives = { item_2, item_3, item_4 }  ← all items user_A did NOT engage with
  false-negative rate: ~0%

Under request-sorted:

batch = [ (user_A, item_1),
          (user_A, item_2),   ← user_A DID engage with item_2
          (user_A, item_3),   ← user_A DID engage with item_3
          (user_B, item_4) ]

For anchor (user_A, item_1):
  positive = item_1
  negatives = { item_2, item_3, item_4 }
              └────┬────┘
              FALSE NEGATIVES: user_A engaged with these
  false-negative rate: 2/3 ≈ 67% for this anchor
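The two toy batches above can be checked with a few lines of Python (a sketch; the helper name is made up for illustration):

```python
def false_negative_rate(users, anchor_idx):
    """Fraction of in-batch negatives that share the anchor's user.

    users: list of user ids, one per (user, item) row in the batch.
    Negatives for anchor i are all rows k != i (standard identity masking).
    """
    anchor = users[anchor_idx]
    negatives = [u for k, u in enumerate(users) if k != anchor_idx]
    return sum(u == anchor for u in negatives) / len(negatives)

# IID batch: four distinct users
false_negative_rate(["A", "B", "C", "D"], 0)   # -> 0.0

# Request-sorted batch: user_A appears three times
false_negative_rate(["A", "A", "A", "B"], 0)   # -> 2/3
```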

Pinterest's measured result: "The false negative rate jumps from ~0% with IID sampling to as high as ~30% with request-sorted data."

The fix — user-level masking

Extend identity masking (already used to exclude the anchor's own positive y_i from negatives) to exclude any candidate whose user equals the anchor's user:

L_i = -log[ exp(s(x_i, y_i) - log p_y_i)
           / sum_{k : x_k ≠ x_i} exp(s(x_i, y_k) - log p_y_k) ]

The x_k ≠ x_i constraint means: only candidates from different users count as valid negatives. Canonical pattern: patterns/user-level-negative-masking-infonce.
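In code, the masked denominator amounts to setting same-user (off-diagonal) logits to -inf before the softmax. A minimal numpy sketch, assuming the same shapes as before (illustrative names, not a reference implementation):

```python
import numpy as np

def masked_infonce_loss(user_emb, item_emb, user_ids, log_item_prob):
    """InfoNCE with user-level negative masking (sketch).

    Candidates whose user equals the anchor's user are excluded from the
    denominator, except the anchor's own positive on the diagonal.
    """
    B = user_emb.shape[0]
    logits = user_emb @ item_emb.T - log_item_prob[None, :]          # (B, B)
    same_user = user_ids[:, None] == user_ids[None, :]               # (B, B)
    # keep the diagonal (each anchor's own positive); mask other same-user rows
    mask = same_user & ~np.eye(B, dtype=bool)
    logits = np.where(mask, -np.inf, logits)                # exp(-inf) == 0
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Because exp(-inf) is 0, masked candidates contribute nothing to the denominator, which is exactly the sum over {k : x_k ≠ x_i} in the corrected loss.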

Why the wiki canonicalises it

The false-negative problem is an under-appreciated cost of dataset reorderings (sorting, locality-grouping) done for non-training reasons. Naming it makes the diagnostic conversation concrete:

  • Diagnosis: "What's the false-negative rate in your request-sorted training batches?"
  • Fix surface: identity masking in InfoNCE → extend to same-user / same-session / same-context masking.
  • Trade-off: each masked candidate shrinks the effective negative pool — may need larger batches to compensate.

Generalisations

  • Same-query in search: if batches group by query, other candidates for the same query are false negatives for the anchor.
  • Same-session: session-sorted batches suffer the same pathology.
  • Same-context (time window, geography): any grouping that aligns with the positive-signal structure.
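All three generalisations reduce to masking on an arbitrary grouping key instead of the user id. A hedged sketch of that mask builder (names are illustrative):

```python
import numpy as np

def negative_mask(keys):
    """Boolean mask of INVALID in-batch negatives (True = exclude).

    keys: (B,) array of group ids -- user, query, session, or any context
    that aligns with the positive-signal structure. The diagonal (each
    anchor's own positive) stays unmasked: it is the positive, not a negative.
    """
    keys = np.asarray(keys)
    same_group = keys[:, None] == keys[None, :]              # (B, B)
    return same_group & ~np.eye(len(keys), dtype=bool)
```

Swapping the key column (user id, query id, session id) is the only change needed per generalisation; the loss code stays identical.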

Caveats

  • Rate is workload-dependent: ~30% is Pinterest-specific; depends on unique-users-per-batch, user activity distribution, item corpus size.
  • Pool shrink: aggressive masking reduces valid negatives per anchor; may degrade gradient quality if batches are small.
  • Not a BatchNorm problem: this is the retrieval-specific half of IID disruption — ranking models have a different failure mode (BatchNorm statistics, fixed by SyncBatchNorm).
