
IID disruption from request-sorted data

Definition

IID disruption is the failure mode that appears when training-dataset rows are no longer independent and identically distributed across a batch, typically because the dataset has been sorted by a high-cardinality entity (user, session, request, document) to enable downstream wins like columnar compression or bucket joins.

With IID sampling, each batch contains engagements spread across many users, yielding stable and representative statistics. With request-sorted data, batches become concentrated around fewer users — each gradient update is computed from a less representative slice of the data, and the model sees a noisier, more biased view of the training distribution that slows convergence and degrades final quality.
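The concentration effect is easy to reproduce with synthetic data. A minimal sketch (user count, per-user bias, and batch size are invented for illustration, not Pinterest's setup): rows that share a user share a feature bias, so batching the user-sorted log makes per-batch means swing as widely as the per-user biases themselves, while shuffled batches stay tightly clustered around the global mean.

```python
import random
import statistics

random.seed(0)

# Synthetic engagement log: 100 users, each with their own feature bias,
# so rows from the same user are correlated.
rows = []
for user in range(100):
    user_bias = random.gauss(0.0, 1.0)
    for _ in range(50):
        rows.append((user, user_bias + random.gauss(0.0, 0.1)))

def batch_means(data, batch_size=50):
    # Mean of the feature value within each consecutive batch.
    return [statistics.mean(x for _, x in data[i:i + batch_size])
            for i in range(0, len(data), batch_size)]

# IID: shuffle before batching. Request/user-sorted: rows grouped by user.
shuffled = rows[:]
random.shuffle(shuffled)
sorted_rows = sorted(rows, key=lambda r: r[0])

iid_spread = statistics.stdev(batch_means(shuffled))
sorted_spread = statistics.stdev(batch_means(sorted_rows))

print(f"batch-mean stdev, IID:    {iid_spread:.3f}")
print(f"batch-mean stdev, sorted: {sorted_spread:.3f}")
assert sorted_spread > 3 * iid_spread  # sorted batches fluctuate far more
```

With these numbers each sorted batch is a single user, so batch means inherit the full spread of the per-user biases; the shuffled batch means shrink by roughly the usual 1/√batch-size factor.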

Pinterest's canonical datum: "1–2% regressions on key offline evaluation metrics in our ranking models" appeared when they first switched to request-sorted training data, before applying correctness fixes (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication).

Two concrete failure modes

IID disruption manifests as two different problems that need different fixes, depending on what the model relies on:

1. BatchNorm statistics collapse

Standard BatchNorm normalises intermediate activations using the mean and variance computed across the local batch on each device. When request-sorted batches are dominated by a single power user, the batch-level statistics fluctuate dramatically — "a batch dominated by a single power user will have dramatically different statistics than one with a casual browser." The running statistics drift, and the model struggles to converge.

Fix: SyncBatchNorm — aggregate statistics across all devices before normalising, which grows the "statistical batch size" to the union of all per-device batches (still far larger than any single user's slice).
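A sketch of the arithmetic SyncBatchNorm performs, under the illustrative assumption that each device's local batch has collapsed onto one user: all-reduce the per-device sums, sums of squares, and counts, then recover the pooled mean and variance (the device batches below are invented values).

```python
import statistics

# Per-device local batches: each device happens to hold one user's rows
# (the request-sorted worst case).
device_batches = [
    [5.0, 5.1, 4.9, 5.2],   # power user on device 0
    [0.1, 0.0, 0.2, 0.1],   # casual browser on device 1
    [2.0, 2.1, 1.9, 2.0],   # a third user on device 2
]

# What SyncBatchNorm effectively all-reduces: per-device count, sum, and
# sum of squares, from which the global mean/variance are recovered.
total_n  = sum(len(b) for b in device_batches)
total_s  = sum(sum(b) for b in device_batches)
total_ss = sum(sum(x * x for x in b) for b in device_batches)

global_mean = total_s / total_n
global_var  = total_ss / total_n - global_mean ** 2

# The cross-device statistics match normalising over the pooled batch...
pooled = [x for b in device_batches for x in b]
assert abs(global_mean - statistics.fmean(pooled)) < 1e-9
assert abs(global_var - statistics.pvariance(pooled)) < 1e-9

# ...while the per-device local means are wildly different from each other.
local_means = [statistics.fmean(b) for b in device_batches]
print(local_means, global_mean)
```

In PyTorch, `torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)` swaps BatchNorm layers for synchronised ones that perform this aggregation over the process group.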

2. In-batch false negatives (retrieval only)

In two-tower retrieval training with in-batch negatives, the other candidates in the same batch serve as "negatives." When batches are request-sorted, many of those in-batch "negatives" are actually positives for the same user — the false-negative rate jumps from ~0% (IID) to ~30% (request-sorted). Training the model to push apart items the user actually engaged with actively degrades retrieval quality.
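The rate is mechanical to measure on synthetic data. The user count, items per user, and batch size below are invented, so the resulting percentages illustrate the jump rather than reproduce Pinterest's ~30%:

```python
import random

random.seed(1)

# Synthetic engagements: (user_id, item_id). Each user engaged several items,
# so two rows from the same user make each other a false in-batch negative.
rows = [(u, f"item-{u}-{i}") for u in range(200) for i in range(8)]

def false_negative_rate(data, batch_size=16):
    """Fraction of in-batch negative pairs that share the anchor's user."""
    fn = total = 0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        for a, (ua, _) in enumerate(batch):
            for b, (ub, _) in enumerate(batch):
                if a == b:
                    continue
                total += 1
                fn += (ua == ub)
    return fn / total

shuffled = rows[:]
random.shuffle(shuffled)
sorted_rows = sorted(rows)  # request/user-sorted order

print(f"IID false-negative rate:    {false_negative_rate(shuffled):.1%}")
print(f"sorted false-negative rate: {false_negative_rate(sorted_rows):.1%}")
```

With these parameters each sorted batch holds two complete users, so nearly half of every anchor's "negatives" are that anchor's own positives; in the shuffled order a same-user collision is a fraction of a percent.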

Fix: User-level masking — extend identity masking to exclude candidates whose associated user equals the anchor user. See concepts/in-batch-negative-false-negative.
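A minimal sketch of the mask itself (hypothetical user IDs; a real system would build this from the batch's user-ID tensor): extend the diagonal identity mask so that any candidate sharing the anchor's user is also excluded from the negative set.

```python
# User-level masking: in addition to masking the diagonal (the anchor's own
# positive), mask any in-batch candidate whose user matches the anchor's.
batch_users = ["u1", "u1", "u2", "u3", "u1"]  # hypothetical user per row
n = len(batch_users)

# valid_negative[a][b] is True when row b may serve as a negative for anchor a.
valid_negative = [
    [a != b and batch_users[a] != batch_users[b] for b in range(n)]
    for a in range(n)
]

# Anchor 0 (user u1) must not treat rows 1 and 4 (also u1) as negatives.
assert valid_negative[0] == [False, False, True, True, False]

# In a softmax-based contrastive loss, the masked positions would get
# logit -inf so they contribute nothing to the partition function.
```

Note the caveat below: the more rows a user contributes to a batch, the more this mask shrinks each anchor's effective negative pool.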

Why it's a wiki-worthy concept

IID disruption is the named failure mode that explains why sort-order as a storage optimisation has hidden training-time costs. Before Pinterest canonicalised it, practitioners treated "training on sorted data regressed" as a local puzzle. The diagnostic framing separates:

  • The cause: batch-level statistics + in-batch sampling assume IID rows.
  • The trigger: any dataset reorder that concentrates rows around a high-cardinality entity.
  • The fix surface: cross-device statistics + explicit masking of correlated rows.

It generalises beyond request-sorted data:

  • Time-sorted batches — session / hour / day-sorted rows concentrate on one temporal cohort.
  • Locality-sorted batches — geography-sorted rows concentrate on one region.
  • Hash-partitioned batches — any partition scheme that groups related rows into the same mini-batch.

Relationship to the IID assumption

The classical ML assumption is that training rows are drawn IID from the data distribution. Stochastic gradient descent, BatchNorm, contrastive learning, and random negative sampling all rely on this — the batch is a representative sample of the distribution.

Dataset reorderings that group related rows (for storage / compression / locality wins) break the assumption without changing the row contents. The data is still distribution-representative in aggregate; it's just no longer IID within any given batch window.

The tradeoff

Dataset sorting is adopted for non-training reasons:

  • Columnar compression: 10–50× storage reduction (patterns/sort-by-request-id-for-columnar-compression).
  • Bucket joins: eliminate shuffle on matching keys.
  • Incremental feature engineering: append columns to existing row groups.
  • User-level sampling: stratified sampling by user is cheap once sorted.

Each of these wins is material; dropping them to preserve IID is the wrong trade. The correct response is to restore the IID-equivalent behaviour through the correctness corrections (SyncBatchNorm, user-level masking) while keeping the storage + data-engineering wins.

Caveats

  • Scale-dependent: Pinterest's 1–2% regression is for their ranking models with BatchNorm. LayerNorm-only architectures may not exhibit the batch-statistics mode.
  • Unique-users-per-batch dependence: the false-negative rate "depends on the number of unique users per batch" — if batches intentionally mix many users, the rate stays low without masking.
  • SyncBatchNorm cost: communication overhead across devices exists but Pinterest reports it as "negligible compared to the training speedups" — not universally true for tiny batches or weak interconnect.
  • Masking can shrink the effective negative pool: aggressive user-level masking in retrieval training reduces the set of valid negatives per anchor; may need batch-size bumps to compensate.

Seen in

  • 2026-04-13 Pinterest — Scaling Recommendation Systems with Request-Level Deduplication (sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication) — canonical wiki instance: 1–2% ranking regression diagnosis via BatchNorm statistics + SyncBatchNorm fix; false-negative rate ~0% → ~30% in retrieval + user-level masking fix. Pinterest's explicit quote: "With IID sampling, each batch contains engagements spread across many users, yielding stable and representative statistics. With request-sorted data, batches become concentrated around fewer users, causing batch-level statistics to fluctuate dramatically based on individual user behavior."