
CONCEPT Cited by 2 sources

Request-level deduplication

Definition

Request-level deduplication is the family of techniques that store, transform, train on, and serve request-level data once per request, rather than once per candidate item scored in that request. In a recommendation / search / ads funnel, a single user request fans out into N candidate items (tens to thousands), but the user-request features (user history sequence, user embeddings, context, request metadata) are identical across all N rows.

Without explicit deduplication, those features are duplicated N times — in storage, in data-loading pipelines, in training-batch tensors, and in serving-time forward passes. At Pinterest's scale (Foundation Model with ~16K-token user sequences, hundreds-to-thousands of candidates per request), the duplication dominates storage cost, training throughput, and serving latency.

The technique is cross-cutting: the same redundancy manifests differently at each lifecycle stage, and each stage needs its own dedup mechanism — but the wins compound because storage compression speeds up data pipelines, training speedups feed experimentation velocity, and serving-throughput wins fund the next model-scaling round.

Three lifecycle stages (canonical Pinterest framing)

From sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication:

Storage
  • Where duplication lives: Parquet rows — one [user, item, label] per engagement; user features copied N×.
  • Dedup mechanism: sort rows by user ID + request ID in Iceberg → columnar compression absorbs duplicate-value runs. See patterns/sort-by-request-id-for-columnar-compression.
  • Pinterest win: 10–50× compression on user-heavy columns.

Training
  • Where duplication lives: data loader materialises N-row tensors; BatchNorm + in-batch negatives assume IID rows.
  • Dedup mechanism: keep data deduplicated through preprocessing + features; expand only on GPU. For ranking: DCAT + SyncBatchNorm. For retrieval: run the user tower on unique users + user-level masking.
  • Pinterest win: 4× retrieval speedup, ~2.8× ranking speedup.

Serving
  • Where duplication lives: ranker recomputes the user-history forward pass per candidate.
  • Dedup mechanism: two-tower retrieval deduplicates by construction; ranking uses DCAT — context pass once + cross-attention from each candidate to cached user-history KV.
  • Pinterest win: 7× ranking serving throughput.
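The storage-stage effect can be demonstrated with a toy experiment: when rows are grouped by request ID, the duplicated user-feature values sit in contiguous runs that a compressor absorbs almost entirely, whereas shuffled rows leave most copies outside the compressor's reach. Here zlib stands in for Parquet/Iceberg's columnar encodings, and the request counts, fan-out, and blob sizes are illustrative, not Pinterest's actual layout:

```python
import random
import zlib

random.seed(0)

N_REQUESTS, FANOUT, BLOB = 1000, 20, 200   # illustrative sizes

# One "user feature" blob per request; every candidate row carries a copy.
user_blob = {r: random.randbytes(BLOB) for r in range(N_REQUESTS)}
rows = [(r, c, user_blob[r]) for r in range(N_REQUESTS) for c in range(FANOUT)]

def column_compressed_size(rows):
    # Serialise just the user-feature column, as a columnar format would.
    return len(zlib.compress(b"".join(blob for _, _, blob in rows)))

shuffled = rows[:]
random.shuffle(shuffled)

sorted_size = column_compressed_size(rows)        # rows grouped by request ID
shuffled_size = column_compressed_size(shuffled)  # random row order

ratio = shuffled_size / sorted_size
print(f"sorted: {sorted_size}B  shuffled: {shuffled_size}B  ({ratio:.1f}x)")
```

The gap widens with fan-out and feature size, which is why the win concentrates on user-heavy columns.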

Why it works

  • Candidate-scoring batches have extreme entity reuse. Scoring N candidates for one user = N copies of the user sequence, compressible to 1.
  • Per-unit cost of the duplicated entity is high. ~16K-token user sequences are expensive to fetch, transfer, and feed through a transformer — the marginal cost of one extra copy is significant.
  • Dedup is transparent to the model. At every stage, model inputs / outputs are preserved; the optimisation is a rewrite of how the same tensor layout is produced.
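The transparency point can be made concrete: the pipeline carries one user-feature row per unique request, and the per-candidate layout is recovered with a gather just before (or inside) the model, so the tensor the model sees is identical to the naive duplicated batch. A minimal NumPy sketch, with made-up shapes and a stand-in user_tower:

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of 8 candidate rows fanning out from 3 requests.
request_ids = np.array([7, 7, 7, 42, 42, 13, 13, 13])
user_features = {7: rng.normal(size=16), 42: rng.normal(size=16),
                 13: rng.normal(size=16)}

def user_tower(x):
    # Stand-in for an expensive user-sequence encoder.
    return np.tanh(x) * 2.0

# Naive path: materialise one user row per candidate, encode N times.
naive_input = np.stack([user_features[r] for r in request_ids])
naive_out = user_tower(naive_input)                 # 8 forward rows

# Dedup path: encode each unique user once, then gather per candidate.
uniq, inverse = np.unique(request_ids, return_inverse=True)
dedup_input = np.stack([user_features[r] for r in uniq])
dedup_out = user_tower(dedup_input)[inverse]        # 3 forward rows, expanded

assert np.allclose(naive_out, dedup_out)  # identical model inputs/outputs
```

The expensive pass runs 3 times instead of 8; the gather is what keeps the optimisation invisible to the model.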

Why Pinterest canonicalised it as a discipline

The three stages are often treated as separate optimisation problems by different teams (data engineers, ML engineers, serving engineers). Pinterest's framing collapses them into a single discipline with one mental model ("the same fundamental redundancy exists at every layer"), enabling:

  • Shared data-loader infrastructure across ranking + retrieval.
  • Correctness-correction patterns (SyncBatchNorm, user-level masking) that arise naturally once the dedup mental model is in place.
  • Infrastructure investments that compound — Iceberg sort-order enables bucket joins + incremental features; training dedup enables larger effective batch sizes; serving dedup funds larger models.

Correctness risks introduced by dedup

Deduplication changes the shape of training data in ways that break common ML assumptions:

  • IID disruption. Request-sorted batches concentrate around fewer users; BatchNorm statistics fluctuate, slowing convergence (1–2% offline-metric regression on Pinterest ranking models pre-fix).
  • In-batch false negatives. In two-tower retrieval with in-batch negatives, another candidate in the same request-sorted batch is now very likely to be a positive for the same user — false-negative rate jumps from ~0% (IID) to ~30% (request-sorted).

These aren't reasons to avoid dedup — they're correction problems with known fixes (SyncBatchNorm, user-level masking), which Pinterest applied and verified recover baseline quality.
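User-level masking is simple to sketch: in the in-batch softmax, forbid every negative that shares the anchor's user/request, rather than masking only the diagonal. A NumPy illustration with toy shapes (the embedding sizes and loss form are assumptions, not Pinterest's exact recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

# Request-sorted batch of 6 items: rows 0-2 share user 0, rows 3-5 share
# user 1 — so off-diagonal "negatives" from the same user are likely
# false negatives.
user_ids = np.array([0, 0, 0, 1, 1, 1])
user_emb = rng.normal(size=(6, 8))
item_emb = rng.normal(size=(6, 8))

logits = user_emb @ item_emb.T                       # (6, 6) similarities

# User-level mask: drop any negative sharing the anchor's user; keep the
# diagonal positive.
mask = (user_ids[:, None] == user_ids[None, :]) & ~np.eye(6, dtype=bool)
masked = np.where(mask, -np.inf, logits)

# Softmax cross-entropy against the diagonal positives (numerically stable).
stable = masked - masked.max(axis=1, keepdims=True)
log_probs = stable - np.log(np.exp(stable).sum(axis=1, keepdims=True))
loss = -np.diagonal(log_probs).mean()
```

Each anchor now competes only against the diagonal positive and the other user's items, so same-request positives can never be punished as negatives.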

Applicability

Dedup targets that fit the discipline:

  • Candidate-scoring workloads where one entity (user / query / session) is shared across many rows in a batch and has heavy features (embedding lookup, sequence transformer, aggregation tree).
  • Two-tower retrieval — naturally deduplicable by construction.
  • Ranking with user-history attention — needs a specialised architecture like DCAT to break the item-candidate coupling.

Not applicable or low-ROI:

  • Single-candidate scoring (fan-in = 1) — no duplication to exploit.
  • Tabular CTR models without sequence / embedding components — the per-entity cost is too small to matter.
  • Workloads where the entity is not shared across batch rows — one user per query, one query per batch.
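DCAT's internals aren't reproduced here, but the shape of the serving rewiring — run the heavy user-history pass once per request, cache its keys/values, and let each candidate cross-attend into that cache — can be sketched with plain NumPy (all dimensions, projections, and the scalar "score" are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                       # model width (illustrative)

user_history = rng.normal(size=(128, d))     # one long user sequence
candidates = rng.normal(size=(50, d))        # 50 candidates, same request

Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

# Context pass, once per request: project the history to K/V and cache it.
K_cache = user_history @ Wk
V_cache = user_history @ Wv

def score_candidate(x):
    # Per-candidate work: one query cross-attending into the cached K/V,
    # instead of re-running the 128-step history pass per candidate.
    q = x @ Wq
    logits = K_cache @ q / np.sqrt(d)
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()
    return float(attn @ V_cache @ np.ones(d))  # stand-in scalar score

scores = np.array([score_candidate(c) for c in candidates])
```

The heavy 128-step pass happens once; each of the 50 candidates pays only for a single cross-attention query.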

Generalisations

The same discipline applies whenever a shared, heavy entity is scored against many candidates:

  • Per-query in search ranking (query embedding / sequence shared across candidate results).
  • Per-session in conversation-aware recsys (session context shared across candidate turns).
  • Per-context (location / device / conversation) shared across candidate items.
  • Per-job in LLM inference for batch-mode scoring with shared prompt prefix (KV cache — the inference-stack analogue).
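The KV-cache analogue in the last bullet reduces, in miniature, to memoising the shared-prefix computation once per job. A toy Python sketch — the encode step and scoring below are placeholders, not a real inference stack:

```python
calls = {"prefix_encodes": 0}

def encode_prefix(prefix: str) -> int:
    # Stand-in for the expensive forward pass that fills the prompt's KV cache.
    calls["prefix_encodes"] += 1
    return hash(prefix)

_prefix_cache: dict[str, int] = {}

def score(prefix: str, candidate: str) -> int:
    # Reuse the cached prefix state across every candidate sharing the prompt.
    if prefix not in _prefix_cache:
        _prefix_cache[prefix] = encode_prefix(prefix)
    return (_prefix_cache[prefix] ^ hash(candidate)) % 1000  # toy per-item work

shared_prompt = "Rate the relevance of the following item:"
results = [score(shared_prompt, f"item-{i}") for i in range(1000)]
assert calls["prefix_encodes"] == 1   # prefix encoded once, not 1000 times
```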

Caveats

  • Dedup magnitude compounds non-linearly with fan-out. For small fan-outs the rewiring cost may outweigh the win.
  • Dedup correctness corrections are workload-specific. SyncBatchNorm addresses BatchNorm; if the model uses LayerNorm, the IID-disruption mode looks different.
  • Pinterest doesn't disclose the unique-user cap for broadcast / cross-attention implementations or the batch-size hyperparameter distributions.
  • Serving dedup in ranking requires a custom architecture (DCAT) — not drop-in; contrast to retrieval where two-tower gives it for free.
