Request-level deduplication¶
Definition¶
Request-level deduplication is the family of techniques that store, transform, train on, and serve request-level data once per request, rather than once per candidate item scored in that request. In a recommendation / search / ads funnel, a single user request fans out into N candidate items (tens to thousands), but the user-request features (user history sequence, user embeddings, context, request metadata) are identical across all N rows.
Without explicit deduplication, those features are duplicated N times — in storage, in data-loading pipelines, in training-batch tensors, and in serving-time forward passes. At Pinterest's scale (Foundation Model with ~16K-token user sequences, hundreds-to-thousands of candidates per request), the duplication dominates storage cost, training throughput, and serving latency.
The technique is cross-cutting: the same redundancy manifests differently at each lifecycle stage, and each stage needs its own dedup mechanism — but the wins compound because storage compression speeds up data pipelines, training speedups feed experimentation velocity, and serving-throughput wins fund the next model-scaling round.
Three lifecycle stages (canonical Pinterest framing)¶
From sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication:
| Stage | Where duplication lives | Dedup mechanism | Pinterest win |
|---|---|---|---|
| Storage | Parquet rows: one [user, item, label] per engagement. User features copied N×. | Sort rows by user ID + request ID in Iceberg → columnar compression absorbs duplicate-value runs. See patterns/sort-by-request-id-for-columnar-compression. | 10–50× compression on user-heavy columns. |
| Training | Data-loader materialises N-row tensors; BatchNorm + in-batch negatives assume IID rows. | Keep data deduplicated through preprocessing + features; expand only on GPU. For ranking: DCAT + SyncBatchNorm. For retrieval: run user tower on unique users + user-level masking. | 4× retrieval speedup, ~2.8× ranking speedup. |
| Serving | Ranker recomputes user history forward pass per candidate. | Two-tower retrieval: dedup by construction. Ranking: DCAT — context pass once + cross-attention from each candidate to cached user-history KV. | 7× ranking serving throughput. |
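The storage-stage row can be illustrated with a toy experiment. Here `zlib` stands in for Parquet's dictionary/RLE encodings (an assumption: real Iceberg sort-order wins depend on the actual column encodings); the point is only that adjacent duplicate values compress far better than interleaved ones.

```python
import random
import zlib

random.seed(0)

# Toy stand-in for a user-feature column: 200 users, each with a
# 64-byte feature blob duplicated across 50 candidate rows.
users = [random.randbytes(64) for _ in range(200)]
rows = [blob for blob in users for _ in range(50)]  # sorted by user ID

shuffled = rows[:]
random.shuffle(shuffled)  # engagement-time order: users interleaved

sorted_size = len(zlib.compress(b"".join(rows)))
shuffled_size = len(zlib.compress(b"".join(shuffled)))

# Adjacent duplicates collapse into cheap back-references; interleaved
# copies often fall outside the compression window. This is the same
# effect Parquet's encodings exploit when rows arrive sorted by
# user ID + request ID.
print(sorted_size, shuffled_size)
```

The exact ratio depends on blob size and fan-out, but the sorted layout is reliably several times smaller.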
Why it works¶
- Candidate-scoring batches have extreme entity reuse. Scoring N candidates for one user = N copies of the user sequence, compressible to 1.
- Per-unit cost of the duplicated entity is high. ~16K-token user sequences are expensive to fetch, transfer, and feed through a transformer — the marginal cost of one extra copy is significant.
- Dedup is transparent to the model. At every stage, model inputs / outputs are preserved; the optimisation is a rewrite of how the same tensor layout is produced.
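The transparency property in the last bullet can be sketched in numpy: run the heavy user tower once per unique user, then expand the outputs back to per-row shape with a gather. The names (`user_tower`, the toy shapes) are illustrative, not Pinterest's implementation; the invariant is that the expanded outputs match the naive per-row computation exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))

def user_tower(u):
    # Stand-in for the expensive user-sequence forward pass.
    return np.tanh(u @ W)

# A batch of 6 candidate rows fanned out from only 2 unique users.
user_ids = np.array([7, 7, 7, 42, 42, 42])
user_feats = rng.standard_normal((2, 8))[np.array([0, 0, 0, 1, 1, 1])]

# Naive path: one tower forward pass per row (N passes).
naive = user_tower(user_feats)

# Dedup path: tower on unique users only, then a cheap index-gather --
# the "keep data deduplicated, expand only at the last moment" discipline.
uniq_ids, first_idx, inverse = np.unique(
    user_ids, return_index=True, return_inverse=True)
dedup_out = user_tower(user_feats[first_idx])  # 2 passes instead of 6
expanded = dedup_out[inverse]

assert np.allclose(naive, expanded)  # model sees identical tensors
```

The gather is the only step that scales with N; the expensive tower scales with unique users.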
Why Pinterest canonicalised it as a discipline¶
The three stages are often treated as separate optimisation problems by different teams (data engineers, ML engineers, serving engineers). Pinterest's framing collapses them into one discipline with one mental model — "the same fundamental redundancy exists at every layer" — enabling:
- Shared data-loader infrastructure across ranking + retrieval.
- Correctness-correction patterns (SyncBatchNorm, user-level masking) that arise naturally once the dedup mental model is in place.
- Infrastructure investments that compound — Iceberg sort-order enables bucket joins + incremental features; training dedup enables larger effective batch sizes; serving dedup funds larger models.
Correctness risks introduced by dedup¶
Deduplication changes the shape of training data in ways that break common ML assumptions:
- IID disruption. Request-sorted batches concentrate around fewer users; BatchNorm statistics fluctuate, slowing convergence (1–2% offline-metric regression on Pinterest ranking models pre-fix).
- In-batch false negatives. In two-tower retrieval with in-batch negatives, another candidate in the same request-sorted batch is now very likely to be a positive for the same user — false-negative rate jumps from ~0% (IID) to ~30% (request-sorted).
These aren't reasons to avoid dedup — they're correction problems with known fixes (SyncBatchNorm, user-level masking) that Pinterest applied and verified recovered baseline quality.
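The user-level masking fix can be sketched in a few lines: in a request-sorted batch, any same-user column other than the row's own positive is excluded from the in-batch softmax rather than treated as a negative. This is a generic sketch of the masking idea, not Pinterest's exact loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# 6 (user, item) rows from a request-sorted batch: 2 users x 3 items each.
user_ids = np.array([7, 7, 7, 42, 42, 42])
u = rng.standard_normal((6, 4))   # user-tower outputs (duplicated rows)
v = rng.standard_normal((6, 4))   # item-tower outputs

logits = u @ v.T  # row i scores its positive (i, i) vs in-batch negatives

# Another item from the SAME user in a request-sorted batch is likely a
# positive, not a negative -- mask it out instead of training against it.
same_user = user_ids[:, None] == user_ids[None, :]
mask = same_user & ~np.eye(6, dtype=bool)
logits[mask] = -np.inf

# Softmax per row now ignores the same-user "false negatives".
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
```

Without the mask, roughly every same-user off-diagonal entry is a potential false negative, which is where the ~30% figure above comes from.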
Applicability¶
Dedup targets that fit the discipline:
- Candidate-scoring workloads where one entity (user / query / session) is shared across many rows in a batch and has heavy features (embedding lookup, sequence transformer, aggregation tree).
- Two-tower retrieval — naturally deduplicable by construction.
- Ranking with user-history attention — needs a specialised architecture like DCAT to break the item-candidate coupling.
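A minimal sketch of the cached-KV idea behind a DCAT-style ranker (toy shapes, single-head attention; the real architecture is not disclosed at this granularity): project the user history to K/V once per request, cache, and let every candidate's query cross-attend to the cached tensors instead of re-running the history pass per candidate.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

history = rng.standard_normal((16, d))     # one user's history (toy length)
candidates = rng.standard_normal((5, d))   # 5 candidate items, one request

def attend(q, K, V):
    # Plain scaled dot-product attention.
    a = np.exp(q @ K.T / np.sqrt(d))
    return (a / a.sum(-1, keepdims=True)) @ V

# Naive ranker: re-projects the user history once per candidate.
naive = np.stack(
    [attend(c @ Wq, history @ Wk, history @ Wv) for c in candidates])

# DCAT-style: project the history K/V ONCE per request, cache, and let
# all candidate queries cross-attend to the cached tensors.
K_cache, V_cache = history @ Wk, history @ Wv
dedup = attend(candidates @ Wq, K_cache, V_cache)

assert np.allclose(naive, dedup)  # identical scores, 1 history pass not 5
```

At ~16K-token histories the cached pass is the dominant cost, which is what the 7× serving-throughput figure reflects.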
Not applicable or low-ROI:
- Single-candidate scoring (fan-in = 1) — no duplication to exploit.
- Tabular CTR models without sequence / embedding components — the per-entity cost is too small to matter.
- Workloads where the entity is not shared across batch rows — one user per query, one query per batch.
Generalisations¶
The same discipline applies whenever a shared, heavy entity is scored against many candidates:
- Per-query in search ranking (query embedding / sequence shared across candidate results).
- Per-session in conversation-aware recsys (session context shared across candidate turns).
- Per-context (location / device / conversation) shared across candidate items.
- Per-job in LLM inference for batch-mode scoring with shared prompt prefix (KV cache — the inference-stack analogue).
Caveats¶
- Dedup wins scale with fan-out: the more candidates per request, the more duplication there is to remove. For small fan-outs the rewiring cost may outweigh the win.
- Dedup correctness corrections are workload-specific. SyncBatchNorm addresses BatchNorm; if the model uses LayerNorm, the IID-disruption mode looks different.
- Pinterest doesn't disclose the unique-user cap for broadcast / cross-attention implementations or the batch-size hyperparameter distributions.
- Serving dedup in ranking requires a custom architecture (DCAT) — not drop-in; contrast to retrieval where two-tower gives it for free.
Seen in¶
- 2026-04-13 Pinterest — Scaling Recommendation Systems with Request-Level Deduplication (sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication) — canonical wiki post: framing as cross-cutting discipline, three-stage framework, IID / false-negative correctness corrections, DCAT ranking serving architecture. Scale: 10–50× storage, 4× retrieval training, ~2.8× ranking training, 7× ranking serving.
- 2026-03-03 Pinterest — Unifying Ads Engagement Modeling (sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces) — serving-time-only predecessor: canonicalised concepts/request-level-embedding-broadcast as the narrow serving-side win; the 2026-04-13 post generalises it to the full lifecycle.
Related¶
- concepts/request-level-embedding-broadcast — the narrow serving-time instantiation.
- patterns/sort-by-request-id-for-columnar-compression — storage-stage instantiation.
- patterns/cached-kv-cross-attention-for-deduplication — ranking-serving instantiation (DCAT).
- patterns/deferred-reduplication-at-gpu — training-pipeline discipline.
- concepts/iid-disruption-from-request-sorted-data — correctness risk.
- concepts/in-batch-negative-false-negative — correctness risk.