Request-level deduplication¶
Definition¶
Request-level deduplication is the family of techniques that store, transform, train on, and serve request-level data once per request, rather than once per candidate item scored in that request. In a recommendation / search / ads funnel, a single user request fans out into N candidate items (tens to thousands), but the user-request features (user history sequence, user embeddings, context, request metadata) are identical across all N rows.
Without explicit deduplication, those features are duplicated N times — in storage, in data-loading pipelines, in training-batch tensors, and in serving-time forward passes. At Pinterest's scale (Foundation Model with ~16K-token user sequences, hundreds-to-thousands of candidates per request), the duplication dominates storage cost, training throughput, and serving latency.
The technique is cross-cutting: the same redundancy manifests differently at each lifecycle stage, and each stage needs its own dedup mechanism — but the wins compound because storage compression speeds up data pipelines, training speedups feed experimentation velocity, and serving-throughput wins fund the next model-scaling round.
Three lifecycle stages (canonical Pinterest framing)¶
From sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication:
| Stage | Where duplication lives | Dedup mechanism | Pinterest win |
|---|---|---|---|
| Storage | Parquet rows: one [user, item, label] per engagement. User features copied N×. | Sort rows by user ID + request ID in Iceberg → columnar compression absorbs duplicate-value runs. See patterns/sort-by-request-id-for-columnar-compression. | 10–50× compression on user-heavy columns. |
| Training | Data-loader materialises N-row tensors; BatchNorm + in-batch negatives assume IID rows. | Keep data deduplicated through preprocessing + features; expand only on GPU. For ranking: DCAT + SyncBatchNorm. For retrieval: run user tower on unique users + user-level masking. | 4× retrieval speedup, ~2.8× ranking speedup. |
| Serving | Ranker recomputes user history forward pass per candidate. | Two-tower retrieval: dedup by construction. Ranking: DCAT — context pass once + cross-attention from each candidate to cached user-history KV. | 7× ranking serving throughput. |
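The storage-stage row can be illustrated with a toy experiment. Here `zlib` stands in for Parquet's dictionary/RLE encodings (an assumption: real Iceberg sort-order wins depend on the actual column encodings); the point is only that adjacent duplicate values compress far better than interleaved ones.

```python
import random
import zlib

random.seed(0)

# Toy stand-in for a user-feature column: 200 users, each with a
# 64-byte feature blob duplicated across 50 candidate rows.
users = [random.randbytes(64) for _ in range(200)]
rows = [blob for blob in users for _ in range(50)]  # sorted by user ID

shuffled = rows[:]
random.shuffle(shuffled)  # engagement-time order: users interleaved

sorted_size = len(zlib.compress(b"".join(rows)))
shuffled_size = len(zlib.compress(b"".join(shuffled)))

# Adjacent duplicates collapse into cheap back-references; interleaved
# copies often fall outside the compression window. This is the same
# effect Parquet's encodings exploit when rows arrive sorted by
# user ID + request ID.
print(sorted_size, shuffled_size)
```

The exact ratio depends on blob size and fan-out, but the sorted layout is reliably several times smaller.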
Why it works¶
- Candidate-scoring batches have extreme entity reuse. Scoring N candidates for one user = N copies of the user sequence, compressible to 1.
- Per-unit cost of the duplicated entity is high. ~16K-token user sequences are expensive to fetch, transfer, and feed through a transformer — the marginal cost of one extra copy is significant.
- Dedup is transparent to the model. At every stage, model inputs / outputs are preserved; the optimisation is a rewrite of how the same tensor layout is produced.
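The transparency property in the last bullet can be sketched in numpy: run the heavy user tower once per unique user, then expand the outputs back to per-row shape with a gather. The names (`user_tower`, the toy shapes) are illustrative, not Pinterest's implementation; the invariant is that the expanded outputs match the naive per-row computation exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))

def user_tower(u):
    # Stand-in for the expensive user-sequence forward pass.
    return np.tanh(u @ W)

# A batch of 6 candidate rows fanned out from only 2 unique users.
user_ids = np.array([7, 7, 7, 42, 42, 42])
user_feats = rng.standard_normal((2, 8))[np.array([0, 0, 0, 1, 1, 1])]

# Naive path: one tower forward pass per row (N passes).
naive = user_tower(user_feats)

# Dedup path: tower on unique users only, then a cheap index-gather --
# the "keep data deduplicated, expand only at the last moment" discipline.
uniq_ids, first_idx, inverse = np.unique(
    user_ids, return_index=True, return_inverse=True)
dedup_out = user_tower(user_feats[first_idx])  # 2 passes instead of 6
expanded = dedup_out[inverse]

assert np.allclose(naive, expanded)  # model sees identical tensors
```

The gather is the only step that scales with N; the expensive tower scales with unique users.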
Why Pinterest canonicalised it as a discipline¶
The three stages are often treated as separate optimisation problems by different teams (data engineers, ML engineers, serving engineers). Pinterest's framing collapses them into one discipline with one mental model — "the same fundamental redundancy exists at every layer" — enabling:
- Shared data-loader infrastructure across ranking + retrieval.
- Correctness-correction patterns (SyncBatchNorm, user-level masking) that arise naturally once the dedup mental model is in place.
- Infrastructure investments that compound — Iceberg sort-order enables bucket joins + incremental features; training dedup enables larger effective batch sizes; serving dedup funds larger models.
Correctness risks introduced by dedup¶
Deduplication changes the shape of training data in ways that break common ML assumptions:
- IID disruption. Request-sorted batches concentrate around fewer users; BatchNorm statistics fluctuate, slowing convergence (1–2% offline-metric regression on Pinterest ranking models pre-fix).
- In-batch false negatives. In two-tower retrieval with in-batch negatives, another candidate in the same request-sorted batch is now very likely to be a positive for the same user — false-negative rate jumps from ~0% (IID) to ~30% (request-sorted).
These aren't reasons to avoid dedup — they're correction problems with known fixes (SyncBatchNorm, user-level masking) that Pinterest applied and verified recovered baseline quality.
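The user-level masking fix can be sketched in a few lines: in a request-sorted batch, any same-user column other than the row's own positive is excluded from the in-batch softmax rather than treated as a negative. This is a generic sketch of the masking idea, not Pinterest's exact loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# 6 (user, item) rows from a request-sorted batch: 2 users x 3 items each.
user_ids = np.array([7, 7, 7, 42, 42, 42])
u = rng.standard_normal((6, 4))   # user-tower outputs (duplicated rows)
v = rng.standard_normal((6, 4))   # item-tower outputs

logits = u @ v.T  # row i scores its positive (i, i) vs in-batch negatives

# Another item from the SAME user in a request-sorted batch is likely a
# positive, not a negative -- mask it out instead of training against it.
same_user = user_ids[:, None] == user_ids[None, :]
mask = same_user & ~np.eye(6, dtype=bool)
logits[mask] = -np.inf

# Softmax per row now ignores the same-user "false negatives".
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
```

Without the mask, roughly every same-user off-diagonal entry is a potential false negative, which is where the ~30% figure above comes from.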
Applicability¶
Dedup targets that fit the discipline:
- Candidate-scoring workloads where one entity (user / query / session) is shared across many rows in a batch and has heavy features (embedding lookup, sequence transformer, aggregation tree).
- Two-tower retrieval — naturally deduplicable by construction.
- Ranking with user-history attention — needs a specialised architecture like DCAT to break the item-candidate coupling.
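A minimal sketch of the cached-KV idea behind a DCAT-style ranker (toy shapes, single-head attention; the real architecture is not disclosed at this granularity): project the user history to K/V once per request, cache, and let every candidate's query cross-attend to the cached tensors instead of re-running the history pass per candidate.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

history = rng.standard_normal((16, d))     # one user's history (toy length)
candidates = rng.standard_normal((5, d))   # 5 candidate items, one request

def attend(q, K, V):
    # Plain scaled dot-product attention.
    a = np.exp(q @ K.T / np.sqrt(d))
    return (a / a.sum(-1, keepdims=True)) @ V

# Naive ranker: re-projects the user history once per candidate.
naive = np.stack(
    [attend(c @ Wq, history @ Wk, history @ Wv) for c in candidates])

# DCAT-style: project the history K/V ONCE per request, cache, and let
# all candidate queries cross-attend to the cached tensors.
K_cache, V_cache = history @ Wk, history @ Wv
dedup = attend(candidates @ Wq, K_cache, V_cache)

assert np.allclose(naive, dedup)  # identical scores, 1 history pass not 5
```

At ~16K-token histories the cached pass is the dominant cost, which is what the 7× serving-throughput figure reflects.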
Not applicable or low-ROI:
- Single-candidate scoring (fan-in = 1) — no duplication to exploit.
- Tabular CTR models without sequence / embedding components — the per-entity cost is too small to matter.
- Workloads where the entity is not shared across batch rows — one user per query, one query per batch.
Generalisations¶
The same discipline applies whenever a shared, heavy entity is scored against many candidates:
- Per-query in search ranking (query embedding / sequence shared across candidate results).
- Per-session in conversation-aware recsys (session context shared across candidate turns).
- Per-context (location / device / conversation) shared across candidate items.
- Per-job in LLM inference for batch-mode scoring with shared prompt prefix (KV cache — the inference-stack analogue).
Caveats¶
- Dedup wins scale with fan-out: the more candidates per request, the more duplication there is to remove. For small fan-outs the rewiring cost may outweigh the win.
- Dedup correctness corrections are workload-specific. SyncBatchNorm addresses BatchNorm; if the model uses LayerNorm, the IID-disruption mode looks different.
- Pinterest doesn't disclose the unique-user cap for broadcast / cross-attention implementations or the batch-size hyperparameter distributions.
- Serving dedup in ranking requires a custom architecture (DCAT) — not drop-in; contrast to retrieval where two-tower gives it for free.
Seen in¶
- 2026-04-13 Pinterest — Scaling Recommendation Systems with Request-Level Deduplication (sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication) — canonical wiki post: framing as cross-cutting discipline, three-stage framework, IID / false-negative correctness corrections, DCAT ranking serving architecture. Scale: 10–50× storage, 4× retrieval training, ~2.8× ranking training, 7× ranking serving.
- 2026-03-03 Pinterest — Unifying Ads Engagement Modeling (sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces) — serving-time-only predecessor: canonicalised concepts/request-level-embedding-broadcast as the narrow serving-side win; the 2026-04-13 post generalises it to the full lifecycle.
Related¶
- concepts/request-level-embedding-broadcast — the narrow serving-time instantiation.
- patterns/sort-by-request-id-for-columnar-compression — storage-stage instantiation.
- patterns/cached-kv-cross-attention-for-deduplication — ranking-serving instantiation (DCAT).
- patterns/deferred-reduplication-at-gpu — training-pipeline discipline.
- concepts/iid-disruption-from-request-sorted-data — correctness risk.
- concepts/in-batch-negative-false-negative — correctness risk.