Pinterest — Scaling Recommendation Systems with Request-Level Deduplication
Summary
Pinterest Engineering retrospective (Matt Lawhon, Filip Ryzner, Kousik Rajesh, Chen Yang, Saurabh Vishwas Joshi) on request-level deduplication as "the single highest-impact technique we've deployed to hold costs in check" while scaling their Pinterest Foundation Model (ACM RecSys 2025 oral spotlight) by 100× in transformer dense parameters and 10× in model dimension. The core insight: in a recsys funnel (retrieval → ranking), request-level data — dominated by ~16K-token user sequences that power sequential user-understanding components like the Foundation Model and TransAct — is identical across every candidate item scored in a request, yet without explicit deduplication is stored, loaded, trained on, and served once per item rather than once per request. Three lifecycle stages each get their own deduplication treatment: storage (sort by user/request ID in Iceberg → 10–50× columnar compression on user-heavy columns), training (request-sorted batches break the IID assumption — fixed with SyncBatchNorm for ranking + user-level masking for retrieval), and serving (two-tower retrieval is deduplicated by construction; ranking gets DCAT — Deduplicated Cross-Attention Transformer with cached user-history KV + per-candidate cross-attention via custom Triton kernels). Net wins (US, 2025, Pinterest internal data, citation "²"): 10–50× storage compression, 4× retrieval training speedup, ~2.8× ranking training speedup (40% from deduplicated data loading × 2× from DCAT), and 7× ranking serving throughput — the envelope that "made it possible to deploy a 100× larger model without proportional serving cost increases."
Key takeaways
- Request-level deduplication is a cross-cutting technique, not a single optimisation. "The same fundamental redundancy exists at every layer." Storage, training, and serving all duplicate user-request data once per candidate item; each needs its own dedup mechanism, but the payoff compounds across the stack because storage compression feeds data-pipeline speed, training speedups feed experimentation velocity, and serving throughput feeds capacity for the next model-scaling round (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication).
- User sequences are the weight to amortise. "Request-level data is massive. It largely consists of user sequences, approximately 16K tokens encoding all actions a user has taken on the platform." These sequences power sequential understanding components like the Pinterest Foundation Model (arXiv 2507.12704) and TransAct (arXiv 2506.02267). Each sequence is "duplicated identically for every candidate item scored, hundreds to thousands of copies per request." Canonical concepts/request-level-deduplication target (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication).
- Storage: sort by user + request ID + Iceberg → 10–50× columnar compression. "By leveraging Apache Iceberg with user ID and request ID based sorting, we achieve 10–50x storage compression on user-heavy feature columns.²" Mechanism: "when rows sharing the same request are physically co-located, columnar compression algorithms handle the deduplication automatically." Downstream wins — bucket joins ("matching keys are co-located, eliminating expensive shuffle operations"), efficient backfills (update affected user segments without reprocessing full datasets), incremental feature engineering (append columns to existing row groups), and user-level stratified sampling ("ensuring training datasets maintain proper diversity without over-representing highly active Pinners"). Canonical patterns/sort-by-request-id-for-columnar-compression wiki instance (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication).
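The co-location mechanism is easy to reproduce in miniature. In this illustrative sketch (not Pinterest's pipeline), `zlib` stands in for the per-column codec in Parquet/Iceberg, and the row counts and sequence sizes are made up: a generic codec can only exploit duplicates that fall inside its compression window, so physically sorting rows by request ID is what makes the dedup "automatic".

```python
import random
import zlib

random.seed(0)

# Hypothetical user-sequence feature column: 200 requests, each duplicated
# for ~100 candidate items (one row per user-item pair, as in the post).
rows = []
for req_id in range(200):
    user_seq = f"user-{req_id}:" + ",".join(
        str(random.randint(0, 9999)) for _ in range(64)
    )
    rows += [(req_id, user_seq)] * 100  # same sequence copied per candidate

shuffled = rows[:]        # production logs arrive roughly in arrival order
random.shuffle(shuffled)

def column_bytes(rs):
    # Serialise just the user-sequence column, as a columnar file would.
    return "\n".join(seq for _, seq in rs).encode()

# zlib's back-reference window is 32 KB: duplicates that sit next to each
# other compress to a few bytes; duplicates scattered across the file don't.
unsorted_size = len(zlib.compress(column_bytes(shuffled)))
sorted_size = len(zlib.compress(column_bytes(sorted(shuffled, key=lambda r: r[0]))))

print(f"unsorted: {unsorted_size} B, sorted: {sorted_size} B, "
      f"ratio: {unsorted_size / sorted_size:.1f}x")
```

The exact ratio depends on codec, window size, and duplication factor; the point is only that sorting turns row-level duplication into something the codec removes for free.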
- Training correctness — IID disruption is the first failure mode. Request-sorted batches break the IID assumption: batches become concentrated around fewer users, batch-level statistics fluctuate dramatically, "each gradient update is computed from a less representative slice of the data: the model sees a noisier, more biased view of the training distribution, which slows convergence and degrades final quality." Concrete observed impact: "1–2% regressions on key offline evaluation metrics in our ranking models.²" Root cause isolated to BatchNorm — standard BatchNorm computes mean/variance on each device's local batch, which is now dominated by one power user (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication).
- SyncBatchNorm is a one-line fix for correlated batches. "SyncBatchNorm aggregates statistics across all devices before normalization. This effectively increases the 'statistical batch size' used for computing means and variances, even though each device still processes its local request-sorted batch." Result: "this simple one-line change fully recovered the performance gap. The communication overhead of synchronizing statistics across devices was negligible compared to the training speedups gained from deduplicated computation." Canonical patterns/syncbatchnorm-for-correlated-batches instance — applicable anywhere local batches become non-IID after a dataset reordering (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication).
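Why aggregating statistics fixes this can be seen with a stdlib-only sketch (an illustration of the "statistical batch size" argument, not Pinterest's training code; the per-user offsets are invented): when each device's batch is dominated by one power user, local BatchNorm statistics vary wildly between devices, while the synchronized (pooled) statistics recover the population mean.

```python
import random
import statistics

random.seed(1)

# Hypothetical setup: 4 devices, each local batch dominated by one power
# user whose feature values share a user-specific offset — mimicking the
# non-IID, request-sorted batches described above.
user_offsets = [0.0, 3.0, -2.0, 5.0]
device_batches = [
    [off + random.gauss(0, 1) for _ in range(256)] for off in user_offsets
]

# Standard BatchNorm: each device normalises with its own local statistics,
# so the "mean" each device subtracts is really one user's offset.
local_means = [statistics.fmean(b) for b in device_batches]
spread = max(local_means) - min(local_means)

# SyncBatchNorm (conceptually): all-reduce the statistics first, so every
# device normalises with the mean/variance of the union of local batches.
pooled = [x for b in device_batches for x in b]
synced_mean = statistics.fmean(pooled)

print(f"local means: {[round(m, 2) for m in local_means]}")
print(f"spread across devices: {spread:.2f}, synced mean: {synced_mean:.2f}")
```

In PyTorch the one-line change the post alludes to is `torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)`, which swaps every BatchNorm layer for its cross-device variant.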
- Retrieval correctness — false negatives jump from ~0% to ~30% with request-sorted batches. With IID sampling, the probability that a randomly sampled in-batch negative is actually a positive for the anchor user is negligible: users engage with a tiny fraction of items. With request-sorted data, "batches are concentrated around fewer users, and each user may have dozens or hundreds of engagements grouped together. Many in-batch 'negatives' are actually items the user engaged with, they're false negatives. The false negative rate jumps from ~0% with IID sampling to as high as ~30% with request-sorted data, depending on the number of unique users per batch.²" Training the model to push apart items the user actually engaged with "actively degrades retrieval quality" (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication).
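The dependence on unique users per batch falls out of simple counting, which a toy simulation makes concrete (all sizes here — 1M items, 1024-row batches, 300 engagements per user, 4 unique users in the sorted batch — are hypothetical, not Pinterest's):

```python
import random

random.seed(2)

NUM_ITEMS = 1_000_000
BATCH = 1024
SEQ = 300  # hypothetical engagements per user

def make_user(uid):
    return uid, random.sample(range(NUM_ITEMS), SEQ)

def fn_rate(rows):
    # rows: (user_id, item_id) positives forming one batch; every other
    # row's item serves as an in-batch negative for each anchor row.
    engaged = {}
    for u, i in rows:
        engaged.setdefault(u, set()).add(i)
    false_neg = total = 0
    for a, (u, _) in enumerate(rows):
        for b, (_, j) in enumerate(rows):
            if a == b:
                continue
            total += 1
            false_neg += j in engaged[u]  # "negative" the anchor engaged with
    return false_neg / total

# IID batch: 1024 distinct users, one engagement each -> collisions ~never.
iid_rows = []
for uid in range(BATCH):
    u, items = make_user(uid)
    iid_rows.append((u, random.choice(items)))

# Request-sorted batch: only 4 unique users, 256 engagements each -> every
# same-user pair is a false negative, roughly 255/1023 of all negatives.
sorted_rows = []
for uid in range(4):
    u, items = make_user(10_000 + uid)
    sorted_rows += [(u, i) for i in items[:BATCH // 4]]

iid_fnr = fn_rate(iid_rows)
sorted_fnr = fn_rate(sorted_rows)
print(f"IID: {iid_fnr:.4f}, request-sorted: {sorted_fnr:.2f}")
```

With U unique users evenly filling a batch of B rows, the same-user false-negative floor is (B/U − 1)/(B − 1), which is why the observed rate "depends on the number of unique users per batch".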
- User-level masking — extend identity masking to same-user negatives. Standard InfoNCE with logit correction becomes a variant where candidates are only valid negatives if `x_k ≠ x_i` (the candidate's user differs from the anchor's user). "This simple masking change allowed us to successfully adopt request-sorted data for retrieval model training while preserving model quality." Canonical patterns/user-level-negative-masking-infonce wiki instance (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication).
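A minimal sketch of the masking variant (my own stdlib illustration, not Pinterest's loss code, and it omits their logit correction): row i's positive is column i, and any other column whose user matches the anchor's user is dropped from the denominator instead of being contrasted against.

```python
import math

def masked_infonce(scores, users):
    """In-batch InfoNCE where row i's positive is column i, and column k
    (k != i) is excluded whenever users[k] == users[i] — same-user
    candidates are likely engagements, not valid negatives.
    scores[i][k] is the similarity between anchor user i and item k."""
    losses = []
    for i, row in enumerate(scores):
        kept = [
            (k, s) for k, s in enumerate(row)
            if k == i or users[k] != users[i]  # user-level mask
        ]
        denom = sum(math.exp(s) for _, s in kept)
        losses.append(-math.log(math.exp(row[i]) / denom))
    return sum(losses) / len(losses)

# Toy batch: rows 0 and 1 come from the same user, so item 1 scores high
# against anchor 0 — without masking it would be a false negative.
users = ["A", "A", "B"]
scores = [
    [5.0, 4.8, 0.1],
    [4.7, 5.1, 0.2],
    [0.0, 0.3, 4.9],
]
print(masked_infonce(scores, users))
```

Without the mask, the 4.8 logit in row 0 would dominate the denominator and push apart an item the user actually engaged with; with the mask it simply never contributes gradient.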
- Training throughput — dedup stays end-to-end; rehydrate on GPU. "Our data loading infrastructure, shared across ranking and retrieval models, is designed to maintain deduplication as long as possible in the pipeline. All preprocessing and feature transformations operate on deduplicated request-level data. We only reduplicate (expand) at the very end, on GPU or directly in the model's forward pass." Minimises CPU-to-GPU transfer + memory allocation. Canonical patterns/deferred-reduplication-at-gpu wiki instance (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication).
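The pipeline shape can be sketched as follows (a hypothetical stand-in: in practice the final gather runs on the accelerator, typically as an indexed tensor lookup): the loader ships R unique request payloads plus a B-length row-to-request index, all transforms run on the R payloads, and only the last step expands back to B rows.

```python
# Hypothetical sketch of deferred reduplication: transforms run once per
# unique request; the gather back to per-row layout happens last (on GPU /
# in the forward pass in Pinterest's setup).

def preprocess(request_payload):
    # Stand-in for feature transforms that run once per unique request.
    return [x * 0.1 for x in request_payload]

unique_requests = [[1, 2, 3], [4, 5, 6]]   # R = 2 payloads
row_to_request = [0, 0, 0, 1, 1]           # B = 5 user-item rows

transformed = [preprocess(r) for r in unique_requests]  # R transforms, not B

# Final step only: gather to per-row layout for the per-item model inputs.
expanded = [transformed[idx] for idx in row_to_request]
print(expanded)
```

Note the gather is an index, not a copy, until something per-row actually needs the data — which is exactly why deferring it to the accelerator saves CPU-to-GPU transfer and host memory.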
- Retrieval — two-tower architecture is deduplicable by definition. "Achieving request-level compute deduplication in retrieval models is straightforward thanks to the two-tower architecture. Since the user tower has no item dependencies by definition, we rewrite the forward pass to run the user tower on the deduplicated batch of R unique requests rather than the full batch of B user-item pairs. The item tower continues to operate on the full batch. Gradients for the user tower are computed at the deduplicated level and appropriately accumulated." Serving is naturally deduplicated too ("we embed the user once and search against the item index. No changes were needed."). Extends concepts/two-tower-architecture with the training-time deduplication rationale (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication).
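The rewritten forward pass amounts to the following shape (a toy sketch with stubbed towers, not Pinterest's model; gradient accumulation across the duplicated rows is handled by autograd through the gather in a real framework):

```python
# Deduplicated two-tower forward: the expensive user tower sees R unique
# requests, the cheap item tower sees all B candidates, and an index
# gathers user embeddings back to B rows for scoring.

def user_tower(request_features):   # expensive: ~16K-token user sequences
    return [f * 2.0 for f in request_features]

def item_tower(item_features):      # cheap, per-candidate
    return [f + 1.0 for f in item_features]

unique_requests = [[0.5, 1.0], [2.0, 0.25]]   # R = 2 unique requests
row_to_request = [0, 0, 1]                    # B = 3 user-item pairs
items = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

user_emb = [user_tower(r) for r in unique_requests]  # R forward passes
item_emb = [item_tower(i) for i in items]            # B forward passes

scores = [
    sum(u * v for u, v in zip(user_emb[row_to_request[b]], item_emb[b]))
    for b in range(len(items))
]
print(scores)
```

The saving scales with B/R: with hundreds to thousands of candidates per request, the user tower does orders of magnitude less work than the naive per-pair formulation.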
- Ranking — DCAT (Deduplicated Cross-Attention Transformer) for attention-coupled workloads. Ranking transformers have item dependencies: each candidate attends to user history, coupling request-level + item-level computation. DCAT splits the transformer into: (1) Context — apply the transformer to the user-history sequence once per deduplicated request, caching keys/values from each layer; (2) Crossing — each candidate performs cross-attention with the cached user-history KV, "reusing the deduplicated context computation." Implemented with custom Triton kernels for training + serving. "Achieved significant throughput gains over standard self-attention with FlashAttention." Canonical patterns/cached-kv-cross-attention-for-deduplication instance. Described in detail in the Pinterest Foundation Model paper (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication).
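A single-head, stdlib-only sketch of the context/crossing split (everything here — identity-stubbed K/V projections, toy dimensions — is my own simplification; Pinterest's version is multi-layer and implemented in custom Triton kernels):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    # Scaled dot-product attention of one query over cached keys/values.
    d = len(query)
    weights = softmax([
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]

# Context phase: run once per deduplicated request over the user history
# and cache keys/values (projections stubbed as identity for brevity).
user_history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 3-token history
cached_k = user_history
cached_v = user_history

# Crossing phase: every candidate cross-attends to the same cached context,
# so the per-request transformer work is paid once, not once per candidate.
candidates = [[2.0, 0.0], [0.0, 2.0]]
outputs = [attend(q, cached_k, cached_v) for q in candidates]
print(outputs)
```

The contrast with plain self-attention over a per-candidate copy of the history is that here the O(history²) context work is amortised across all candidates; only the O(history) cross-attention is paid per candidate.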
- Production impact — load-bearing operational numbers. Taken together, request-level deduplication delivered (Pinterest internal data, US, 2025, citation "²"):
- Storage: 10–50× compression on user-heavy feature columns via Iceberg + request sorting.
- Training: 4× end-to-end speedup for retrieval; ~2.8× speedup for ranking (40% from deduplicated data loading × 2× from DCAT cross-attention).
- Serving: 7× increase in ranking serving throughput — "what made it possible to deploy a 100× larger model without proportional serving cost increases, absorbing the full Foundation Model scaleup while holding infrastructure budgets in check."
(Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication)
Systems
- systems/pinterest-foundation-model — Pinterest's recsys foundation model; ACM RecSys 2025 oral spotlight. 100× transformer dense parameter growth + 10× model-dimension growth over prior ranking models. The scaling driver that motivated the deduplication work. Uses DCAT at serving time. (arXiv 2507.12704)
- TransAct — Pinterest's sequential user-action Transformer model (arXiv 2506.02267). Named alongside the Foundation Model as a primary consumer of ~16K-token user sequences.
- DCAT — Deduplicated Cross-Attention Transformer — Pinterest's ranking-transformer architecture separating context (transform user history once per request, cache KV) from crossing (each candidate cross-attends to cached user-history KV). Implemented via custom Triton kernels. Achieved ~2× training + 7× serving gain over standard self-attention with FlashAttention.
- systems/pinterest-ads-engagement-model — Pinterest's unified CTR-prediction model (2026-03-03 post); canonical home for request-level embedding broadcast, a related but serving-only deduplication technique. The 2026-04-13 post generalises dedup across all three lifecycle stages.
- systems/apache-iceberg — Table format over Parquet on object storage; Pinterest sorts rows by user ID + request ID so that columnar compression absorbs the duplicated user sequences automatically. Canonical Iceberg-as-training-dataset-substrate-with-sort-order-as-optimisation wiki instance, distinct from prior telemetry / CDC-sink / table-format instances.
- systems/pytorch — Training + serving substrate (implicit; consistent with prior Pinterest ML platform posts).
- systems/transformer — The underlying architecture being scaled. DCAT is a redesign of how transformer attention couples to the batch structure.
- systems/flash-attention — The baseline self-attention implementation that DCAT beats at ranking workloads by exploiting request-level KV reuse. Pinterest's explicit comparator.
- Triton — OpenAI's GPU kernel DSL; Pinterest writes custom training + serving kernels for DCAT's cached-KV cross-attention shape.
Concepts / patterns extracted
New concepts:
- concepts/request-level-deduplication — the cross-cutting meta-technique: identify duplicated request-level data across lifecycle stages and dedup at each one.
- concepts/iid-disruption-from-request-sorted-data — request-sorted batches concentrate around fewer users, breaking IID assumptions that BatchNorm + random negative sampling rely on.
- concepts/in-batch-negative-false-negative — in two-tower retrieval training, when other items in the batch are used as negatives, a "negative" that's actually a positive for the anchor user pushes embeddings in the wrong direction. Rate jumps from ~0% (IID) to ~30% (request-sorted) on Pinterest workloads.
- concepts/bucket-join — join performed over pre-sorted / co-located data so that matching keys live in the same file group, eliminating shuffle.
New patterns:
- patterns/sort-by-request-id-for-columnar-compression — sort training-dataset rows by the entity that is duplicated in feature columns (request ID / user ID); let the columnar codec handle the dedup.
- patterns/syncbatchnorm-for-correlated-batches — when training-batch rows are correlated (shared user / request / session), aggregate BatchNorm statistics cross-device before normalisation so the "statistical batch size" is the union of devices.
- patterns/user-level-negative-masking-infonce — extend InfoNCE identity-masking to exclude candidates that share the anchor's user (or query / context), preventing false-negative gradient from contrasting user against their own engagements.
- patterns/cached-kv-cross-attention-for-deduplication — split a transformer into a context pass that runs once per unique request (cache keys/values) + a crossing pass where candidates cross-attend to the cache; implement with custom kernels for dense item-candidate fan-in.
- patterns/deferred-reduplication-at-gpu — keep request-level data deduplicated as far into the pipeline as possible; expand back to per-row layout only at the last moment on the accelerator.
Extends:
- patterns/request-level-user-embedding-broadcast — the 2026-03-03 unified-ads-engagement-model ingest canonicalised request-level broadcast at serving time only. This post establishes deduplication as a cross-cutting discipline spanning storage + training + serving, with broadcast / cross-attention / columnar-compression as the stage-specific instantiations.
- concepts/two-tower-architecture — adds the training-time user-tower deduplication rationale: run the user tower on `R` unique requests, the item tower on the full `B` user-item pairs, and accumulate gradients at the deduplicated level.
- concepts/long-user-sequence-modeling — quantifies the scale of the sequences ("approximately 16K tokens encoding all actions a user has taken on the platform") and the redundancy amplification in candidate-scoring batches.
- systems/apache-iceberg — third distinct Pinterest instance on the wiki: table-format (2024-05-14 HBase), quota-telemetry substrate (2026-02-24 Piqama), ML training-dataset substrate with sort-key-as-optimisation (this post).
Architectural numbers
- Pinterest Foundation Model scaling: 100× transformer dense parameter count, 10× model dimension (vs. prior ranking models).
- User sequence size: ~16K tokens.
- Per-request amplification: user sequence duplicated "hundreds to thousands of copies per request."
- Storage compression: 10–50× on user-heavy feature columns (Iceberg + request-ID sort).
- Training IID-disruption regression (pre-fix): 1–2% offline-metric regression on ranking models.
- In-batch false-negative rate: ~0% (IID) → up to ~30% (request-sorted), depending on unique users per batch.
- Training speedups (post-fix): 4× retrieval end-to-end; ~2.8× ranking end-to-end (40% from deduplicated data loading × 2× from DCAT).
- Serving throughput: 7× ranking serving throughput gain from DCAT.
All numbers carry footnote "²" in the original — Pinterest internal data, US, 2025.
Caveats
- Aggregate training/serving numbers only, no p50 / p99 latency percentiles, no per-model breakdown, no HBM / throughput / power numbers.
- No DCAT architectural detail disclosed beyond the two-phase context/crossing split — the paper linkage (arXiv 2507.12704) is the follow-up; kernel shape, sequence-length handling, layer-count-vs-speedup curve, batch-size + unique-user stats not in this post.
- Triton kernel specifics not disclosed — implementation tradeoffs vs. FlashAttention's fused matmul + online softmax are asserted qualitatively ("significant throughput gains") but not quantified against a FlashAttention baseline with the same context cache.
- Row-sort cost not disclosed — Pinterest writes that "bucket joins" and "efficient backfills" benefit from sorted data, but the write-side cost of maintaining user-ID + request-ID sorted Iceberg tables is not addressed; likely absorbed into the existing Iceberg pipeline.
- SyncBatchNorm overhead note is qualitative — "negligible compared to the training speedups gained", but no cross-device-bandwidth numbers.
- False-negative rate of ~30% is "depending on the number of unique users per batch" — the distribution and the hyperparameter (users per batch) aren't given.
- User-level masking correctness framing is intuitive but the post doesn't show training-curve evidence pre/post the masking change; it asserts parity recovery.
- Scope is sequential user-understanding components (Foundation Model, TransAct) — not claimed to apply to tabular CTR models without such components; the 7× serving throughput is a ranking-with-DCAT number.
- Same-user masking trade-off — aggressive user-level masking can shrink effective negative-sample pool size; the post doesn't discuss whether Pinterest tuned batch size up to compensate.
Source
- Original: https://medium.com/pinterest-engineering/scaling-recommendation-systems-with-request-level-deduplication-93bd514142d9?source=rss----4c5a5f6279b6---4
- Raw markdown: raw/pinterest/2026-04-13-scaling-recommendation-systems-with-request-level-deduplicat-3dad2699.md
Related
- companies/pinterest
- systems/pinterest-ads-engagement-model — serving-time embedding-broadcast sibling.
- sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces — the serving-time dedup predecessor post (request-level user-embedding broadcasting).
- sources/2026-04-07-pinterest-evolution-of-multi-objective-optimization-at-pinterest-home — different stage (MOO / slate) but same Pinterest recsys stack.
- sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr — two-tower L1 CVR debugging on a different axis.