PATTERN · Cited by 1 source
# Deferred re-duplication at GPU
## Problem
Request-level data in a recommendation-system batch — especially ~16K-token user sequences — is identical across all candidates in a request. Storing, loading, and transforming it deduplicated is cheap; expanding it to one-copy-per-candidate is expensive in two compounding ways:
- CPU-to-GPU transfer cost: the duplicated sequence crosses the PCIe bus N times per request (for N candidates), paying memory bandwidth + transfer latency on each copy.
- Memory allocation overhead: allocating N-row tensors on each side (CPU and GPU) consumes DRAM/HBM + allocator cycles — especially expensive for the wide user-sequence column.
The naïve pipeline materialises the expanded [user_duplicated, item, label] tensor early — during feature-engineering or data-loading — and carries that expanded shape through every downstream stage. Every stage then pays the duplication tax in its own currency (shuffle, serialize, allocate, transfer).
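As a concrete illustration, here is a minimal sketch of that naïve collate step — all names and shapes are illustrative, not Pinterest's actual code:

```python
import torch

def naive_collate(user_seqs, item_feats, counts):
    """Anti-pattern: expand request-level data to per-candidate rows on the CPU.

    user_seqs:  [R, seq_len]  one (wide) user sequence per request
    item_feats: [B, item_dim] one row per candidate, where B = counts.sum()
    counts:     [R]           number of candidates per request
    """
    # The wide user column is duplicated in host DRAM right here, so every
    # downstream stage (shuffle, serialize, allocate, PCIe transfer) carries
    # B copies of it instead of R.
    user_dup = torch.repeat_interleave(user_seqs, counts, dim=0)  # [B, seq_len]
    return user_dup, item_feats
```

The deferred variant drops the `repeat_interleave` from the collate entirely and ships `user_seqs` plus a per-candidate request index instead (see the forward-pass sketch under Pattern below).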
## Pattern
Keep request-level data deduplicated as far into the training pipeline as possible. Expand to per-candidate layout only at the very end — on the GPU, ideally in the model's forward pass.
Pinterest's canonical description:
"Our data loading infrastructure, shared across ranking and retrieval models, is designed to maintain deduplication as long as possible in the pipeline. All preprocessing and feature transformations operate on deduplicated request-level data. We only reduplicate (expand) at the very end, on GPU or directly in the model's forward pass. This minimizes CPU-to-GPU transfer costs and memory allocation overhead." (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication)
The pipeline shape becomes:
```
Storage (deduplicated via sort-order; see patterns/sort-by-request-id-for-columnar-compression)
        ↓
Feature engineering (operates on deduplicated R-row tensors per request)
        ↓
Preprocessing + transforms (still deduplicated)
        ↓
CPU → GPU transfer (R copies, not B)
        ↓
On GPU, in forward pass: expand to B by index-gather / broadcast
(or: consume deduplicated via DCAT etc.)
        ↓
Loss / backward
```
Every stage upstream of the GPU sees R unique requests instead of B user-item pairs. Only the model's forward pass — where expansion is needed for loss computation — sees the per-candidate view.
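A minimal sketch of the GPU-side expansion, assuming the batch carries R deduplicated user rows plus a per-candidate request index — the class and argument names are illustrative, not Pinterest's API:

```python
import torch

class DedupAwareRanker(torch.nn.Module):
    def __init__(self, user_dim: int, item_dim: int, hidden: int = 256):
        super().__init__()
        self.user_encoder = torch.nn.Linear(user_dim, hidden)
        self.item_encoder = torch.nn.Linear(item_dim, hidden)
        self.head = torch.nn.Linear(2 * hidden, 1)

    def forward(self, user_feats, item_feats, request_index):
        # user_feats:    [R, user_dim]  deduplicated -- one row per request
        # item_feats:    [B, item_dim]  one row per (request, candidate) pair
        # request_index: [B]            maps each candidate row to its request row
        user_enc = self.user_encoder(user_feats)            # encode R rows, not B
        user_exp = user_enc.index_select(0, request_index)  # expand R -> B in HBM
        item_enc = self.item_encoder(item_feats)
        return self.head(torch.cat([user_exp, item_enc], dim=-1)).squeeze(-1)
```

The request index itself is cheap to build and ship (`torch.repeat_interleave(torch.arange(R), counts)` yields a `[B]` integer tensor), so only the R wide user rows cross the PCIe bus. Note the sketch also encodes the user column once per request, before expansion, which is a free bonus of keeping the deduplicated shape this late.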
## Why CPU-to-GPU is the critical choke point
At Pinterest's scale, the ~16K-token user sequence column is a multi-MB per-request tensor. Duplicating it N× on the CPU side before DMA balloons both the serialised tensor size and the PCIe transfer volume. The deferred-expansion pattern puts the duplication step inside the GPU (as an index-gather or broadcast op) where:
- HBM bandwidth is 10–100× higher than PCIe.
- Allocator cycles are cheaper (contiguous HBM allocator, pre-warmed pools).
- The expanded tensor is consumed immediately by the next kernel, so it often doesn't need to persist — some architectures like DCAT avoid the explicit expansion entirely by cross-attending against cached deduplicated KV.
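A rough way to see this trade-off on a given machine is a micro-benchmark along these lines — a sketch with illustrative sizes, not a reproduction of Pinterest's measurements:

```python
import torch

def cpu_expand_then_transfer(user_cpu, index_cpu):
    # Naive: duplicate in host DRAM, then push all B wide rows over PCIe.
    return user_cpu.index_select(0, index_cpu).to("cuda", non_blocking=True)

def transfer_then_gpu_expand(user_cpu, index_gpu):
    # Deferred: push only the R unique rows over PCIe, expand in HBM.
    return user_cpu.to("cuda", non_blocking=True).index_select(0, index_gpu)

if torch.cuda.is_available():
    R, N, D = 256, 100, 16_384                 # requests, candidates, row width
    user = torch.randn(R, D).pin_memory()
    index = torch.repeat_interleave(torch.arange(R), N)
    index_gpu = index.cuda()
    for fn, idx in ((cpu_expand_then_transfer, index),
                    (transfer_then_gpu_expand, index_gpu)):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        fn(user, idx)
        end.record()
        torch.cuda.synchronize()
        print(f"{fn.__name__}: {start.elapsed_time(end):.1f} ms")
```

The deferred variant moves N× fewer bytes across PCIe, so its advantage grows with the fan-out N and the row width D.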
## Infrastructure implications
Pinterest frames this as a shared data-loader discipline across both ranking and retrieval models:
"Our data loading infrastructure, shared across ranking and retrieval models, is designed to maintain deduplication..."
This is a load-bearing organisational choice — one data-loader rather than separate loaders per model family — because both families benefit from the same dedup-preserving invariant. Building the dedup primitive into the shared infra (rather than asking each model to handle it) makes the discipline mechanical rather than optional.
## When to apply
Apply whenever all of these hold:
- Per-row fan-out is high — one entity (user, query, session) maps to many rows in the batch.
- Per-entity data is wide / heavy — embedding tables, long sequences, dense feature vectors.
- Downstream stages preserve shape — the pipeline's natural layout isn't forcing expansion for a legitimate reason (e.g., per-row feature engineering).
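To gauge whether the fan-out and width clear the bar, a back-of-envelope estimate of the transfer savings — plain arithmetic, with the example sizes being illustrative rather than Pinterest's numbers:

```python
def dedup_transfer_savings(num_requests: int,
                           candidates_per_request: int,
                           bytes_per_entity: int) -> tuple[int, float]:
    """Bytes saved per batch and the reduction factor (= B/R) from
    shipping R deduplicated rows instead of B expanded ones."""
    b = num_requests * candidates_per_request       # expanded row count B
    naive_bytes = b * bytes_per_entity
    dedup_bytes = num_requests * bytes_per_entity
    return naive_bytes - dedup_bytes, naive_bytes / dedup_bytes

# e.g. 256 requests x 100 candidates, with a ~4 MiB user-sequence row:
saved, factor = dedup_transfer_savings(256, 100, 4 * 2**20)
print(f"{saved / 2**30:.0f} GiB saved per batch, {factor:.0f}x less user data over PCIe")
```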
## When not to apply
- Low fan-out (one candidate per request) — nothing to deduplicate.
- Entity-specific per-row transforms — e.g., per-(user, item) features computed from both sides; dedup is only valid for per-entity features.
- Complex join graphs — features joined from multiple sides at different cardinalities complicate dedup; may be easier to re-duplicate early.
## Compounding effect
Deferred re-duplication is the training-pipeline instantiation of request-level deduplication; it composes with:
- Storage-stage dedup (sort-by-request-id) — the storage layer feeds deduplicated rows; no extra work at load time.
- Serving-stage dedup (DCAT) — the model architecture itself consumes deduplicated KV and never needs a physical expansion.
Together: the user-sequence tensor exists once per request from storage to model-forward, with expansion happening either implicitly (in the attention kernel) or deferred to the last possible moment (on GPU).
## Caveats
- Pinterest doesn't disclose implementation details of the shared data loader — streaming vs. map-style dataset, shard-boundary handling, shuffle-buffer design, index-gather mechanism.
- Cross-entity features (features that depend on both user + item) still have to live at per-row granularity; the deduplication is a per-entity-feature optimisation, not a global restructure (see the sketch after this list).
- Not free — the GPU-side expand step (`torch.index_select` / `gather` / broadcast) consumes HBM bandwidth and may not compose cleanly with all downstream ops; still faster than CPU-side expansion in Pinterest's measured workloads.
- Shape discipline on the data-loader side is a maintenance cost — every new feature transform has to respect the deduplicated-until-GPU invariant or it silently regresses throughput.
- Pipeline fan-out heterogeneity — if batches mix requests with widely varying candidate counts, the B/R ratio, and hence the dedup win, varies per batch.
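To make the cross-entity caveat concrete, a small sketch of why per-(user, item) features can't stay deduplicated — shapes and names are illustrative:

```python
import torch

R, N, D = 4, 3, 8
user = torch.randn(R, D)                       # request-level features, R rows
items = torch.randn(R * N, D)                  # candidate features, B = R*N rows
index = torch.repeat_interleave(torch.arange(R), N)

# Per-entity transform: operates on the deduplicated R rows -- dedup-safe.
user_norm = torch.nn.functional.normalize(user, dim=-1)

# Cross-entity feature: needs both sides of each (user, item) pair, so it
# inherently lives at B rows; a gather (or expansion) is unavoidable here.
affinity = (user_norm.index_select(0, index) * items).sum(-1)   # [B]
```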
## Seen in
- 2026-04-13 Pinterest — Scaling Recommendation Systems with Request-Level Deduplication (sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication) — canonical wiki pattern instance: shared data-loader discipline across ranking + retrieval; preprocessing + feature transforms operate on deduplicated shape; expansion deferred to GPU / forward pass; named as the key component responsible for 40% of the ~2.8× ranking-training speedup (the other 2× comes from DCAT cross-attention).
## Related
- concepts/request-level-deduplication — the overarching discipline.
- patterns/sort-by-request-id-for-columnar-compression — the storage-stage dedup this feeds from.
- patterns/cached-kv-cross-attention-for-deduplication — the serving-stage dedup this ends at (DCAT consumes deduplicated KV without ever needing a physical expansion).
- systems/pinterest-foundation-model — canonical consumer.
- companies/pinterest