PATTERN Cited by 1 source
Sort by request ID for columnar compression¶
Problem¶
In a recommendation-system training dataset, each row looks like [user, request, item, label]. For a single request, the user features (especially long user-history sequences — ~16K tokens at Pinterest) are identical across all rows with the same request_id. Without an explicit sort order, the columnar codec sees those identical values scattered across the file by whatever default ordering the writer produced — row groups mix requests, duplicate user-sequence tokens don't land adjacent, and columnar compression underperforms dramatically.
The same user sequence is therefore written hundreds to thousands of times per request — once per candidate scored — bloating training datasets and inflating every downstream data-engineering cost (IO, shuffle, backfill, scan).
Pattern¶
Sort the training dataset by (user_id, request_id) before writing columnar files. With all rows for a single request physically adjacent in the file:
- The columnar codec's run-length / dictionary / delta encoders see consecutive duplicate values on user-heavy columns and compress them automatically — no schema change, no application code change, no explicit deduplication logic.
- Per-row-group statistics become tight (min/max/distinct-count on user columns collapse to 1 per group), which also helps skip-scan predicates.
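A minimal write-path sketch, assuming Spark with the Iceberg SQL extensions enabled; the table name training.rec_examples, the SparkSession spark, and the DataFrame df are illustrative, not Pinterest's actual pipeline:

```python
# Option 1: declare the sort order on the table itself, so writers that
# honor Iceberg table properties maintain (user_id, request_id) clustering.
spark.sql("""
    ALTER TABLE training.rec_examples
    WRITE ORDERED BY user_id, request_id
""")

# Option 2: sort explicitly at write time. Cluster rows by user first,
# then order within each task so duplicate runs land adjacent in each file.
(df.repartition("user_id")
   .sortWithinPartitions("user_id", "request_id")
   .writeTo("training.rec_examples")
   .append())
```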
At Pinterest, on Apache Iceberg tables backed by Parquet, this produces 10–50× storage compression on user-heavy feature columns (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication):
"By leveraging Apache Iceberg with user ID and request ID based sorting, we achieve 10–50x storage compression on user-heavy feature columns.² When rows sharing the same request are physically co-located, columnar compression algorithms handle the deduplication automatically."
Load-bearing insight — let the codec do the work¶
The pattern doesn't explicitly deduplicate anything. The model and the training pipeline still see one row per (request, item, label) triple — the logical dedup happens at the encoded-bytes level, invisible to consumers. No application-level code needs to collapse rows before training; no join / group-by / materialised-view rewiring. The pattern is purely a write-time sort-order choice.
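The mechanism is easy to observe in isolation. A toy, self-contained sketch (synthetic data and pyarrow, not Pinterest's stack): the same rows written shuffled vs. sorted, where only the sorted file gives the dictionary/RLE encoders long runs of identical values to collapse.

```python
# Toy demonstration: identical rows written shuffled vs. sorted by request.
# Parquet dictionary-encodes the repeated user-sequence column either way,
# but only the sorted file yields long runs of identical dictionary indices
# for RLE to collapse. Exact ratios depend on codec and encoding settings.
import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

rng = np.random.default_rng(0)
n_requests, candidates = 500, 100
request_id = np.repeat(np.arange(n_requests), candidates)

# One synthetic "user history" string per request, duplicated across all
# of that request's candidate rows.
vocab = np.array([f"tok{t}" for t in range(10_000)])
seqs = [" ".join(rng.choice(vocab, size=128)) for _ in range(n_requests)]
table = pa.table({
    "request_id": request_id,
    "user_seq": pa.array([seqs[r] for r in request_id]),
    "label": rng.integers(0, 2, size=len(request_id)),
})

pq.write_table(table.take(rng.permutation(table.num_rows)),
               "/tmp/shuffled.parquet")       # duplicates scattered
pq.write_table(table, "/tmp/sorted.parquet")  # duplicates adjacent per request
print(os.path.getsize("/tmp/shuffled.parquet"),
      os.path.getsize("/tmp/sorted.parquet"))  # sorted is markedly smaller
```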
Downstream benefits (beyond raw storage)¶
Pinterest documents additional wins the sort order unlocks (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication):
- Bucket joins — "matching keys are co-located, eliminating expensive shuffle operations."
- Efficient backfills — "we can update only affected user segments rather than reprocessing entire datasets."
- Incremental feature engineering — "adding new request-level features becomes a localized operation: we can append new columns to existing row groups without duplicating the entire dataset."
- Stratified sampling — "request-sorted data enables user-level sampling, ensuring training datasets maintain proper diversity without over-representing highly active Pinners."
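One plausible realization of the user-level sampling win (an illustrative PySpark sketch; the DataFrame df and column names are assumptions, not from the source):

```python
# Sample whole users via a hash, so highly active Pinners contribute all of
# their rows or none, rather than dominating a row-level sample. On the
# sorted table, each kept user's rows also stay physically co-located.
from pyspark.sql import functions as F

keep_pct = 10  # retain ~10% of users, with every row they appear in
sampled = df.where(F.abs(F.hash("user_id")) % 100 < keep_pct)
```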
Trade-off — breaks IID¶
The sort order destroys row independence within training batches. Rows for the same user become adjacent, so a mini-batch drawn sequentially from the sorted file is highly correlated. This breaks two ML assumptions that most training infrastructure relies on:
- BatchNorm — local-batch statistics become dominated by one user's feature distribution. See concepts/iid-disruption-from-request-sorted-data and the fix in patterns/syncbatchnorm-for-correlated-batches.
- In-batch negatives for retrieval — other rows in the batch are likely positives for the anchor user. See concepts/in-batch-negative-false-negative and the fix in patterns/user-level-negative-masking-infonce.
The pattern is therefore only viable when paired with the IID-disruption correctness fixes, sketched below; naive adoption costs roughly 1–2% in offline ranking metrics and up to a 30% false-negative rate in retrieval.
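Both fixes are small in code. A hedged PyTorch sketch of the pair (illustrative names such as model and user_ids; see the linked patterns for the full treatment):

```python
import torch
import torch.nn.functional as F

# Fix 1: swap per-GPU BatchNorm statistics for cross-replica statistics, so
# one user's correlated rows on a single GPU can't skew normalization.
# Assumes an existing `model` trained under torch.distributed data parallelism.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Fix 2: InfoNCE with same-user masking. In-batch "negatives" drawn from the
# anchor's own user are masked out instead of being scored as negatives.
def masked_infonce(query_emb, item_emb, user_ids, temperature=0.07):
    logits = query_emb @ item_emb.T / temperature        # [B, B] similarities
    same_user = user_ids[:, None] == user_ids[None, :]   # [B, B] bool mask
    same_user.fill_diagonal_(False)                      # keep the true positive
    logits = logits.masked_fill(same_user, float("-inf"))
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```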
Generalisations¶
The pattern applies whenever a high-cardinality key has heavy shared features across the rows it groups:
- Query-sorted search logs — query text + embeddings are shared across all candidate results per query.
- Session-sorted event logs — session context is shared across all events per session.
- Job-sorted execution logs — job parameters are shared across all tasks per job.
In each case, the sort-key choice trades storage compression + bucket-join wins against the IID-disruption tax; the correctness-fix toolkit (SyncBatchNorm, same-key masking) applies identically.
Caveats¶
- Write-side sort cost is not disclosed by Pinterest — the Iceberg write pipeline has to maintain sort order, and if writes are distributed, sorting involves a shuffle.
- Compression ratio is workload-dependent — 10–50× is for Pinterest's user-heavy feature columns; the ratio collapses if user features don't dominate the row width.
- Column-order matters — gains concentrate on columns that repeat within each (user, request) group; naïve mixed columns may need hive-style partitioning or clustering on top of the sort to see similar ratios.
- Sort-key choice is load-bearing — sorting by user_id alone gives duplicate-run compression but loses per-request co-location (request-level features still scatter); Pinterest uses (user_id, request_id) precisely to get both.
- Doesn't apply to tabular models without sequence features — if rows don't share heavy columns, there's nothing to compress beyond general-purpose gains.
Seen in¶
- 2026-04-13 Pinterest — Scaling Recommendation Systems with Request-Level Deduplication (sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication) — canonical wiki pattern instance: Iceberg + (user_id, request_id) sort for 10–50× compression on user-heavy columns; bucket-join + backfill + stratified-sampling as compound wins; SyncBatchNorm + user-level masking as the paired correctness fixes.
Related¶
- concepts/request-level-deduplication — the overarching discipline this pattern is the storage-stage instantiation of.
- concepts/iid-disruption-from-request-sorted-data — the correctness risk this pattern introduces.
- concepts/bucket-join — the shuffle-elimination win downstream.
- patterns/syncbatchnorm-for-correlated-batches — the ranking-training fix for the IID break.
- patterns/user-level-negative-masking-infonce — the retrieval-training fix for the IID break.
- systems/apache-iceberg / systems/apache-parquet — the substrate Pinterest applies this on.
- companies/pinterest