PATTERN

Sort by request ID for columnar compression

Problem

In a recommendation-system training dataset, each row looks like [user, request, item, label]. For a single request, the user features (especially long user-history sequences — ~16K tokens at Pinterest) are identical across all rows with the same request_id. Without explicit action, the columnar codec sees those identical values scattered across the file by whatever default ordering the writer produced — row groups mix requests, duplicate user-sequence tokens don't land adjacent, and columnar compression underperforms dramatically.

The identical user sequence is therefore written hundreds to thousands of times per request — once per candidate scored — bloating training datasets and inflating every downstream data-engineering cost (IO, shuffle, backfill, scan).
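Back-of-envelope arithmetic for the duplication cost. The ~16K-token sequence length comes from the text above; bytes-per-token and candidates-per-request are assumed illustrative values, not Pinterest figures:

```python
# Illustrative duplication math — only the 16K sequence length is from the
# source; the other constants are assumptions for the sketch.
TOKENS_PER_USER_SEQUENCE = 16_000
BYTES_PER_TOKEN = 4                  # e.g. one int32 token id (assumption)
CANDIDATES_PER_REQUEST = 500         # "hundreds to thousands" per request

seq_bytes = TOKENS_PER_USER_SEQUENCE * BYTES_PER_TOKEN
written = seq_bytes * CANDIDATES_PER_REQUEST   # same bytes, once per row
needed = seq_bytes                             # the sequence is identical

print(f"written per request: {written / 1e6:.0f} MB, "
      f"logically needed: {needed / 1e6:.3f} MB, "
      f"duplication factor: {written // needed}x")
```

Under these assumptions a single request writes ~32 MB of user-sequence bytes where 64 KB would suffice — a 500x duplication factor before any compression.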

Pattern

Sort the training dataset by (user_id, request_id) before writing columnar files. This sort order makes all rows for a single request physically adjacent in the file:

  • The columnar codec's run-length / dictionary / delta encoders see consecutive duplicate values on user-heavy columns and compress them automatically — no schema change, no application code change, no explicit deduplication logic.
  • Per-row-group statistics become tight (on user columns, distinct count collapses to 1 per group and min/max to a single value), which also helps skip-scan predicates.
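Why adjacency matters can be sketched with a toy run-length encoder — illustrative only; Parquet's real RLE/dictionary/delta encoders are more sophisticated, but the effect of sorting is the same:

```python
from itertools import groupby
import random

def rle(values):
    """Toy run-length encoding: list of (value, run_length) pairs."""
    return [(v, sum(1 for _ in g)) for v, g in groupby(values)]

# Toy dataset: 4 requests x 100 candidate rows each; the user-sequence
# value repeats within a request (stand-in for the real 16K-token column).
rows = [{"request_id": r, "user_seq": f"seq-{r}", "item": i}
        for r in range(4) for i in range(100)]

random.seed(0)
shuffled = random.sample(rows, len(rows))                  # writer's default order
sorted_rows = sorted(rows, key=lambda x: x["request_id"])  # the pattern

runs_shuffled = len(rle([r["user_seq"] for r in shuffled]))
runs_sorted = len(rle([r["user_seq"] for r in sorted_rows]))
print(runs_shuffled, runs_sorted)   # many runs vs. exactly 4
```

Sorted, the column collapses to one run per request; shuffled, the encoder sees hundreds of short runs over the same data — with no change to the rows themselves.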

At Pinterest, on Apache Iceberg tables backed by Parquet, this produces 10–50× storage compression on user-heavy feature columns (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication):

"By leveraging Apache Iceberg with user ID and request ID based sorting, we achieve 10–50x storage compression on user-heavy feature columns. When rows sharing the same request are physically co-located, columnar compression algorithms handle the deduplication automatically."

Load-bearing insight — let the codec do the work

The pattern doesn't explicitly deduplicate anything. The model and the training pipeline still see one row per (request, item, label) triple — the logical dedup happens at the encoded-bytes level, invisible to consumers. No application-level code needs to collapse rows before training; no join / group-by / materialised-view rewiring. The pattern is purely a write-time sort-order choice.

Downstream benefits (beyond raw storage)

Pinterest documents additional wins the sort order unlocks (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication):

  • Bucket joins: "matching keys are co-located, eliminating expensive shuffle operations."
  • Efficient backfills: "we can update only affected user segments rather than reprocessing entire datasets."
  • Incremental feature engineering: "adding new request-level features becomes a localized operation: we can append new columns to existing row groups without duplicating the entire dataset."
  • Stratified sampling: "request-sorted data enables user-level sampling, ensuring training datasets maintain proper diversity without over-representing highly active Pinners."

Trade-off — breaks IID

The sort order destroys row independence within training batches. Rows for the same user become adjacent, so a mini-batch drawn sequentially from the sorted file is highly correlated. This breaks two ML assumptions that most training infrastructure relies on:

  • Mini-batches are IID samples — batch-level statistics (e.g. batch normalization) computed over a near-single-user batch are biased, which is what fixes like SyncBatchNorm address.
  • In-batch negatives are valid negatives — in retrieval training, other rows in the batch may come from the same user or request and are not true negatives, which is what same-key masking addresses.

The pattern is therefore only viable when paired with the IID-disruption correctness fixes; naive adoption costs 1–2% in offline ranking metrics and up to a 30% false-negative rate in retrieval.
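A minimal sketch of the correlation problem, on toy data (user/request/item counts are illustrative assumptions): sequential mini-batches over the sorted rows collapse to a single user, while uniformly sampled batches do not.

```python
import random

# Toy rows as (user_id, request_id, item): 8 users x 4 requests x 32 items.
rows = [(u, r, item)
        for u in range(8) for r in range(4) for item in range(32)]
rows.sort()                      # the (user_id, request_id) sort order

BATCH = 64
sequential_batch = rows[:BATCH]                 # naive sequential reader
users_seq = {u for u, _, _ in sequential_batch}

random.seed(0)
shuffled_batch = random.sample(rows, BATCH)     # uniform sampling
users_shuf = {u for u, _, _ in shuffled_batch}

print(len(users_seq), len(users_shuf))   # 1 distinct user vs. many
```

Each toy user contributes 128 consecutive rows, so a sequential batch of 64 is entirely one user — exactly the correlation the IID fixes have to compensate for.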

Generalisations

The pattern applies whenever a high-cardinality key has heavy shared features across the rows it groups:

  • Query-sorted search logs — query text + embeddings are shared across all candidate results per query.
  • Session-sorted event logs — session context is shared across all events per session.
  • Job-sorted execution logs — job parameters are shared across all tasks per job.

In each case, the sort-key choice trades storage compression + bucket-join wins against the IID-disruption tax; the correctness-fix toolkit (SyncBatchNorm, same-key masking) applies identically.
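A minimal sketch of same-key masking, the retrieval-side fix named above. The function name and toy logits are illustrative assumptions; real trainers apply the mask on the logits tensor before the softmax loss:

```python
# Same-key masking for in-batch negatives: when a batch contains several
# rows from the same user, those rows must not be scored as negatives for
# each other. Pure-Python illustration of the masking rule.
NEG_INF = float("-inf")

def mask_same_key(logits, keys):
    """Set logits[i][j] to -inf when row j shares row i's key (i != j)."""
    masked = [row[:] for row in logits]
    for i, ki in enumerate(keys):
        for j, kj in enumerate(keys):
            if i != j and ki == kj:
                masked[i][j] = NEG_INF
    return masked

user_ids = ["u1", "u1", "u2"]        # correlated batch: u1 appears twice
logits = [[5.0, 4.0, 1.0],
          [4.0, 5.0, 1.0],
          [1.0, 1.0, 5.0]]
masked = mask_same_key(logits, user_ids)
print(masked[0])   # [5.0, -inf, 1.0] — u1's duplicate is no longer a negative
```

The diagonal (each row's own positive) is left untouched; only cross-row entries sharing a key are excluded, which is what keeps the false-negative rate down on user-sorted batches.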

Caveats

  • Write-side sort cost is not disclosed by Pinterest — the Iceberg write pipeline has to maintain sort order, and if writes are distributed, sorting involves a shuffle.
  • Compression ratio is workload-dependent — 10–50× is for Pinterest's user-heavy feature columns; the ratio collapses if user features don't dominate the row width.
  • Column-order matters — gains concentrate on columns that repeat within each (user, request) group; naïve mixed columns may need hive-style partitioning or clustering on top of the sort to see similar ratios.
  • Sort key choice is load-bearing — sort-by-user_id alone gives duplicate-run compression but loses per-request co-location (request-level features still scatter); Pinterest uses (user_id, request_id) precisely to get both.
  • Doesn't apply to tabular models without sequence features — if rows don't share heavy columns, there's nothing to compress beyond general-purpose gains.
