PATTERN

Sort by request ID for columnar compression

Problem

In a recommendation-system training dataset, each row looks like [user, request, item, label]. For a single request, the user features (especially long user-history sequences — ~16K tokens at Pinterest) are identical across all rows with the same request_id. Without explicit action, the columnar codec sees those identical values scattered across the file by whatever default ordering the writer produced — row groups mix requests, duplicate user-sequence tokens don't land adjacent, and columnar compression underperforms dramatically.

The identical user sequence is therefore written hundreds to thousands of times per request — once per candidate scored — bloating training datasets and inflating every downstream data-engineering cost (IO, shuffle, backfill, scan).
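Back-of-envelope arithmetic for the duplication cost. The ~16K-token sequence length comes from the text above; bytes-per-token and candidates-per-request are assumed illustrative values, not Pinterest figures:

```python
# Illustrative duplication math — only the 16K sequence length is from the
# source; the other constants are assumptions for the sketch.
TOKENS_PER_USER_SEQUENCE = 16_000
BYTES_PER_TOKEN = 4                  # e.g. one int32 token id (assumption)
CANDIDATES_PER_REQUEST = 500         # "hundreds to thousands" per request

seq_bytes = TOKENS_PER_USER_SEQUENCE * BYTES_PER_TOKEN
written = seq_bytes * CANDIDATES_PER_REQUEST   # same bytes, once per row
needed = seq_bytes                             # the sequence is identical

print(f"written per request: {written / 1e6:.0f} MB, "
      f"logically needed: {needed / 1e6:.3f} MB, "
      f"duplication factor: {written // needed}x")
```

Under these assumptions a single request writes ~32 MB of user-sequence bytes where 64 KB would suffice — a 500x duplication factor before any compression.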

Pattern

Sort the training dataset by (user_id, request_id) before writing columnar files. This sort order makes all rows for a single request physically adjacent in the file:

  • The columnar codec's run-length / dictionary / delta encoders see consecutive duplicate values on user-heavy columns and compress them automatically — no schema change, no application code change, no explicit deduplication logic.
  • Per-row-group statistics become tight (on user columns, distinct count collapses to 1 per group and min/max to a single value), which also helps skip-scan predicates.
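Why adjacency matters can be sketched with a toy run-length encoder — illustrative only; Parquet's real RLE/dictionary/delta encoders are more sophisticated, but the effect of sorting is the same:

```python
from itertools import groupby
import random

def rle(values):
    """Toy run-length encoding: list of (value, run_length) pairs."""
    return [(v, sum(1 for _ in g)) for v, g in groupby(values)]

# Toy dataset: 4 requests x 100 candidate rows each; the user-sequence
# value repeats within a request (stand-in for the real 16K-token column).
rows = [{"request_id": r, "user_seq": f"seq-{r}", "item": i}
        for r in range(4) for i in range(100)]

random.seed(0)
shuffled = random.sample(rows, len(rows))                  # writer's default order
sorted_rows = sorted(rows, key=lambda x: x["request_id"])  # the pattern

runs_shuffled = len(rle([r["user_seq"] for r in shuffled]))
runs_sorted = len(rle([r["user_seq"] for r in sorted_rows]))
print(runs_shuffled, runs_sorted)   # many runs vs. exactly 4
```

Sorted, the column collapses to one run per request; shuffled, the encoder sees hundreds of short runs over the same data — with no change to the rows themselves.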

At Pinterest, on Apache Iceberg tables backed by Parquet, this produces 10–50× storage compression on user-heavy feature columns (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication):

"By leveraging Apache Iceberg with user ID and request ID based sorting, we achieve 10–50x storage compression on user-heavy feature columns. When rows sharing the same request are physically co-located, columnar compression algorithms handle the deduplication automatically."

Load-bearing insight — let the codec do the work

The pattern doesn't explicitly deduplicate anything. The model and the training pipeline still see one row per (request, item, label) triple — the logical dedup happens at the encoded-bytes level, invisible to consumers. No application-level code needs to collapse rows before training; no join / group-by / materialised-view rewiring. The pattern is purely a write-time sort-order choice.

Downstream benefits (beyond raw storage)

Pinterest documents additional wins the sort order unlocks (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication):

  • Bucket joins: "matching keys are co-located, eliminating expensive shuffle operations."
  • Efficient backfills: "we can update only affected user segments rather than reprocessing entire datasets."
  • Incremental feature engineering: "adding new request-level features becomes a localized operation: we can append new columns to existing row groups without duplicating the entire dataset."
  • Stratified sampling: "request-sorted data enables user-level sampling, ensuring training datasets maintain proper diversity without over-representing highly active Pinners."

Trade-off — breaks IID

The sort order destroys row independence within training batches. Rows for the same user become adjacent, so a mini-batch drawn sequentially from the sorted file is highly correlated. This breaks two ML assumptions that most training infrastructure relies on:

  • Mini-batches are IID samples — batch-level statistics (e.g. batch normalization) computed over a near-single-user batch are biased, which is what fixes like SyncBatchNorm address.
  • In-batch negatives are valid negatives — in retrieval training, other rows in the batch may come from the same user or request and are not true negatives, which is what same-key masking addresses.

The pattern is therefore only viable when paired with the IID-disruption correctness fixes; naive adoption costs 1–2% in offline ranking metrics and up to a 30% false-negative rate in retrieval.
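A minimal sketch of the correlation problem, on toy data (user/request/item counts are illustrative assumptions): sequential mini-batches over the sorted rows collapse to a single user, while uniformly sampled batches do not.

```python
import random

# Toy rows as (user_id, request_id, item): 8 users x 4 requests x 32 items.
rows = [(u, r, item)
        for u in range(8) for r in range(4) for item in range(32)]
rows.sort()                      # the (user_id, request_id) sort order

BATCH = 64
sequential_batch = rows[:BATCH]                 # naive sequential reader
users_seq = {u for u, _, _ in sequential_batch}

random.seed(0)
shuffled_batch = random.sample(rows, BATCH)     # uniform sampling
users_shuf = {u for u, _, _ in shuffled_batch}

print(len(users_seq), len(users_shuf))   # 1 distinct user vs. many
```

Each toy user contributes 128 consecutive rows, so a sequential batch of 64 is entirely one user — exactly the correlation the IID fixes have to compensate for.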

Generalisations

The pattern applies whenever a high-cardinality key has heavy shared features across the rows it groups:

  • Query-sorted search logs — query text + embeddings are shared across all candidate results per query.
  • Session-sorted event logs — session context is shared across all events per session.
  • Job-sorted execution logs — job parameters are shared across all tasks per job.

In each case, the sort-key choice trades storage compression + bucket-join wins against the IID-disruption tax; the correctness-fix toolkit (SyncBatchNorm, same-key masking) applies identically.
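A minimal sketch of same-key masking, the retrieval-side fix named above. The function name and toy logits are illustrative assumptions; real trainers apply the mask on the logits tensor before the softmax loss:

```python
# Same-key masking for in-batch negatives: when a batch contains several
# rows from the same user, those rows must not be scored as negatives for
# each other. Pure-Python illustration of the masking rule.
NEG_INF = float("-inf")

def mask_same_key(logits, keys):
    """Set logits[i][j] to -inf when row j shares row i's key (i != j)."""
    masked = [row[:] for row in logits]
    for i, ki in enumerate(keys):
        for j, kj in enumerate(keys):
            if i != j and ki == kj:
                masked[i][j] = NEG_INF
    return masked

user_ids = ["u1", "u1", "u2"]        # correlated batch: u1 appears twice
logits = [[5.0, 4.0, 1.0],
          [4.0, 5.0, 1.0],
          [1.0, 1.0, 5.0]]
masked = mask_same_key(logits, user_ids)
print(masked[0])   # [5.0, -inf, 1.0] — u1's duplicate is no longer a negative
```

The diagonal (each row's own positive) is left untouched; only cross-row entries sharing a key are excluded, which is what keeps the false-negative rate down on user-sorted batches.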

Caveats

  • Write-side sort cost is not disclosed by Pinterest — the Iceberg write pipeline has to maintain sort order, and if writes are distributed, sorting involves a shuffle.
  • Compression ratio is workload-dependent — 10–50× is for Pinterest's user-heavy feature columns; the ratio collapses if user features don't dominate the row width.
  • Column-order matters — gains concentrate on columns that repeat within each (user, request) group; naïve mixed columns may need hive-style partitioning or clustering on top of the sort to see similar ratios.
  • Sort key choice is load-bearing — sort-by-user_id alone gives duplicate-run compression but loses per-request co-location (request-level features still scatter); Pinterest uses (user_id, request_id) precisely to get both.
  • Doesn't apply to tabular models without sequence features — if rows don't share heavy columns, there's nothing to compress beyond general-purpose gains.
