CONCEPT
Bucket join¶
Definition¶
A bucket join is a distributed join executed over tables whose rows are pre-partitioned and sorted by the join key into matching buckets (file groups, partitions). Because matching keys are physically co-located, the compute engine can join the data without a shuffle — each worker reads its locally-assigned buckets from both tables and performs the join in-place.
The alternative — a shuffle-based hash or sort-merge join — moves every row over the network to bring matching keys together. On large tables, shuffle is often the dominant cost (network I/O, spilling, temp storage).
The mechanic¶
Without bucket join (shuffle):
Table A rows ──(hash by key, redistribute)──┐
Table B rows ──(hash by key, redistribute)──┴── joined per-worker
^ network-heavy shuffle step
With bucket join:
Table A bucket_0 ──┐
Table B bucket_0 ──┴── joined on worker 0 (no shuffle)
Table A bucket_1 ──┐
Table B bucket_1 ──┴── joined on worker 1
...
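The mechanic above can be modelled in a few lines of plain Python. This is a toy single-process sketch (the bucket count and hash function are illustrative, not Pinterest's): both tables are pre-partitioned by the same function into the same number of buckets at write time, so each bucket pair joins independently and no row ever crosses a bucket boundary — the step that would otherwise be the shuffle.

```python
NUM_BUCKETS = 4

def bucket_of(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministic bucket assignment shared by both tables."""
    return hash(key) % num_buckets

def write_bucketed(rows):
    """Pre-partition (key, value) rows into buckets at write time."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, value in rows:
        buckets[bucket_of(key)].append((key, value))
    return buckets

def bucket_join(buckets_a, buckets_b):
    """Join bucket i of A against bucket i of B only -- the work one
    worker does locally; no cross-bucket data movement occurs."""
    out = []
    for bucket_a, bucket_b in zip(buckets_a, buckets_b):
        index = {}
        for key, value in bucket_a:
            index.setdefault(key, []).append(value)
        for key, value_b in bucket_b:
            for value_a in index.get(key, []):
                out.append((key, value_a, value_b))
    return out

a = write_bucketed([(1, "a1"), (2, "a2"), (3, "a3")])
b = write_bucketed([(2, "b2"), (3, "b3"), (4, "b4")])
print(sorted(bucket_join(a, b)))  # [(2, 'a2', 'b2'), (3, 'a3', 'b3')]
```

In a real engine the buckets are files (or file groups) on disk and each worker is a separate process, but the invariant is the same: matching keys land in matching buckets because both tables used the same bucketing function at write time.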
For the buckets to align, both tables must share:
- The same join key.
- The same partitioning / bucketing function.
- The same number of buckets (or one count an exact multiple of the other).
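The third requirement deserves a note: exact equality isn't strictly necessary, because with modular bucketing a coarse layout nests inside a finer one. A minimal sketch of the compatibility check (the function name and numbers are illustrative):

```python
def buckets_compatible(n_a: int, n_b: int) -> bool:
    """Bucket counts align when one divides the other."""
    return n_a % n_b == 0 or n_b % n_a == 0

# Why a divisor works: with bucket = hash(key) % n, if n_b divides n_a then
# (h % n_a) % n_b == h % n_b, so every fine bucket of A maps to exactly one
# coarse bucket of B -- files can still be paired without a shuffle.
h = 123_456
n_a, n_b = 32, 8
assert (h % n_a) % n_b == h % n_b

print(buckets_compatible(32, 8))   # True
print(buckets_compatible(32, 12))  # False
```

When the counts differ by a multiple, the engine pairs each coarse bucket with its corresponding group of fine buckets; a worker reads one coarse file and several fine files, still with no redistribution.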
Substrate¶
Bucket joins require a storage layer that preserves sort / bucket metadata:
- Apache Iceberg — table metadata records sort order + partition spec; compute engines use this to plan bucket joins.
- Parquet — row groups written in sorted order preserve the physical layout on disk.
- Apache Spark with bucketed tables — sorts + buckets on write; readers use the metadata to skip shuffle.
- Hive — the historical origin of the bucketed table vocabulary.
Why Pinterest canonicalises it on the recsys wiki¶
Pinterest's request-level deduplication work sorts training datasets by user ID + request ID. The primary win is 10–50× columnar compression (patterns/sort-by-request-id-for-columnar-compression). A secondary win is bucket joins:
"Bucket joins: Matching keys are co-located, eliminating expensive shuffle operations."
Pinterest's data pipelines regularly join ML training datasets against user-feature, request-log, and engagement tables. Every join on user_id or request_id against the sorted Iceberg dataset avoids the shuffle step. Over exabyte-scale ML feature pipelines, the aggregate compute and network savings are significant.
When it helps¶
- Repeated joins on the same key — amortise the one-time cost of sorting / bucketing across many join operations.
- Very large tables where shuffle is the bottleneck.
- Feature engineering pipelines that join the same entity tables repeatedly as new features are added.
When it hurts¶
- Write-side cost: maintaining sort order + bucket assignment on insert / upsert is non-trivial; single-row updates in Iceberg MERGE INTO can violate bucket layout.
- Asymmetric keys: if only one table is bucketed on the join key, the other still needs to shuffle.
- Skewed keys: a hot user / request ID makes one bucket dominate, defeating the parallelism win.
- Rebucketing cost: changing the bucket count is expensive.
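The skewed-keys failure mode is easy to demonstrate. A toy sketch (the counts and bucket total are made up for illustration): one hot key pins all of its rows to a single bucket, so one worker carries most of the join while the others idle.

```python
from collections import Counter

def bucket_sizes(keys, num_buckets):
    """Rows per bucket under hash(key) % num_buckets."""
    sizes = Counter(hash(k) % num_buckets for k in keys)
    return [sizes.get(i, 0) for i in range(num_buckets)]

# 10,000 rows for one hot user plus 100 rows each for 30 normal users.
keys = [999] * 10_000 + [u for u in range(30) for _ in range(100)]
sizes = bucket_sizes(keys, num_buckets=8)

# The hot user's bucket holds ~80% of all rows; runtime is governed by the
# largest bucket, not the average, so the parallelism win is gone.
print(max(sizes), sum(sizes) // len(sizes))  # 10300 1625
```

Engines typically fall back to salting the hot key or to a plain shuffle join for skewed inputs; the bucket layout itself can't fix skew because the assignment is a pure function of the key.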
Caveats¶
- Engine support varies. Spark, Trino, and Presto support bucket joins over Iceberg when a sort order is declared; not all engines do.
- Pinterest doesn't disclose the bucket count, spec details, or the specific engines (Spark / Ray) that exploit this.
- Not the same as a broadcast join. A broadcast join replicates a small table to every worker; a bucket join aligns two large tables via a shared pre-bucketed (and sorted) layout.
Seen in¶
- 2026-04-13 Pinterest — Scaling Recommendation Systems with Request-Level Deduplication (sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication) — bucket joins named as a downstream win of request-ID-sorted Iceberg training datasets, alongside columnar compression, efficient backfills, incremental feature engineering, and stratified sampling.
Related¶
- systems/apache-iceberg — the substrate that records sort / bucket metadata.
- systems/apache-parquet — the file format.
- systems/apache-spark — canonical engine that exploits bucket joins.
- patterns/sort-by-request-id-for-columnar-compression — the sibling win on the same sort order.
- concepts/request-level-deduplication — the optimisation programme that motivates the sort order.