
Bucket join

Definition

A bucket join is a distributed join executed over tables whose rows are pre-partitioned and sorted by the join key into matching buckets (file groups, partitions). Because matching keys are physically co-located, the compute engine can join the data without a shuffle — each worker reads its locally assigned buckets from both tables and performs the join in place.

The alternative — a shuffle-based hash or sort-merge join — moves every row over the network to bring matching keys together. On large tables, shuffle is often the dominant cost (network I/O, spilling, temp storage).

The mechanic

Without bucket join (shuffle):
  Table A rows ──(hash by key, redistribute)──┐
  Table B rows ──(hash by key, redistribute)──┴── joined per-worker
                 ^ network-heavy shuffle step

With bucket join:
  Table A bucket_0 ──┐
  Table B bucket_0 ──┴── joined on worker 0 (no shuffle)
  Table A bucket_1 ──┐
  Table B bucket_1 ──┴── joined on worker 1
  ...

For the buckets to align, both tables must share:

  • The same join key.
  • The same partitioning / bucketing function.
  • The same number of buckets (or a divisor/multiple).
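The alignment requirement can be illustrated in miniature. A minimal single-process sketch (plain Python standing in for a distributed engine; table contents, key positions, and bucket count are all made up for illustration):

```python
# Minimal sketch of a bucket join. In a real engine each bucket pair
# lives on (and is joined by) a different worker; here a dict of lists
# stands in for the physical bucket files.

N_BUCKETS = 4

def bucketize(rows, key_index):
    """Assign each row to a bucket. Both tables must use the same
    function and bucket count; hash() stands in for the engine's
    bucketing function (e.g. Murmur3 in Iceberg)."""
    buckets = {b: [] for b in range(N_BUCKETS)}
    for row in rows:
        buckets[hash(row[key_index]) % N_BUCKETS].append(row)
    return buckets

def bucket_join(buckets_a, buckets_b):
    """Join bucket i of A against bucket i of B only -- no row ever
    moves to another bucket, which is the 'no shuffle' win."""
    out = []
    for b in range(N_BUCKETS):
        rhs = {}
        for row in buckets_b[b]:
            rhs.setdefault(row[0], []).append(row)
        for row in buckets_a[b]:
            for match in rhs.get(row[0], []):
                out.append(row + match[1:])
    return out

# Hypothetical tables keyed by user_id.
users  = [(1, "ann"), (2, "bob"), (3, "cho")]
events = [(1, "click"), (3, "view"), (3, "click")]

joined = bucket_join(bucketize(users, 0), bucketize(events, 0))
```

Note that each iteration of the outer loop in `bucket_join` touches only one bucket pair, so the iterations are embarrassingly parallel — exactly what lets an engine assign one pair per worker.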

Substrate

Bucket joins require a storage layer that preserves sort / bucket metadata:

  • Apache Iceberg — table metadata records sort order + partition spec; compute engines use this to plan bucket joins.
  • Parquet — row groups written in sorted order persist the physical layout.
  • Apache Spark with bucketed tables — sorts + buckets on write; readers use the metadata to skip shuffle.
  • Hive — the historical origin of the bucketed table vocabulary.
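In Spark, the write-side half of this looks roughly as follows (a sketch, not Pinterest's pipeline; the table name, column, and bucket count are hypothetical):

```
# PySpark: persist a table bucketed + sorted on the join key, so later
# joins on user_id against a co-bucketed table can skip the shuffle.
(df.write
   .bucketBy(64, "user_id")
   .sortBy("user_id")
   .saveAsTable("user_features"))
```

This only pays off if the other side of the join is bucketed the same way, per the alignment rules above.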

Why Pinterest canonicalises it on the recsys wiki

Pinterest's request-level deduplication work sorts training datasets by user ID + request ID. The primary win is 10–50× columnar compression (see patterns/sort-by-request-id-for-columnar-compression). A secondary win is bucket joins:

"Bucket joins: Matching keys are co-located, eliminating expensive shuffle operations."

Pinterest's data pipelines regularly join ML training datasets against user-feature tables, request logs, and engagement tables. Every join on user_id or request_id against the sorted Iceberg dataset avoids a shuffle step. Across exabyte-scale ML feature pipelines, the aggregate compute and network savings are significant.

When it helps

  • Repeated joins on the same key — amortise the one-time cost of sorting / bucketing across many join operations.
  • Very large tables where shuffle is the bottleneck.
  • Feature engineering pipelines that join the same entity tables repeatedly as new features are added.

When it hurts

  • Write-side cost: maintaining sort order + bucket assignment on insert / upsert is non-trivial; single-row updates in Iceberg MERGE INTO can violate bucket layout.
  • Asymmetric keys: if only one table is bucketed on the join key, the other still needs to shuffle.
  • Skewed keys: a hot user / request ID makes one bucket dominate, defeating the parallelism win.
  • Rebucketing cost: changing the bucket count is expensive.
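The skewed-keys failure mode is easy to see in miniature (a hypothetical key distribution in which one hot user ID produces 90% of the rows):

```python
from collections import Counter

N_BUCKETS = 8

# Hypothetical event log: user 42 is a hot key with 900 of 1,000 rows;
# the remaining rows are spread over 100 distinct users.
rows = [42] * 900 + list(range(100))

bucket_sizes = Counter(hash(k) % N_BUCKETS for k in rows)
largest = max(bucket_sizes.values())

# The hot key's bucket holds the vast majority of rows, so the worker
# assigned to it does nearly all the join work while the rest sit idle.
print(largest, sum(bucket_sizes.values()))  # 913 1000
```

Bucketing parallelism is only as good as the key distribution: the slowest bucket sets the wall-clock time of the whole join.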

Caveats

  • Engine support varies. Spark, Trino, and Presto all support bucket joins over Iceberg when the sort order is declared; not all engines do.
  • Pinterest doesn't disclose the bucket count, spec details, or the specific engines (Spark / Ray) that exploit this.
  • Not the same as a broadcast join. Broadcast join replicates a small table to every worker; bucket join aligns two large tables by pre-sorting.
