
Bucket join

Definition

A bucket join is a distributed join executed over tables whose rows are pre-partitioned and sorted by the join key into matching buckets (file groups, partitions). Because matching keys are physically co-located, the compute engine can join the data without a shuffle — each worker reads its locally assigned buckets from both tables and performs the join in place.

The alternative — a shuffle-based hash or sort-merge join — moves every row over the network to bring matching keys together. On large tables, shuffle is often the dominant cost (network I/O, spilling, temp storage).

The mechanic

Without bucket join (shuffle):
  Table A rows ──(hash by key, redistribute)──┐
  Table B rows ──(hash by key, redistribute)──┴── joined per-worker
                 ^ network-heavy shuffle step

With bucket join:
  Table A bucket_0 ──┐
  Table B bucket_0 ──┴── joined on worker 0 (no shuffle)
  Table A bucket_1 ──┐
  Table B bucket_1 ──┴── joined on worker 1
  ...

For the buckets to align, both tables must share:

  • The same join key.
  • The same partitioning / bucketing function.
  • The same number of buckets (or a divisor/multiple).
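The alignment requirement can be illustrated in miniature. A minimal single-process sketch (plain Python standing in for a distributed engine; table contents, key positions, and bucket count are all made up for illustration):

```python
# Minimal sketch of a bucket join. In a real engine each bucket pair
# lives on (and is joined by) a different worker; here a dict of lists
# stands in for the physical bucket files.

N_BUCKETS = 4

def bucketize(rows, key_index):
    """Assign each row to a bucket. Both tables must use the same
    function and bucket count; hash() stands in for the engine's
    bucketing function (e.g. Murmur3 in Iceberg)."""
    buckets = {b: [] for b in range(N_BUCKETS)}
    for row in rows:
        buckets[hash(row[key_index]) % N_BUCKETS].append(row)
    return buckets

def bucket_join(buckets_a, buckets_b):
    """Join bucket i of A against bucket i of B only -- no row ever
    moves to another bucket, which is the 'no shuffle' win."""
    out = []
    for b in range(N_BUCKETS):
        rhs = {}
        for row in buckets_b[b]:
            rhs.setdefault(row[0], []).append(row)
        for row in buckets_a[b]:
            for match in rhs.get(row[0], []):
                out.append(row + match[1:])
    return out

# Hypothetical tables keyed by user_id.
users  = [(1, "ann"), (2, "bob"), (3, "cho")]
events = [(1, "click"), (3, "view"), (3, "click")]

joined = bucket_join(bucketize(users, 0), bucketize(events, 0))
```

Note that each iteration of the outer loop in `bucket_join` touches only one bucket pair, so the iterations are embarrassingly parallel — exactly what lets an engine assign one pair per worker.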

Substrate

Bucket joins require a storage layer that preserves sort / bucket metadata:

  • Apache Iceberg — table metadata records sort order + partition spec; compute engines use this to plan bucket joins.
  • Parquet — row groups written in sorted order persist the physical layout.
  • Apache Spark with bucketed tables — sorts + buckets on write; readers use the metadata to skip shuffle.
  • Hive — the historical origin of the bucketed table vocabulary.
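In Spark, the write-side half of this looks roughly as follows (a sketch, not Pinterest's pipeline; the table name, column, and bucket count are hypothetical):

```
# PySpark: persist a table bucketed + sorted on the join key, so later
# joins on user_id against a co-bucketed table can skip the shuffle.
(df.write
   .bucketBy(64, "user_id")
   .sortBy("user_id")
   .saveAsTable("user_features"))
```

This only pays off if the other side of the join is bucketed the same way, per the alignment rules above.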

Why Pinterest canonicalises it on the recsys wiki

Pinterest's request-level deduplication work sorts training datasets by user ID + request ID. The primary win is 10–50× columnar compression (see patterns/sort-by-request-id-for-columnar-compression). A secondary win is bucket joins:

"Bucket joins: Matching keys are co-located, eliminating expensive shuffle operations."

Pinterest's data pipelines regularly join ML training datasets against user-feature tables, request logs, and engagement tables. Every join on user_id or request_id against the sorted Iceberg dataset avoids a shuffle step. Across exabyte-scale ML feature pipelines, the aggregate compute and network savings are significant.

When it helps

  • Repeated joins on the same key — amortise the one-time cost of sorting / bucketing across many join operations.
  • Very large tables where shuffle is the bottleneck.
  • Feature engineering pipelines that join the same entity tables repeatedly as new features are added.

When it hurts

  • Write-side cost: maintaining sort order + bucket assignment on insert / upsert is non-trivial; single-row updates in Iceberg MERGE INTO can violate bucket layout.
  • Asymmetric keys: if only one table is bucketed on the join key, the other still needs to shuffle.
  • Skewed keys: a hot user / request ID makes one bucket dominate, defeating the parallelism win.
  • Rebucketing cost: changing the bucket count is expensive.
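The skewed-keys failure mode is easy to see in miniature (a hypothetical key distribution in which one hot user ID produces 90% of the rows):

```python
from collections import Counter

N_BUCKETS = 8

# Hypothetical event log: user 42 is a hot key with 900 of 1,000 rows;
# the remaining rows are spread over 100 distinct users.
rows = [42] * 900 + list(range(100))

bucket_sizes = Counter(hash(k) % N_BUCKETS for k in rows)
largest = max(bucket_sizes.values())

# The hot key's bucket holds the vast majority of rows, so the worker
# assigned to it does nearly all the join work while the rest sit idle.
print(largest, sum(bucket_sizes.values()))  # 913 1000
```

Bucketing parallelism is only as good as the key distribution: the slowest bucket sets the wall-clock time of the whole join.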

Caveats

  • Engine support varies. Spark, Trino, and Presto all support bucket joins over Iceberg when the sort order is declared; not all engines do.
  • Pinterest doesn't disclose the bucket count, spec details, or the specific engines (Spark / Ray) that exploit this.
  • Not the same as a broadcast join. Broadcast join replicates a small table to every worker; bucket join aligns two large tables by pre-sorting.
