PATTERN Cited by 1 source
Broadcast join for small reference tables¶
The broadcast join for small reference tables pattern is a Spark optimisation that eliminates expensive shuffle operations on multi-key joins with date ranges by broadcasting reference (dimension) tables under a measured size threshold to every worker instead of shuffling both sides of the join. The Octopus Energy MHHS rebuild discloses the threshold verbatim — under 500 MB — and frames it as one half of the "Join and partition tuning" category alongside liquid clustering. (Source: sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction)
Shape¶
default sort-merge join (rejected for small ref tables):
large fact table ──shuffle by join key──┐
▼
merge ─► output
▲
small ref table ──shuffle by join key──┘
(50–500 MB)
(full network round of small data,
expensive multi-key join with date range)
broadcast join (adopted):
large fact table ──no shuffle──────┐
▼
join ─► output
▲
small ref table ──broadcast───────┘
(50–500 MB) to every executor (one-time cost)
(no shuffle on either side)
When to apply¶
Apply when all are true:
- One side of the join is small enough to fit in executor memory after broadcasting (Octopus disclosure: <500 MB).
- The join key is multi-column or includes date-range predicates that make standard sort-merge expensive.
- The reference table is slowly-changing — broadcasting cost is amortised across many query runs.
- The fact table is large enough that a shuffle on both sides is significantly more expensive than broadcasting one side.
The verbatim Octopus framing¶
"Broadcast joins for reference tables under 500MB, eliminating expensive shuffle operations on complex multi-key joins with date ranges."
Two specifics worth holding:
- The 500 MB threshold is the team's measured choice, not a
universal Spark default. The default threshold is much lower
(
spark.sql.autoBroadcastJoinThreshold, typically 10 MB); Octopus raised it to 500 MB based on measured performance for their workload. Adjust to your own workload after measurement. - The expensive case is "complex multi-key joins with date ranges" — the cost saved is shuffle on both sides plus merge on the joined output. Pure single-key joins with simple equality may not benefit as much.
Steps¶
- Measure the reference-table size. Both raw size and the in-memory broadcast representation (which can be larger than the on-disk size for compressed columnar formats).
- Raise
spark.sql.autoBroadcastJoinThresholdto the workload-specific threshold (Octopus: 500 MB). This makes Spark automatically broadcast tables under the threshold. - Add explicit
broadcast()hints for borderline cases where AQE / cost-based optimiser may not choose broadcast even though measurement says it should. - Verify in the Spark UI — confirm the join shows as
BroadcastHashJoin, notSortMergeJoin, after the change. - Compose with AQE — AQE can dynamically switch from sort-merge to broadcast at runtime when one side's post-filter size falls under the broadcast threshold. Static hints + AQE are complementary, not redundant.
Why this beats sort-merge join for the cited workload¶
| Cost component | Sort-merge join | Broadcast join |
|---|---|---|
| Shuffle on large side | Yes (entire fact table) | No |
| Shuffle on small side | Yes (entire ref table) | No |
| Network round | Both sides shuffle to common keys | Small side broadcast once per worker |
| Sort cost | Both sides sorted | None |
| Merge cost | Sequential merge on sorted streams | Hash lookup on broadcast table |
For a multi-key join with date-range predicates, the sort cost on the large side is what dominates; broadcast eliminates it entirely.
Trade-offs¶
- Memory pressure on each executor is the cost. Broadcasting 500 MB to N executors uses 500 MB × N of broadcast variable memory. The 500 MB ceiling is a function of executor memory and concurrency.
- Stale broadcasts — if the reference table is updated frequently, the broadcast must be invalidated. Slowly-changing dimensions are the natural fit; rapidly-changing ones may thrash.
- Skew protection — broadcast joins eliminate shuffle skew on the join itself but don't help with downstream skew on the joined output.
- Doesn't compose with all join types. Broadcast hash join works for inner, left/right outer (with the broadcast side on the appropriate side), and semi/anti joins. Full outer joins have more constrained broadcast support.
Composition with related patterns¶
- systems/liquid-clustering — the other lever in the Octopus "join and partition tuning" category. Liquid clustering on the large side of the join + broadcast on the small side is the disclosed recipe for the multi-key joins with date ranges that drive the Octopus margin pipeline.
- systems/spark-aqe — runtime dynamic switch from sort-merge to broadcast when post-filter sizes shrink. Static threshold + AQE runtime decisions are complementary.
- concepts/remove-before-add-optimization — the broadcast threshold itself is a measurement output. If you raised it to 500 MB three years ago and the workload has shifted, re-measure before assuming it's still right.
Seen in¶
- sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction — canonical disclosure with explicit threshold (<500 MB) and the disclosed use case ("complex multi-key joins with date ranges") in the Octopus three-stream margin pipeline rebuild.
Related¶
- Patterns: patterns/grain-aligned-stream-split · patterns/cdf-incremental-replacing-full-rescan
- Systems: systems/apache-spark · systems/spark-aqe · systems/liquid-clustering · systems/octopus-margin-data-pipeline
- Concepts: concepts/remove-before-add-optimization
- Companies: companies/octopus-energy · companies/databricks