Skip to content

PATTERN Cited by 1 source

Broadcast join for small reference tables

The broadcast join for small reference tables pattern is a Spark optimisation that eliminates expensive shuffle operations on multi-key joins with date ranges by broadcasting reference (dimension) tables under a measured size threshold to every worker instead of shuffling both sides of the join. The Octopus Energy MHHS rebuild discloses the threshold verbatim — under 500 MB — and frames it as one half of the "Join and partition tuning" category alongside liquid clustering. (Source: sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction)

Shape

default sort-merge join (rejected for small ref tables):

   large fact table  ──shuffle by join key──┐
                                          merge ─► output
   small ref table   ──shuffle by join key──┘
   (50–500 MB)
   (full network round of small data,
    expensive multi-key join with date range)


broadcast join (adopted):

   large fact table  ──no shuffle──────┐
                                    join ─► output
   small ref table   ──broadcast───────┘
   (50–500 MB)         to every executor (one-time cost)
                       (no shuffle on either side)

When to apply

Apply when all are true:

  • One side of the join is small enough to fit in executor memory after broadcasting (Octopus disclosure: <500 MB).
  • The join key is multi-column or includes date-range predicates that make standard sort-merge expensive.
  • The reference table is slowly-changing — broadcasting cost is amortised across many query runs.
  • The fact table is large enough that a shuffle on both sides is significantly more expensive than broadcasting one side.

The verbatim Octopus framing

"Broadcast joins for reference tables under 500MB, eliminating expensive shuffle operations on complex multi-key joins with date ranges."

Two specifics worth holding:

  1. The 500 MB threshold is the team's measured choice, not a universal Spark default. The default threshold is much lower (spark.sql.autoBroadcastJoinThreshold, typically 10 MB); Octopus raised it to 500 MB based on measured performance for their workload. Adjust to your own workload after measurement.
  2. The expensive case is "complex multi-key joins with date ranges" — the cost saved is shuffle on both sides plus merge on the joined output. Pure single-key joins with simple equality may not benefit as much.

Steps

  1. Measure the reference-table size. Both raw size and the in-memory broadcast representation (which can be larger than the on-disk size for compressed columnar formats).
  2. Raise spark.sql.autoBroadcastJoinThreshold to the workload-specific threshold (Octopus: 500 MB). This makes Spark automatically broadcast tables under the threshold.
  3. Add explicit broadcast() hints for borderline cases where AQE / cost-based optimiser may not choose broadcast even though measurement says it should.
  4. Verify in the Spark UI — confirm the join shows as BroadcastHashJoin, not SortMergeJoin, after the change.
  5. Compose with AQE — AQE can dynamically switch from sort-merge to broadcast at runtime when one side's post-filter size falls under the broadcast threshold. Static hints + AQE are complementary, not redundant.

Why this beats sort-merge join for the cited workload

Cost component Sort-merge join Broadcast join
Shuffle on large side Yes (entire fact table) No
Shuffle on small side Yes (entire ref table) No
Network round Both sides shuffle to common keys Small side broadcast once per worker
Sort cost Both sides sorted None
Merge cost Sequential merge on sorted streams Hash lookup on broadcast table

For a multi-key join with date-range predicates, the sort cost on the large side is what dominates; broadcast eliminates it entirely.

Trade-offs

  • Memory pressure on each executor is the cost. Broadcasting 500 MB to N executors uses 500 MB × N of broadcast variable memory. The 500 MB ceiling is a function of executor memory and concurrency.
  • Stale broadcasts — if the reference table is updated frequently, the broadcast must be invalidated. Slowly-changing dimensions are the natural fit; rapidly-changing ones may thrash.
  • Skew protection — broadcast joins eliminate shuffle skew on the join itself but don't help with downstream skew on the joined output.
  • Doesn't compose with all join types. Broadcast hash join works for inner, left/right outer (with the broadcast side on the appropriate side), and semi/anti joins. Full outer joins have more constrained broadcast support.
  • systems/liquid-clustering — the other lever in the Octopus "join and partition tuning" category. Liquid clustering on the large side of the join + broadcast on the small side is the disclosed recipe for the multi-key joins with date ranges that drive the Octopus margin pipeline.
  • systems/spark-aqe — runtime dynamic switch from sort-merge to broadcast when post-filter sizes shrink. Static threshold + AQE runtime decisions are complementary.
  • concepts/remove-before-add-optimization — the broadcast threshold itself is a measurement output. If you raised it to 500 MB three years ago and the workload has shifted, re-measure before assuming it's still right.

Seen in

Last updated · 542 distilled / 1,571 read