Skip to content

SYSTEM Cited by 1 source

Spark Adaptive Query Execution (AQE)

Adaptive Query Execution (AQE) is the runtime query optimizer inside Apache Spark — the layer that re-plans a query while it is executing using the statistics that become available between stages (post-shuffle row counts, partition sizes, skew distributions, join input sizes), rather than relying exclusively on the static plan emitted by the upfront cost-based optimizer. Its three signature actions are dynamically coalescing shuffle partitions to merge tiny partitions after a shuffle, dynamically switching join strategies (sort-merge → broadcast) when a join input turns out to be small enough, and dynamically splitting skewed partitions when one shuffle key dominates.

For this wiki AQE matters less as an optimizer-internals topic and more as a system actor in an architectural decision — the case where deleting hand-tuned optimisation code is the right move because AQE has overtaken what the static plan + manual hints used to be the only way to express. (Source: sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction)

The architectural framing: trust the optimiser

The Octopus Energy MHHS rebuild applied four optimisation categories to its rebuilt margin data pipeline. The third category was framed verbatim as "Trusted the optimiser":

"In several cases, Spark's Adaptive Query Execution (AQE) outperformed hand-tuned logic. The team removed custom optimisation code and let AQE do its job."

The action item was deletion, not addition. The reasoning is that AQE has access to runtime statistics that the human writing optimisation code at design time does not:

  • Post-shuffle row counts are precise — the static planner only has table-level cardinality estimates, often stale.
  • Skew distributions can only be observed at runtime; static hints encode the architect's guess about which keys will be hot.
  • Join input sizes after upstream filtering can shrink dramatically vs the planner's pre-filter estimate, which is what determines whether the broadcast threshold is exceeded.

When the runtime optimiser has more information than the human-coded hint, removing the hint is the higher-quality move. This is the mechanism behind concepts/remove-before-add-optimization: each existing hint is a hypothesis that runtime stats may have outgrown.

What AQE does (capabilities relevant to the cited cases)

Capability What it does Replaces
Dynamic partition coalescing After a shuffle, merge many tiny partitions into appropriately-sized ones based on observed post-shuffle byte sizes Pre-shuffle partition count tuning, manual repartition() after wide ops
Dynamic join strategy switching Convert a planned sort-merge join to a broadcast join when one side's post-filter size falls under the broadcast threshold Manual broadcast() hints based on guessed sizes
Skew join handling Detect partitions with disproportionate row counts and split them; coordinate the other side accordingly Manual salting, custom shuffle logic

The Octopus rebuild explicitly retains a broadcast join pattern for reference tables under 500 MB alongside trusting AQE — the two aren't substitutes. The pattern is for cases where the architect knows the table is small at the architecture level (a slowly- changing reference dimension); AQE is for cases where the runtime size is the only reliable signal.

When trusting AQE is the right move (and when it isn't)

The article's hedge is precise: "in several cases" AQE outperformed hand-tuning, not always. The principle is measure first, then choose — but the default should be AQE, with hand-tuning earning its place via measurement.

Symptom Action
Existing custom optimisation code outperforms AQE on measurement Keep the code; document why
AQE matches or outperforms the custom code Delete the custom code; let AQE drive
No measurement either way Default to AQE; the burden of proof is on the hand-tuning

This inverts the historical Spark culture where elaborate hand-tuning was the mark of an experienced operator. The reframing: hand-tuning is a hypothesis with maintenance cost, and the optimiser improving over time means previously-justified hand-tuning can become net-negative without anyone noticing.

Composition with other Octopus rebuild moves

AQE's value compounds with the other levers in the Octopus pipeline:

  • Lineage simplification + early column/row pruning make the per-stage byte volumes that AQE sees smaller — which means more joins flip into broadcast strategy, and more partition coalescing is helpful.
  • Liquid clustering (systems/liquid-clustering) cuts the pre-shuffle skew that AQE has to handle in the first place — fewer skew-split events at runtime.
  • CDF-based incremental processing (patterns/cdf-incremental-replacing-full-rescan) further shrinks per-run byte volume — the same downstream join may now be small enough to broadcast every run.

The "trust AQE" move is multiplicative with each of these, not additive — every byte the upstream stages don't produce is a byte AQE doesn't have to optimise around.

Seen in

Last updated · 542 distilled / 1,571 read