SYSTEM Cited by 1 source
Spark Adaptive Query Execution (AQE)¶
Adaptive Query Execution (AQE) is the runtime query optimizer inside Apache Spark — the layer that re-plans a query while it is executing using the statistics that become available between stages (post-shuffle row counts, partition sizes, skew distributions, join input sizes), rather than relying exclusively on the static plan emitted by the upfront cost-based optimizer. Its three signature actions are dynamically coalescing shuffle partitions to merge tiny partitions after a shuffle, dynamically switching join strategies (sort-merge → broadcast) when a join input turns out to be small enough, and dynamically splitting skewed partitions when one shuffle key dominates.
For this wiki AQE matters less as an optimizer-internals topic and more as a system actor in an architectural decision — the case where deleting hand-tuned optimisation code is the right move because AQE has overtaken what the static plan + manual hints used to be the only way to express. (Source: sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction)
The architectural framing: trust the optimiser¶
The Octopus Energy MHHS rebuild applied four optimisation categories to its rebuilt margin data pipeline. The third category was framed verbatim as "Trusted the optimiser":
"In several cases, Spark's Adaptive Query Execution (AQE) outperformed hand-tuned logic. The team removed custom optimisation code and let AQE do its job."
The action item was deletion, not addition. The reasoning is that AQE has access to runtime statistics that the human writing optimisation code at design time does not:
- Post-shuffle row counts are precise — the static planner only has table-level cardinality estimates, often stale.
- Skew distributions can only be observed at runtime; static hints encode the architect's guess about which keys will be hot.
- Join input sizes after upstream filtering can shrink dramatically vs the planner's pre-filter estimate, which is what determines whether the broadcast threshold is exceeded.
When the runtime optimiser has more information than the human-coded hint, removing the hint is the higher-quality move. This is the mechanism behind concepts/remove-before-add-optimization: each existing hint is a hypothesis that runtime stats may have outgrown.
What AQE does (capabilities relevant to the cited cases)¶
| Capability | What it does | Replaces |
|---|---|---|
| Dynamic partition coalescing | After a shuffle, merge many tiny partitions into appropriately-sized ones based on observed post-shuffle byte sizes | Pre-shuffle partition count tuning, manual repartition() after wide ops |
| Dynamic join strategy switching | Convert a planned sort-merge join to a broadcast join when one side's post-filter size falls under the broadcast threshold | Manual broadcast() hints based on guessed sizes |
| Skew join handling | Detect partitions with disproportionate row counts and split them; coordinate the other side accordingly | Manual salting, custom shuffle logic |
The Octopus rebuild explicitly retains a broadcast join pattern for reference tables under 500 MB alongside trusting AQE — the two aren't substitutes. The pattern is for cases where the architect knows the table is small at the architecture level (a slowly- changing reference dimension); AQE is for cases where the runtime size is the only reliable signal.
When trusting AQE is the right move (and when it isn't)¶
The article's hedge is precise: "in several cases" AQE outperformed hand-tuning, not always. The principle is measure first, then choose — but the default should be AQE, with hand-tuning earning its place via measurement.
| Symptom | Action |
|---|---|
| Existing custom optimisation code outperforms AQE on measurement | Keep the code; document why |
| AQE matches or outperforms the custom code | Delete the custom code; let AQE drive |
| No measurement either way | Default to AQE; the burden of proof is on the hand-tuning |
This inverts the historical Spark culture where elaborate hand-tuning was the mark of an experienced operator. The reframing: hand-tuning is a hypothesis with maintenance cost, and the optimiser improving over time means previously-justified hand-tuning can become net-negative without anyone noticing.
Composition with other Octopus rebuild moves¶
AQE's value compounds with the other levers in the Octopus pipeline:
- Lineage simplification + early column/row pruning make the per-stage byte volumes that AQE sees smaller — which means more joins flip into broadcast strategy, and more partition coalescing is helpful.
- Liquid clustering (systems/liquid-clustering) cuts the pre-shuffle skew that AQE has to handle in the first place — fewer skew-split events at runtime.
- CDF-based incremental processing (patterns/cdf-incremental-replacing-full-rescan) further shrinks per-run byte volume — the same downstream join may now be small enough to broadcast every run.
The "trust AQE" move is multiplicative with each of these, not additive — every byte the upstream stages don't produce is a byte AQE doesn't have to optimise around.
Seen in¶
- sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction — canonical disclosure of trust-the-optimiser as an architectural move: the Octopus margin pipeline rebuild explicitly removed custom optimisation code because "in several cases, Spark's Adaptive Query Execution (AQE) outperformed hand-tuned logic". Paired with concepts/remove-before-add-optimization as the generalised principle.