Skip to content

PLANETSCALE 2024-07-22

Read original ↗

PlanetScale — Optimizing aggregation in the Vitess query planner

Summary

Andres Taylor (PlanetScale / Vitess core, 2024-07-22) publishes a bug retrospective on the VTGate query planner: a user-reported query was fetching massive result sets and triggering out-of-memory (OOM) errors because the planner couldn't push aggregation under a join when the query also carried an ORDER BY on the RHS of the join. The fix is not a new algorithm but a planner-phase reordering: delay the "push ordering under aggregation" rewriter until after the "split aggregation" phase has run, so that aggregation pushdown through the join happens first and ordering pushdown sees an already-decomposed plan. The post canonicalises two net-new wiki primitives — planner phase ordering (rewriters grouped into sequential phases; some only activate in later phases) and fixed-point tree rewriting (rewriters run repeatedly within a phase until no further changes occur) — and a meta-pattern, phase-gated planner rewriter (when two rewriters interfere, gate the blocking one behind a later phase rather than merging their logic). Shipped as vitessio/vitess #16278. Sequel disclosure to the 2022-06-24 Grouping and aggregations on Vitess post: same mechanism (patterns/aggregation-pushdown-under-join per Galindo-Legaria & Joshi 2001), new failure mode, same author, two years later.

The triggering query:

select sum(user.type)
from user
    join user_extra on user.team_id = user_extra.id
group by user_extra.id
order by user_extra.id

Taylor's framing: "The planner was unable to delegate aggregation to MySQL, leading to the fetching of a significant amount of data for aggregation." The problem is architectural — the planner's pushdown discipline (per the 2022 post, "we want to delegate as much work as possible to MySQL") is being violated by a specific rewriter interaction, not a missing algorithm.

The planner's phase model

The post canonicalises the planner's multi-phase structure explicitly for the first time on the wiki:

"During the planning phase, we perform extensive tree rewriting to push as much work down under Routes as possible. This involves repeatedly rewriting the tree until no further changes occur during a full pass of the tree, a state known as the fixed-point. The goal of this rewriting process is to optimize query execution by pushing operations closer to the data."

Two load-bearing primitives:

  1. Fixed-point rewriting within a phase: the rewriter loop runs until a full tree pass produces no changes. This is concepts/fixed-point-tree-rewriting — the convergence discipline that lets multiple orthogonal rewrites compose without manual orchestration.
  2. Sequential phases: "We have several phases that run sequentially. After completing a phase, we run the push-down rewriters, then move to the next phase, and so on." Each phase has its own active rewriter set; the phase order determines which rewrites fire first.

Rewriters themselves perform one of two roles:

"Running a rewriter over the plan to perform a specific task. For example, the 'pull DISTINCT from UNION' rewriter extracts the DISTINCT part from UNION and uses a separate operator for it." "Controlling when push-down rewriters are enabled. Some rewriters only turn on after reaching a certain phase."

The second kind is the gate — a rewriter whose job is to enable / disable another rewriter based on phase state. This is the control-flow mechanism the fix exploits.

The bug: ordering blocks aggregation pushdown

The initial plan (post-horizon-expansion) for the triggering query:

Ordering (user_extra.id)
└── Aggregator (ORG sum(user.type), user_extra.id group by user_extra.id)
    └── ApplyJoin on [user.team_id | :user_team_id = user_extra.id | user.team_id = user_extra.id]
        ├── Route (Scatter on user)
        │   └── Table (user.user)
        └── Route (Scatter on user)
            └── Filter (:user_team_id = user_extra.id)
                └── Table (user.user_extra)

The planner can't push aggregation through the join in the initial phase (the load-bearing constraint quoted verbatim: "We don't split aggregation between MySQL and VTGate in the initial phases, so we couldn't immediately push down the aggregation through the join"). But the "push ordering under aggregation" rewriter does fire, producing:

Aggregator (ORG sum(user.type), user_extra.id group by user_extra.id)
└── Ordering (user_extra.id)
    └── ApplyJoin on user.team_id = user_extra.id
    ...

Now Ordering sits between Aggregator and ApplyJoin. And — the critical move: "We can't push the ordering further down since it's sorted by the right hand side of the join. Ordering can only be pushed down to the left hand side."

The consequence: Ordering is now wedged between the aggregation and the join, and won't move. When the later "split aggregation" phase runs, the aggregation would like to push its local halves under each side of the join — but an Ordering node blocks the tree-rewrite path. Taylor: "ordering is blocking the aggregator from being pushed down, which means we have to fetch all that data, and sort it to do the aggregation."

The result: scatter-fetch every row from both sides of the join to VTGate, sort in VTGate, aggregate in VTGate. For a large table, this OOMs VTGate.

The fix: delay the ordering rewriter by one phase

Taylor's framing of the solution verbatim:

"The solution I typically use in these situations involves leveraging the phases we have in the planner. … By delaying the 'ordering under aggregation' rewriter until the 'split aggregation' phase, we can push down the aggregation under the join. This doesn't stop the 'ordering under aggregation' rewriter from doing its job, it just has to wait a bit before doing it."

The fix is not a new rewriter and not a modification to the existing rewriter's logic. It's a phase gate — move the "ordering under aggregation" rewriter from the initial phase to the "split aggregation" phase. This is the canonical instance of patterns/phase-gated-planner-rewriter.

With the fix, the final plan becomes:

Aggregator (sum(user.type) group by user_extra.col)
└── Projection (sum(user.type) * count(*), user_extra.col)
    └── Ordering (user_extra.col)
        └── ApplyJoin (on [user.team_id = user_extra.id])
            ├── Route (Scatter on user)
            │   └── Aggregator (sum(type) group by team_id)
            │       └── Table (user)
            └── Route (Scatter on user_extra)
                └── Aggregator (count(*) group by user_extra.col)
                    └── Filter (:user_team_id = user_extra.id)
                        └── Table (user_extra)

Taylor: "Most of the aggregation has been pushed down to MySQL, and at the VTGate level, we are left with only SUMming the SUMs we get from each shard." Every shard does its own local aggregation; VTGate does the global sum. The Projection (sum × count(*)) is the Galindo-Legaria & Joshi multiplier from the patterns/aggregation-pushdown-under-join pattern — same mechanism as the 2022 post, now actually reachable because the phase gate cleared the blocker.

Key takeaways

  1. Planner bugs at scale are phase-interaction bugs, not algorithm bugs. The Galindo-Legaria & Joshi algorithm was correctly implemented in 2022; this 2024 incident is purely about rewriter phase ordering — when the blocker fires relative to the enabler. Canonical new framing: complex planners are not flat rewriter soup, they are phase-ordered fixed-point loops, and bugs live at the phase boundaries (Source: sources/2026-04-21-planetscale-optimizing-aggregation-in-the-vitess-query-planner).

  2. Phase gates are the cheap escape hatch for rewriter interference. Rather than making the two rewriters aware of each other (a deep invasive fix) or rewriting the algorithm (an even deeper one), the patch moves one rewriter to a later phase. "This doesn't stop the 'ordering under aggregation' rewriter from doing its job, it just has to wait a bit before doing it." Canonical patterns/phase-gated-planner-rewriter pattern. (Source: sources/2026-04-21-planetscale-optimizing-aggregation-in-the-vitess-query-planner)

  3. Fixed-point rewriting within a phase + sequential phases is a specific, load-bearing planner architecture. Each phase has its own active rewriter set and runs to convergence before handing off. (Source: sources/2026-04-21-planetscale-optimizing-aggregation-in-the-vitess-query-planner)

  4. Some rewriters don't transform, they gate. "Controlling when push-down rewriters are enabled. Some rewriters only turn on after reaching a certain phase." Canonical new framing of planner-as-control-flow-graph. (Source: sources/2026-04-21-planetscale-optimizing-aggregation-in-the-vitess-query-planner)

  5. OOM at VTGate is the failure mode for a planning miss. When the planner can't push aggregation down, VTGate has to pull every row from every shard into memory to aggregate. "[The query] was causing VTGate to fetch a large amount of data, sometimes resulting in an Out Of Memory (OOM) error." Canonical wiki disclosure of the OOM-at-VTGate failure mode and its direct causal link to pushdown failures. (Source: sources/2026-04-21-planetscale-optimizing-aggregation-in-the-vitess-query-planner)

  6. ORDER BY on the RHS of a join is an irreducible pushdown blocker. "Ordering can only be pushed down to the left hand side." Architectural asymmetry of the nested-loop / apply-join shape — only LHS state survives the iteration. This is the structural reason the bug existed. (Source: sources/2026-04-21-planetscale-optimizing-aggregation-in-the-vitess-query-planner)

  7. Two-year arc from algorithm disclosure to refinement incident. 2022-06-24 Taylor post = canonical algorithm adoption (Galindo-Legaria & Joshi 2001 → Vitess #9643). 2024-07-22 Taylor post = production bug in the algorithm's interaction with peer rewriters → phase-gate fix (Vitess #16278). Same author, same mechanism, two years apart — a clean retrospective on a planner feature at production scale.

Operational numbers

  • Vitess PR #16278 — the fix shipped.
  • OOM — specific failure mode on VTGate when aggregation can't push down and result-set materialisation exceeds memory.
  • Two sequential planner phases named in the post: an initial phase and a split aggregation phase. The post implies more phases exist ("We have several phases that run sequentially") but doesn't enumerate them.

Caveats

  • No benchmark numbers published. The post narrates the failure mode qualitatively ("fetching of a significant amount of data for aggregation") without quantifying pre-fix row count, memory consumption, or post-fix improvement. Unlike the 2024-08 Dicken IOPS post or the 2024-09 Noach Instant-Deploy post, this one stays strictly qualitative.
  • Phase enumeration incomplete. Two phases named (initial, split aggregation); the full phase list is not disclosed. This leaves open whether other pushdown-blocking interactions exist that the post didn't surface.
  • No algorithmic depth beyond the 2022 post. The Galindo-Legaria & Joshi multiplier mechanism (Projection(sum × count(*))) is shown in the final plan without re-derivation — the post is a sequel that assumes patterns/aggregation-pushdown-under-join as canonical substrate.
  • PR link provided; internal review / test-case discussion absent. The post links to vitessio/vitess #16278 "for more details on the implementation" but doesn't walk the code-level change (which phase enum value moved where).
  • Query shape specificity. The bug triggers on a specific shape: JOIN + GROUP BY + ORDER BY where ORDER BY is on the RHS of the join. Generalisation to other rewriter interactions is implied by the meta-framing but not demonstrated with additional cases.
  • No disclosure of customer incident severity / blast radius. "A user reported" is all we get — no indication of whether this was a production-impacting incident, an internal fleet-wide retrofit, or a single-customer edge case.
  • Short-form post. Body length is ~1,200 words — substantially shorter than the 2022 sibling. The sequel character is explicit ("For a deeper understanding of grouping and aggregations on Vitess, I recommend reading this prior blog post") and the post relies on that back-reference for the full mechanism.

Systems extracted

  • systems/vitess — the planner itself
  • systems/mysql — the shard-local executor the planner is optimising to delegate to
  • systems/vitess-evalengine — the VTGate-side expression evaluator that runs the Projection(sum × count(*)) arithmetic in the fixed plan

Concepts extracted

Patterns extracted

Source

Last updated · 378 distilled / 1,213 read