CONCEPT

Push aggregation under join¶

Definition¶

Pushing aggregation under a join is a query-planner rewrite that moves a GROUP BY + aggregate operator from above a join (where it sees the cartesian / nested-loop product of both sides) to below one or both sides of the join (where it sees only that side's rows). The rewrite preserves algebraic correctness by carrying an explicit multiplier column (typically count(*) on the pre-aggregated side) that reconstructs the collapsed row counts at post-join projection time.

The motivation: joins are expensive (especially cross-shard nested-loop joins that issue one RHS query per LHS row). Aggregation before the join shrinks the LHS row count by the group-collapsing factor — sometimes by 1000× or more — so the join issues 1000× fewer RHS queries.

The foundational paper¶

Cesar A. Galindo-Legaria and Milind M. Joshi, Orthogonal Optimization of Subqueries and Aggregation, SIGMOD 2001, Microsoft Research. The paper generalises a family of rewrites for moving aggregation around joins, subqueries, and correlated predicates while maintaining algebraic equivalence. The canonical insight: "in some cases it's preferable to do aggregation before performing joins — this could save on how much work the join operator had to do and so lowered the total cost of the plan."

Vitess's implementation (vitessio/vitess #9643) adapts the paper's algorithm to the specific case of cross-shard nested-loop joins at VTGate.

The algebraic trick¶

Consider: SELECT order.office, sum(order_line.amount) FROM order JOIN order_line ON order.id = order_line.order_id GROUP BY order.office.

Naive plan: Join first. Produces one row per matched (order, order_line) pair. Then SUM + GROUP BY office at the top.
Pushed plan: Pre-aggregate LHS (order) to (office, id, count(*)) — collapsing multiple rows per (office, id) into one row plus a count. Issue the join on the pre-aggregated result. Collect RHS SUM(amount). Then at post-join Project, multiply SUM(amount) * count(*) to restore the row-count factor. Then SUM + GROUP BY office at the top.

The count(*) column is the key. Without it, pre-aggregating LHS silently produces wrong answers for SUM on the RHS because the join's row-matching implicitly counted multiplicity that the pre-aggregation erased. With it, post-join Project(SUM(amount) * count(*)) reconstructs the correct per-group contribution.

The Vitess worked plan¶

From Andres Taylor's 2022-06-24 post (Source: ), reading the query back-to-front:

LHS Route: scatter SELECT order.office, order.id, count(*) FROM order GROUP BY order.office, order.id to every shard of the order keyspace. Each shard returns pre-aggregated rows.
NestedLoopJoin: for each LHS row, bind order_line.order_id = ? and issue the RHS query.
RHS Route: single-shard or scatter (depending on sharding) SELECT sum(order_line.amount) FROM order_line WHERE order_line.order_id = ?.
Project: sum(order_line.amount) * count(*) using VTGate's evalengine.
GroupBy: SUM(sum(order_line.amount) * count(*)) grouped by order.office at VTGate — the global aggregation step.

Every node is a concrete Vitess executor primitive; no new infrastructure was needed beyond the planner rewrite rule.

When it's worth it¶

The win is proportional to the group-collapsing factor of the LHS pre-aggregation:

High win: GROUP BY is on a low-cardinality column (e.g. office with 100 distinct values) and LHS has 1M orders. Pre-aggregation collapses 1M rows into ~1000 (office, id) groups (assuming each office has many orders with distinct order.id). NestedLoopJoin issues 1000 RHS queries instead of 1M.
Zero win: GROUP BY is on a high-cardinality column (e.g. order.id itself, which is unique per row). Pre-aggregation collapses nothing. NestedLoopJoin still issues 1M RHS queries. The rewrite is pure overhead (one extra aggregation step).
Negative win: if the LHS's GROUP BY key has higher cardinality than raw rows (impossible structurally), pre-aggregation would cost more than it saves. Not a real case.

The query planner needs either (a) table statistics to estimate the group-collapsing factor, (b) heuristic rules (always apply if GROUP BY includes a non-PK column), or (c) rule-based rewriting with cost-model verification. Vitess's approach is not explicitly disclosed in the post.

Relationship to local-global aggregation¶

concepts/local-global-aggregation-decomposition is the foundational primitive: every aggregate splits into a per-shard local + coordinator global. This concept is the extension when joins are present: when an aggregate is computed over a join whose two sides can't be colocated, the local-global split has to be applied before the join on each side, with the multiplier column bookkeeping to restore algebraic equivalence.

The composition is: local aggregation on LHS + local aggregation on RHS + nested-loop join + Project (multiplier arithmetic) + global aggregation — five operators, all of which are VTGate / MySQL primitives.

Generalisations¶

The paper generalises far beyond the two-way-NL-join case Vitess's post illustrates. Related rewrites:

Eager group-by + lazy group-by (Yan & Larson, 1994) — moving group-by up or down a join tree based on functional dependencies.
Semi-join reduction — if the outer side only needs RHS existence, an outer-side group-by + semi-join can replace the full join.
Subquery decorrelation — the paper's other major topic; rewrites correlated subqueries as joins that can be optimised with the same group-by-pushdown machinery.

Vitess has implemented the NL-join case; the post doesn't discuss whether/how Vitess handles the other rewrites in the paper family.

Why cross-shard systems benefit more than single-node systems¶

The cost of issuing one RHS query in a single-node NL join is microseconds (cache hit on a B-tree). The cost of issuing one cross-shard RHS query is milliseconds (network hop + query parse + plan + execute + return). The group-collapsing factor's impact is therefore ~1000× more consequential in cross-shard NL joins than in single-node NL joins. This is the specific reason a 20-year-old SIGMOD algorithm was worth re-harvesting for a 2022 Vitess production implementation: the economic regime is different enough that the rewrite rule's value moved from "marginal" to "load-bearing."

Seen in¶

— canonical wiki introduction. Andres Taylor (PlanetScale / Vitess core, 2022-06-24) describes Vitess's planner extension to handle GROUP BY + aggregation under cross-shard joins, crediting Galindo-Legaria & Joshi's SIGMOD 2001 paper. Shipped as vitessio/vitess #9643.
— production-incident retrospective on the same rewrite. Andres Taylor (2024-07-22) narrates an OOM in VTGate caused by a peer rewriter ("push ordering under aggregation") firing before this pushdown could happen, wedging Ordering between Aggregator and ApplyJoin. The pushdown rewrite itself is unchanged; the fix is a phase gate — delay the ordering rewriter until the split-aggregation phase so this pushdown can fire first. Canonical wiki disclosure of the OOM failure mode when this rewrite is reachable but blocked by peer-rewriter interference. Shipped as vitessio/vitess #16278.

concepts/local-global-aggregation-decomposition — the foundational primitive this concept extends.
concepts/nested-loop-join — the specific VTGate join shape where the rewrite pays off.
concepts/cross-shard-query — the regime where the rewrite moves from marginal to load-bearing.
concepts/vtgate-query-planner — the Vitess component that implements the rewrite.
patterns/aggregation-pushdown-under-join — the canonical production-side pattern.
patterns/local-global-aggregation-split — the simpler foundational pattern this extends.
systems/vitess
systems/vitess-evalengine — evaluates the Project operator's multiplier arithmetic.
systems/mysql
concepts/planner-phase-ordering
concepts/fixed-point-tree-rewriting
patterns/phase-gated-planner-rewriter