
PATTERN

Aggregation pushdown under join

Intent

When a query has GROUP BY + aggregate operators above a cross-shard join that cannot be pushed down, rewrite the plan so that each side of the join is pre-aggregated before the join runs. Carry an explicit multiplier column (count(*)) to preserve algebraic equivalence, and reapply the multiplication at post-join Project time. The pre-aggregation's group-collapsing factor directly reduces the join's inner-loop query count.

Implements the algorithm from Cesar A. Galindo-Legaria and Milind M. Joshi, Orthogonal Optimization of Subqueries and Aggregation, SIGMOD 2001 (Microsoft). Shipped in Vitess as vitessio/vitess #9643.

Context

Sharded databases frequently have to execute cross-shard joins in the proxy tier when the two tables being joined don't share a shard key. The canonical join shape in this regime is a nested-loop join — for each LHS row, issue one RHS query over the network. This makes the join's cost proportional to LHS cardinality × per-query RTT.

Example (Source: sources/2026-04-21-planetscale-grouping-and-aggregations-on-vitess):

  • order sharded by id, order_line sharded by its own id (not order_id).
  • Query: SELECT order.office, sum(order_line.amount) FROM order JOIN order_line ON order.id = order_line.order_id GROUP BY order.office.

The join cannot be pushed down. With 1M orders, naive NL join issues 1M RHS queries. If the final query is grouped by office (100 distinct values), the output is only 100 rows — suggesting most of the join work is wasted.
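A back-of-envelope sketch of that naive cost, in Python. The per-query RTT figure is an assumption for illustration, not a number from the source; only the 1M-row count comes from the example above.

```python
# Back-of-envelope cost of the naive cross-shard nested-loop join.
# lhs_rows is from the example above; rtt_ms is an assumed per-query latency.
lhs_rows = 1_000_000          # orders on the LHS
rtt_ms = 0.5                  # hypothetical network round-trip per RHS query

naive_cost_s = lhs_rows * rtt_ms / 1000  # seconds spent on RTT alone
print(f"naive NL join: {lhs_rows:,} RHS queries, ~{naive_cost_s:,.0f}s of RTT alone")
```

Even with a generous half-millisecond RTT, the naive plan spends minutes of wall-clock time on network round trips to produce 100 output rows.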

Solution

Rewrite the plan so that LHS is pre-aggregated by (GROUP BY key, join key) — in the example, (office, order.id) — and carries a count(*) column tracking the group's row count. The plan becomes:

  1. LHS Route: scatter SELECT order.office, order.id, count(*) FROM order GROUP BY order.office, order.id → each shard aggregates locally, returns one row per (office, id) pair.
  2. NestedLoopJoin: for each LHS row, bind order_line.order_id = ? and issue RHS query.
  3. RHS Route: SELECT sum(order_line.amount) FROM order_line WHERE order_line.order_id = ?.
  4. Project: sum(order_line.amount) * count(*) using evalengine — the algebraic correction that restores row-count multiplicity lost in LHS pre-aggregation.
  5. GroupBy: SUM(sum(order_line.amount) * count(*)) GROUP BY office at VTGate — the global aggregation step.
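The five steps above can be sketched as a toy in-memory pipeline. The table contents are hypothetical; only the operator sequence (pre-aggregate, nested-loop join, project with the count multiplier, global group-by) mirrors the plan in the text.

```python
from collections import defaultdict

# Hypothetical table contents — the operator pipeline is what matters.
orders = [  # (id, office)
    (1, "a"), (2, "a"), (3, "b"),
]
order_lines = [  # (order_id, amount)
    (1, 5), (1, 3), (2, 10), (3, 7),
]

# Step 1: LHS Route — pre-aggregate by (office, id), carrying count(*).
lhs = defaultdict(int)
for oid, office in orders:
    lhs[(office, oid)] += 1

# Steps 2-3: NestedLoopJoin + RHS Route — one RHS "query" per LHS group.
def rhs_sum(order_id):
    return sum(amount for oid, amount in order_lines if oid == order_id)

# Step 4: Project — sum(order_line.amount) * count(*) restores multiplicity.
projected = [(office, rhs_sum(oid) * cnt) for (office, oid), cnt in lhs.items()]

# Step 5: GroupBy — global SUM per office at the "VTGate" layer.
result = defaultdict(int)
for office, value in projected:
    result[office] += value

print(dict(result))  # one row per office
```

With these toy tables every order id is unique, so the counts are all 1; the collapse only pays off when the join key repeats on the LHS, which is exactly the group-collapsing factor discussed below.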

Taylor's worked example output:

  LHS row (office, id, count)   RHS sum(amount)   Project (sum × count)
  (1, 1, 2)                     5                 10
  (1, 1, 2)                     3                 6
  (2, 2, 3)                     10                30
  (2, 2, 3)                     7                 21

Top GroupBy:

  office   SUM(sum × count)
  1        16
  2        51

This is the same result that a naive post-join aggregation would produce — but with the expensive join step operating on ~1000× fewer LHS rows for realistic workloads.

The algebraic trick

The count(*) multiplier is load-bearing. Without it, pre-aggregating LHS erases the multiplicity that the join's row-for-row matching implicitly counted. The Project's sum × count multiplication reconstructs the per-group contribution that would have come from expanding the pre-aggregated rows back out before the join.

More formally: if LHS has rows (g1, j1), (g1, j1), (g1, j1) (three occurrences, same (group-key, join-key)), and RHS matching j1 has aggregate value S, the naive post-join plan sees three copies of S and sums to 3S. The pre-aggregated plan sees one LHS row (g1, j1, count=3) and one RHS value S; the Project computes S × 3 = 3S. Algebraic equivalence.
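The identity in the paragraph above can be checked directly. The aggregate value S is arbitrary; the point is only that three joined copies of S and one S × count=3 projection sum to the same thing.

```python
# Equivalence check: duplicate LHS rows (g1, j1) vs. one pre-aggregated row.
S = 42  # RHS aggregate value matching join key j1 (arbitrary)

naive = sum(S for _ in range(3))  # post-join plan sees three copies of S
pre_aggregated = S * 3            # Project computes S × count(*)

assert naive == pre_aggregated == 3 * S
```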

Forces

  • Cross-shard NL joins are expensive: every LHS row triggers one RHS network query.
  • Pre-aggregation is cheap per shard: MySQL aggregates locally much faster than VTGate can join.
  • The count(*) column carries exactly the information lost by pre-aggregation: one scalar per group.
  • Planner rewrite is algebraic, not heuristic: the rewrite always preserves correctness; the question is only whether it's cheaper.
  • The win depends on group-collapsing factor: original LHS rows / pre-aggregated LHS rows. Low group-cardinality → big win; high group-cardinality → overhead.
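The last force — apply only when the group-collapsing factor is large — suggests a planner gate along these lines. This is a hypothetical heuristic with illustrative names and threshold, not Vitess's actual planner logic.

```python
# Hypothetical planner gate: rewrite only when pre-aggregation shrinks the
# LHS enough to pay for itself. Names and threshold are illustrative only.
def should_push_aggregation(lhs_rows: int, distinct_groups: int,
                            min_factor: float = 10.0) -> bool:
    """True when the estimated group-collapsing factor clears the threshold."""
    if distinct_groups == 0:
        return False
    return lhs_rows / distinct_groups >= min_factor

print(should_push_aggregation(1_000_000, 1_000))    # big collapse -> True
print(should_push_aggregation(1_000_000, 900_000))  # groups ≈ rows -> False
```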

Canonical instance

  • Vitess VTGate planner: vitessio/vitess #9643, shipped in a 2022 Vitess release. Andres Taylor's 2022-06-24 post is the canonical wiki disclosure. This is the production instantiation of Galindo-Legaria & Joshi's 2001 algorithm — 21 years from SIGMOD paper to Vitess production.

Consequences

  • + Cross-shard NL join cost drops by the group-collapsing factor: 1M LHS rows collapsed to 1000 groups → 1000× fewer RHS queries.
  • + Reuses existing Vitess operators: every node (Route, NestedLoopJoin, Project, GroupBy) is a pre-existing VTGate executor primitive. Only the planner rewrite was new work.
  • + Composes with local-global aggregation: each shard's local GROUP BY for the LHS pre-aggregation is itself a local-global application.
  • − Planner decision requires cost model or heuristics: if group-cardinality ≈ row-cardinality, the rewrite is pure overhead. The planner needs statistics or rules to decide when to apply.
  • − Only helps when aggregation is above a non-pushdown join: if the join can be pushed down (shard keys align), this whole rewrite is unnecessary. If there's no join, the simpler patterns/local-global-aggregation-split pattern suffices.
  • − AVG and COUNT(DISTINCT) need care: AVG decomposition carries both sum and count; COUNT(DISTINCT) requires either set-ship (expensive) or sketch-merge (approximate).
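The AVG caveat is worth making concrete: merging per-shard (sum, count) pairs gives the exact global average, while averaging per-shard averages does not. The per-shard numbers below are hypothetical.

```python
# Why AVG decomposition must carry both sum and count.
shards = [(10.0, 2), (30.0, 3)]  # hypothetical per-shard (sum, count) pairs

total_sum = sum(s for s, _ in shards)
total_cnt = sum(c for _, c in shards)
exact_avg = total_sum / total_cnt  # 40 / 5 = 8.0 — correct global AVG

# Naively averaging the shard averages weights each shard equally — wrong.
avg_of_avgs = sum(s / c for s, c in shards) / len(shards)  # (5.0 + 10.0)/2 = 7.5

print(exact_avg, avg_of_avgs)
```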

Extensions

  • Pushdown under N-way joins: iteratively apply to each side of a join tree.
  • Pushdown through subqueries: the primary focus of Galindo-Legaria & Joshi's paper; Vitess hasn't publicly disclosed the implementation status of subquery-form pushdown.
  • Cost-based planner application: use table statistics to estimate group-collapsing factor and decide whether to apply.

Seen in

  • sources/2026-04-21-planetscale-grouping-and-aggregations-on-vitess — canonical wiki introduction. Andres Taylor walks through the rewrite step-by-step for the order × order_line worked example, crediting Galindo-Legaria & Joshi's SIGMOD 2001 paper. Closes with the meta-claim: "This experience is one I've had many times in the past. Someone out there has done a ton of work on something closely related to what we are doing, and all we have to do is adapt the algorithm to our circumstances."