
PLANETSCALE 2022-06-24


PlanetScale — Grouping and aggregations on Vitess

Summary

Andres Taylor (PlanetScale / Vitess core, 2022-06-24) describes how Vitess's VTGate query planner was extended to handle GROUP BY + aggregations in queries that include joins whose data cannot be pushed to a single MySQL shard. The load-bearing framing: split every aggregate into a local part (evaluated per-shard by MySQL) and a global part (evaluated by VTGate on the merged results) — and then, crucially, use Galindo-Legaria and Joshi's Orthogonal Optimization of Subqueries and Aggregation (Microsoft, 2001) algorithm to push the local aggregation under the nested-loop join, so that each side of the join pre-aggregates before the expensive VTGate join step runs. Shipped as vitessio/vitess #9643. The story's meta-claim is about research-to-production: keeping up with academia is cost-effective because "Someone out there has done a ton of work on something closely related to what we are doing, and all we have to do is adapt the algorithm to our circumstances."

Taylor's own framing verbatim:

"Vitess is not just a dumb proxy layer though — it can also run some of the operations instead of sending them on. We want to delegate as much work as possible to MySQL — it is much faster than Vitess at doing all the database operations, including grouping and aggregation. When possible, we want work to be done there. While planning a query, the planner tries pushing as much work down to MySQL as possible."

The canonical simple case was already solved before this project: SELECT count(*) FROM user on a sharded user table becomes a per-shard count(*) (local) plus a SUM of those counts at VTGate (global). The new work extends this split to queries with cross-shard joins — the case Vitess had to execute at VTGate regardless, because the shard key of one side didn't align with the shard key of the other.
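The baseline split can be sketched in a few lines. This is illustrative Python, not Vitess code; the shard contents are made up:

```python
# Local/global split for SELECT count(*) FROM user on a sharded table:
# each shard answers a local count(*), and the coordinator (VTGate's role)
# SUMs the per-shard results.

def shard_count(rows):
    """Local part: what each MySQL shard computes for count(*)."""
    return len(rows)

def global_count(shard_results):
    """Global part: VTGate sums the incoming per-shard counts."""
    return sum(shard_results)

shards = [
    [("alice",), ("bob",)],               # shard 1: 2 users
    [("carol",)],                         # shard 2: 1 user
    [("dave",), ("erin",), ("frank",)],   # shard 3: 3 users
]

local = [shard_count(s) for s in shards]  # [2, 1, 3]
print(global_count(local))                # 6 — same as count(*) over all rows
```

The key property is that the global merge function (SUM) differs from the local one (COUNT); the decomposition table in the takeaways below generalises this per aggregate.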

The worked example pins the mechanism concretely. Two tables:

  • order (sharded by id) — "each order comes from a single office."
  • order_line (sharded by its own id, not by order_id) — "each order can have one or more corresponding order_line rows."

"order is [sharded] by id, and order_line is sharded by its own id. If it was sharded by order_id, the join could be pushed down to MySQL, since we would know that the corresponding rows existed in the same shard. Unfortunately, it isn't, so we will have to do joins between these two tables at the VTGate level."

Query: SELECT order.office, sum(order_line.amount) FROM order JOIN order_line ON order.id = order_line.order_id GROUP BY order.office. The planner (reading the query back-to-front) builds a nested-loop join plan:

  1. LHS route: scatter query against the order keyspace selecting order.office, order.id, and count(*) grouped by order.office, order.id. The order.id is retained in the group key because it's the join key; count(*) tracks how many pre-aggregated rows each group represents.
  2. RHS route: issued per row emerging from LHS, selecting sum(order_line.amount) with order_line.order_id = ? bound.
  3. VTGate Project operator: multiplies sum(order_line.amount) from RHS by LHS's count(*) — correcting for the group-count that the LHS pre-aggregation collapsed.
  4. VTGate group-by: SUM(SUM(order_line.amount) * count(*)) grouped by order.office — the global aggregation step.

The non-obvious move is step 3's multiplier. If the LHS had been left un-aggregated (one row per order.id), the join itself would multiply each RHS sum(order_line.amount) by the number of LHS rows it joins against — correct by construction. Once the LHS is pre-aggregated, that multiplier is no longer carried implicitly by the row count; it must be materialised as an explicit column (count(*)) and reapplied at projection time. This is the specific transformation Galindo-Legaria & Joshi's algorithm licenses: local aggregation + explicit count column + projection-time multiplication is algebraically equivalent to post-join aggregation, but cheaper because the join sees fewer rows on the LHS.
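The equivalence can be checked mechanically. The sketch below is plain Python, not the Vitess planner; the toy data includes a duplicate LHS row purely so that count(*) exercises a value greater than one (in the post's schema order.id is a unique key, so the multiplier there degenerates to 1 per group):

```python
# Equivalence check for the Galindo-Legaria & Joshi rewrite:
# pre-aggregating the LHS with an explicit count(*) column and multiplying
# at projection time equals joining first and aggregating last.
from collections import defaultdict

# order(id, office) — duplicate row added to make count(*) > 1.
orders = [(1, "NYC"), (1, "NYC"), (2, "NYC"), (3, "SF")]
# order_line(order_id, amount)
order_lines = [(1, 10), (1, 5), (2, 7), (3, 2)]

# Naive plan: nested-loop join first, then GROUP BY office, SUM(amount).
naive = defaultdict(int)
for oid, office in orders:
    for line_oid, amount in order_lines:
        if line_oid == oid:
            naive[office] += amount

# Rewritten plan:
# 1. LHS pre-aggregated per (office, id), carrying count(*).
lhs = defaultdict(int)
for oid, office in orders:
    lhs[(office, oid)] += 1
# 2. RHS: sum(order_line.amount) per bound order_id.
rhs_sum = defaultdict(int)
for line_oid, amount in order_lines:
    rhs_sum[line_oid] += amount
# 3. Project: SUM(order_line.amount) * count(*), then 4. global GROUP BY office.
rewritten = defaultdict(int)
for (office, oid), cnt in lhs.items():
    rewritten[office] += rhs_sum[oid] * cnt

assert dict(naive) == dict(rewritten)
print(dict(rewritten))   # {'NYC': 37, 'SF': 2}
```

Dropping the `* cnt` multiplication makes the assertion fail, which is exactly the silent-wrong-answer failure mode the explicit count(*) column exists to prevent.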

The evalengine is named as the VTGate-side component that evaluates the Project operator's arithmetic (SUM × count(*)) locally — "it allows the use of the evalengine mentioned above to evaluate arithmetic operations at the vtgate level." Cross-reference to the 2025-04-05 evalengine canonicalisation (sources/2025-04-05-planetscale-faster-interpreters-in-go-catching-up-with-cpp) — this post gives the canonical why evalengine matters (post-route arithmetic on rows emerging from MySQL) three years before the deep-dive on how it's implemented.

The meta-frame closes the post: "This experience is one I've had many times in the past. Someone out there has done a ton of work on something closely related to what we are doing, and all we have to do is adapt the algorithm to our circumstances. For the type of work that we are doing, trying to keep up to date with academia just makes sense." Canonical new patterns/research-to-production-algorithm-adoption pattern: industrial system planners benefit from reading query-optimisation literature; the adaptation cost is small relative to the win.

Key takeaways

  • Every cross-shard aggregate splits into a local part and a global part. Canonical local-global aggregation split: aggregate functions are decomposed so the per-shard contribution is computable locally (pushed to MySQL) and the cross-shard merge is computable at VTGate. COUNT → local COUNT + global SUM; SUM → local SUM + global SUM; MIN/MAX → local MIN/MAX + global MIN/MAX; AVG → local SUM + local COUNT + global SUM(sums) / SUM(counts). Taylor: "if the user asked for SELECT count(*) FROM user, Vitess will send down a count(*) to each shard, and then do a SUM on all incoming answers." This is the canonical wiki framing of the primitive, previously referenced implicitly across scatter-gather and cross-shard-query pages without a definitional home.
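AVG is the instructive case in that table, because its local and global parts use different functions and the naive merge is wrong. A minimal sketch (illustrative Python with made-up shard data, not Vitess code):

```python
# AVG decomposition: each shard returns (sum, count); the coordinator
# merges as SUM(sums) / SUM(counts). Averaging the per-shard averages
# would be wrong whenever shards hold different row counts.

def shard_partial_avg(values):
    # Local part pushed to MySQL: SUM and COUNT per shard.
    return (sum(values), len(values))

def global_avg(partials):
    # Global part at VTGate: merge sums and counts, divide once.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

shard_a = [10, 20]   # local avg 15, but 2 rows
shard_b = [30]       # local avg 30, but 1 row
partials = [shard_partial_avg(shard_a), shard_partial_avg(shard_b)]

print(global_avg(partials))   # 20.0 — correct global average
print((15 + 30) / 2)          # 22.5 — wrong: average of averages
```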

  • Pushing aggregation under a join is the hard extension. Canonical concepts/push-aggregation-under-join concept: when a join cannot be pushed down (because shard keys don't align), and the query has an outer GROUP BY + aggregate, the naive plan is join first, aggregate last. The Galindo-Legaria & Joshi algorithm licenses a rewrite: pre-aggregate on each side of the join (using a count(*) column to track group cardinality), then project the multiplication at post-join time to restore algebraic equivalence. The cost saving: the join sees fewer rows on the pre-aggregated side, because multiple rows with the same join-key + group-key are collapsed into one row + a cardinality count.

  • The count(*) column is the algebraic trick that makes it work. Without it, "the join would multiply each RHS sum(order_line.amount) by the row count it's joined against" — correct only when the LHS is not pre-aggregated. Once pre-aggregated, the multiplier is explicit as a column, carried through the join, and re-applied at the Project operator. The Project operator uses evalengine to evaluate SUM(order_line.amount) * count(*) at VTGate. Without this explicit column, the pre-aggregation would silently produce wrong answers for SUM on the join's RHS.

  • Nested-loop join is the join shape at VTGate for unalignable shard keys. Canonical concepts/nested-loop-join concept (previously referenced passingly across the wiki; this post canonicalises it). Taylor: "The join is a nested loop join, which means that we'll execute the query on the left-hand side (LHS) of the Join, and using that result we'll issue queries on the right-hand side (RHS), one query per row." Each LHS row produces an RHS query bound on the join column. O(N) RHS queries for N LHS rows — why shrinking the LHS cardinality via pre-aggregation is a large win rather than a small one. If pre-aggregation collapses 1M orders into 1K (office, id) groups, the RHS issues 1K queries instead of 1M.
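The O(N)-queries arithmetic can be made concrete. The numbers below are hypothetical (the post reports no cardinalities); duplicates in the (office, id) group key are what pre-aggregation collapses:

```python
# One bound RHS query is issued per LHS row, so collapsing the LHS from
# N rows to G groups cuts the nested-loop join's RHS query count N -> G.
from collections import Counter

# Hypothetical LHS rows (office, order_id), with repeated group keys.
lhs_rows = [("NYC", 1)] * 500 + [("NYC", 2)] * 300 + [("SF", 3)] * 200

unaggregated = len(lhs_rows)       # 1000 RHS queries without pre-aggregation
groups = Counter(lhs_rows)         # (office, id) -> count(*) per group
pre_aggregated = len(groups)       # 3 RHS queries after pre-aggregation

print(unaggregated, pre_aggregated)   # 1000 3
```

When the group key is nearly unique per LHS row (the high-cardinality caveat below), `len(groups)` approaches `len(lhs_rows)` and the rewrite buys nothing.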

  • Research-to-production: read the papers. Canonical patterns/research-to-production-algorithm-adoption pattern. Taylor: "I love my job. One of the best feelings is when I find an interesting paper and use it to solve a real problem. It feels like I found a cheat code. Instead of having to do a lot of hard thinking, I can just stand on the shoulders of really big people and take a shortcut." The database-systems community has been publishing query-optimisation work for 40+ years; Vitess's planner has been re-harvesting it. This post is the canonical wiki instance of a Vitess maintainer explicitly crediting an academic paper with the production algorithm (cf. the evalengine post's Brunthaler citation at a different altitude).

  • VTGate is not a dumb proxy. "Vitess is not just a dumb proxy layer though — it can also run some of the operations instead of sending them on." Canonical wiki framing extending the VTGate query planner concept: VTGate owns joins, filters, sorts, group-by, arithmetic (via evalengine), subqueries — every relational operator MySQL can do, VTGate can also do when data can't be colocated. The planner's job is to push as much as possible down to MySQL (which is faster), but VTGate is the fallback executor when pushdown is impossible. This is the architectural inversion from a "connection pooler / query router" framing of VTGate to a "distributed query engine" framing.

  • The aggregation-pushdown-under-join rewrite composes with the VTGate executor architecture. The plan tree (LHS Route → NestedLoopJoin → RHS Route → Project → GroupBy) maps each node to a concrete Vitess executor primitive. The LHS group-by is a single-shard operation (or a scatter with per-shard group-by done by MySQL, per the base local-global split). The NestedLoopJoin is a VTGate operator. The Project is a VTGate operator using evalengine. The top GroupBy is a VTGate operator doing the global merge. The algorithm's output is a concrete plan tree built from existing Vitess operators — no new executor infrastructure was needed, just a new planner rewrite rule.

  • Production artifact: vitessio/vitess #9643. Canonical wiki pointer to the concrete Vitess PR that implemented the pushdown. Research-to-production is falsifiable here: the paper is public, the algorithm is public, the implementation is public, and the worked example in the post matches the PR's test cases.

Systems / concepts / patterns surfaced

  • New systems: (none — systems/vitess, systems/mysql, systems/vitess-evalengine, systems/planetscale all canonical already).

  • New concepts (4):

  • concepts/local-global-aggregation-decomposition — the algebraic split of aggregate functions into per-shard-computable (local) and cross-shard-merge (global) parts; the foundational primitive for aggregation pushdown in sharded databases.
  • concepts/push-aggregation-under-join — Galindo-Legaria & Joshi's rewrite: when joins can't be pushed down, pre-aggregate each side of the join with an explicit count(*) column and re-apply the multiplication at post-join projection time.
  • concepts/vtgate-query-planner — the Vitess planner that decides which relational operators go to MySQL (push-down) vs VTGate; canonical concept page extracted from implicit references across the wiki.
  • concepts/nested-loop-join — the iterative-lookup join shape (for each LHS row, issue a query on RHS bound to the join key); the canonical VTGate join shape when the two sides can't be colocated on a single shard.

  • New patterns (3):

  • patterns/local-global-aggregation-split — the production-side pattern of splitting COUNT/SUM/MIN/MAX/AVG into decomposable two-tier execution (shard-local + coordinator-global); generalises beyond Vitess to every distributed-query engine (Spark, Presto, BigQuery, Trino, ClickHouse-distributed).
  • patterns/aggregation-pushdown-under-join — the canonical production instantiation of Galindo-Legaria & Joshi: pre-aggregate each join side + carry count(*) column + Project with multiplication at post-join time. Named as the worked Vitess instance of a general distributed-query-optimisation principle.
  • patterns/research-to-production-algorithm-adoption — the meta-pattern of engineering teams reading academic papers in adjacent areas + adapting algorithms + crediting sources publicly. Taylor's voice canonicalises the argument that this is cost-effective rather than exotic.

  • Extended:

  • systems/vitess — new Seen-in entry framing this as another canonical Vitess-internals disclosure (planner / aggregation-pushdown axis), complementing the evalengine (expression evaluation), VReplication / VDiff / MoveTables (data motion), Consistent Lookup Vindex (cross-shard writes), Throttler (load admission), VStream/VGTID (public CDC), consensus series (leader election), and backup (VTBackup) internals disclosures.
  • systems/vitess-evalengine — new Seen-in entry canonicalising the why evalengine exists (post-route arithmetic at VTGate, including the Project operator's multiplication in aggregation-pushdown-under-join) four years before the 2025-04-05 deep-dive.
  • systems/mysql — extended with the "MySQL is much faster than Vitess at aggregation" framing; canonical wiki datum for why Vitess's design pushes down aggressively.
  • concepts/cross-shard-query — new Seen-in entry canonicalising the aggregation-pushdown-under-join mitigation as one of the ways Vitess compensates for unavoidable cross-shard work.
  • concepts/scatter-gather-query — new Seen-in entry canonicalising that scatter-gather + local-global aggregation is the baseline Vitess already did; the new work is the join-under-aggregation case.

Operational numbers

None disclosed. This post is an algorithm / architecture exposition without production telemetry. No cardinality measurements on the before/after plans, no latency comparisons, no customer workload examples, no regression tests. The PR (vitessio/vitess #9643) contains test cases but no benchmark numbers. The theoretical win is bounded by the pre-aggregation's group-collapsing factor: if (office, id) groups are 1000× smaller than unaggregated rows, the nested-loop join issues 1000× fewer RHS queries. The practical win depends on the workload.

Caveats

  • No production numbers. The post is an algorithm walkthrough; no customer workload demonstrates the win. (Common for Vitess-internals posts at the PlanetScale blog; the PR's tests are the only falsifiable artifact.)
  • Nested-loop join is the only join shape discussed. VTGate can also do sort-merge and hash joins in some cases; the post doesn't address whether the aggregation-pushdown rewrite applies identically or differently for those join shapes. The Galindo-Legaria & Joshi paper treats joins generically; Vitess's instantiation is specifically nested-loop.
  • The count(*) multiplier trick is stated without formal proof of algebraic equivalence. The post walks through the worked example demonstrating correctness on SUM; the paper provides the general proof. The post trusts the paper.
  • AVG is not walked through. Taylor canonicalises SUM + count(*) trick for SUM aggregation. AVG is not explicitly discussed in the post — it's algebraically SUM/COUNT with the same split but carries its own gotchas around weighted averaging across groups; whether Vitess handles AVG via the same rewrite or a separate planner path isn't stated.
  • Joins with more than two tables. The post's worked example is a two-way join. The general algorithm extends to N-way joins but the combinatorics of where to push aggregation grow; Vitess's planner presumably handles this but the post doesn't say.
  • Correlated subqueries. Taylor mentions: "if you had joins or subqueries or anything else other than a simple SELECT ... FROM ... GROUP BY, with a single table, most of the time you were out of luck." The paper's title includes "Subqueries" but this post's worked example is a join. Whether the same machinery handles correlated subqueries isn't explicit.
  • Planner decision to apply the rewrite. The algorithm "sometimes" saves cost. When it doesn't (e.g. when LHS group-collapsing factor is ~1), the pre-aggregation adds overhead without reducing join-RHS query count. The post doesn't say whether Vitess's planner makes this decision statically or adaptively, nor what statistics it uses.
  • Interaction with Reference Tables / Materialize. Vitess 21 shipped reference-table materialization (concepts/reference-table-materialization); when a join's RHS is a reference table (materialized on every shard), the join can be pushed down and this whole rewrite doesn't apply. The post predates reference-table materialization and doesn't discuss the interaction.
  • LHS cardinality reduction depends on query shape. The win is proportional to (original LHS rows) / (group-collapsed LHS rows). For a query grouped by a high-cardinality key (e.g. GROUP BY order.id itself), the reduction is zero and the rewrite is pure overhead. The post's GROUP BY order.office example assumes office has much lower cardinality than order.id.
