PLANETSCALE 2022-12-15

PlanetScale — What is a query planner?¶

Summary¶

Andres Taylor (PlanetScale / Vitess core, 2022-12-15) publishes a pedagogy-altitude explainer framing query planners as compilers: both take a higher-level textual input (SQL / source code) and produce a lower-level executable artefact (query plan / machine code) via the same four phases — lex+parse, semantic analysis, optimisation, code generation. The post's load-bearing original contribution is canonicalising join-order selection as a path-finding problem — visiting all tables in a join graph via the cheapest possible edge sequence, with the Vitess-specific cost metric being number of network calls to MySQL shards. Names two canonical Vitess optimisations: predicate rewriting to unlock indexes (converting WHERE (id=5 AND name='Toto') OR (id=5 AND name='Mumin') to WHERE id=5 AND (name='Toto' OR name='Mumin') — factoring the indexed column out of the OR) and local-global aggregation (push as much aggregation into MySQL as possible, then "aggregate the aggregates" at VTGate). Introduces the TPC-H #8 join-graph diagram as the canonical visual for the path-finding framing.

(Source: sources/2026-04-21-planetscale-what-is-a-query-planner.)

Key takeaways¶

Query planner = compiler with a different output format. Both take a higher-level textual input and produce a lower-level executable artefact; both structure the work as four sequential phases (lex+parse → semantic analysis → optimisation → code generation); both exploit an intermediate representation (IR) that diverges from the input language to make optimisations easier. The analogy is not a metaphor — it is a shared architectural template across two traditionally separate fields. (Source: sources/2026-04-21-planetscale-what-is-a-query-planner.)
Lex + parse is uncontroversially shared. Both tools tokenise the input, check it against a grammar, and emit an abstract syntax tree (AST). "There is no interesting difference between a compiler and a planner here." The post's worked example is SELECT name, avg(salary) FROM employees JOIN salary_info ON id = empid → AST with a SELECT root, JOIN child, and column-reference leaves. (Source: sources/2026-04-21-planetscale-what-is-a-query-planner.)
Semantic analysis binds symbols to schema. Where compilers bind identifiers to classes / methods / variables, planners bind column references to tables, infer types for expressions, and validate that column names resolve. "A query planner does almost exactly the same thing here. Instead of searching for classes and methods, it would bind to tables and columns, but the idea is the same." After semantic analysis the tree is enriched — every column node knows its source table, every expression knows its type. (Source: sources/2026-04-21-planetscale-what-is-a-query-planner.)
Optimisation is where the planner earns its keep. Uses algorithms + heuristics to pick indexes, join algorithms, join order, and decomposition strategies. Worked example: Vitess rewriting WHERE (id=5 AND name='Toto') OR (id=5 AND name='Mumin') — which cannot use an id index because of the OR at the top — into the equivalent WHERE id=5 AND (name='Toto' OR name='Mumin'). The rewritten form has id=5 at the top, so the planner can use the id index to locate the row, then apply the name filter. Both predicates evaluate to the same row set by boolean algebra, but only the rewritten form is sargable. (Source: sources/2026-04-21-planetscale-what-is-a-query-planner.)
Join order is a path-finding problem. Canonical framing: "Say we want to join three tables: A with B, and B with C. We could start by joining A with B, and take the output of that and join it with C. Or we can start from the other side — join B with C and then join that result with A. The intermediate state needed is where the big difference comes in." The planner's search space is the set of all join trees; the cost function scores each tree by the cumulative intermediate state size along its edges. TPC-H query #8 is the canonical visual (8-table join graph; the planner must visit every table via the cheapest edge sequence). (Source: sources/2026-04-21-planetscale-what-is-a-query-planner.)
Vitess's cost metric is network calls to MySQL. "In Vitess, our query plans are partly executed on the SQL proxy layer, called VTGate, and partly on the individual shards. Probably the most important optimization we perform is to push down as much work as possible to MySQL. If we can perform a join or a filter in MySQL, that is always going to be faster than fetching all the individual rows and performing the same operation on the VTGate side. So, during query planning, we are searching for the query plan that has the least number of network calls." This is the shard-specific inversion of single-node cost models — single-node planners minimise CPU + I/O; cross-shard planners minimise network hops because each hop costs milliseconds versus microseconds for local operations. (Source: sources/2026-04-21-planetscale-what-is-a-query-planner.)
Aggregation strategy: aggregate the aggregates. "When planning aggregations, our strategy is to do as much aggregation as possible in MySQL, and then aggregate the aggregates. The planner rewrites the aggregation that the user asked for into smaller aggregations and sends those to MySQL. The results of these queries are then used as inputs and summarized into the final aggregation result." This is the pedagogy-altitude framing of local-global aggregation — the canonical distributed-query primitive Vitess has shipped since its early days. (Source: sources/2026-04-21-planetscale-what-is-a-query-planner.)
Code generation = the executable plan tree. "The query planner generates a plan that specifies the exact steps that the database engine should take to execute the query. This plan may include operations such as index scans, join algorithms, and sorting algorithms, as well as other details such as the order in which the operations should be performed." The output is not machine code but a tree of executor operators; the difference from a compiler is that this tree is interpreted, not assembled. (Source: sources/2026-04-21-planetscale-what-is-a-query-planner.)

Systems extracted¶

Vitess — the MySQL-sharding substrate whose planner Taylor is writing about.
VTGate — the proxy tier that hosts the Vitess query planner.
MySQL — the push-down target; the planner's core optimisation is "delegate to MySQL."
PlanetScale — the vendor context.

Concepts extracted¶

concepts/query-planner — new canonical concept: the component that takes a SQL string and emits an execution plan; shared four-phase architecture with compilers.
concepts/compiler-planner-analogy — new canonical concept: the structural isomorphism between compiler phases (lex → parse → sem-analysis → optimisation → codegen) and query-planner phases (lex → parse → sem-analysis → optimisation → codegen).
concepts/join-order-as-pathfinding — new canonical concept: join-order selection is a graph-search problem where nodes = tables, edges = join predicates, cost = cumulative intermediate state size (or in Vitess's case, cumulative network hops).
concepts/abstract-syntax-tree — extended with Vitess-specific framing (post's worked SELECT name, avg(salary) FROM employees JOIN... AST).
concepts/vtgate-query-planner — extended (pedagogy-altitude framing from the same author who wrote the optimisation-deep-dives and phase-ordering incident retrospective).
concepts/push-aggregation-under-join — extended (pedagogy-altitude citation).
concepts/local-global-aggregation-decomposition — extended (canonical "aggregate the aggregates" framing).
concepts/planner-phase-ordering — extended (pedagogy-altitude Seen-in).

Patterns extracted¶

patterns/aggregation-pushdown-under-join — extended (pedagogy-altitude citation).
patterns/local-global-aggregation-split — extended (canonical "aggregate the aggregates" framing).

Operational numbers / context¶

TPC-H query #8 (8-table join) as canonical visual for join-order search space.
Single query with many tables + many indexes: "the available options can quickly run into the thousands and even millions of alternatives."
Network calls (cross-shard) are the dominant cost metric for VTGate; local-node operations are effectively free by comparison.
WHERE (id=5 AND name='Toto') OR (id=5 AND name='Mumin') → rewrite to WHERE id=5 AND (name='Toto' OR name='Mumin') — post's worked predicate-rewrite example.

Caveats¶

Pedagogy / framing essay — no production numbers, no incident retrospective, no code.
Four-phase framing is a teaching simplification; real planners (Cascades, Volcano, Vitess's Gen4) blend phases via fixed-point rewriters and memoisation; see concepts/planner-phase-ordering for the more-complex reality.
Join-order "path-finding" framing is descriptive; the post does not name the underlying algorithm (dynamic programming à la System-R, or heuristic search, or cost-based rule engines like Cascades).
"Millions of alternatives" claim is folklore — the exact count depends on table count × index count × join-predicate count and is planner-specific.
The OR-factoring rewrite example is canonical SQL-rewriting lore but the post does not disclose whether Vitess actually performs this rewrite or whether it is illustrative.
Post predates the 2024-07-22 phase-ordering incident retrospective from the same author by ~19 months; the aggregation pushdown narrative here is the clean pedagogy story before the production phase-ordering bug surfaced.

Source¶

Original: https://planetscale.com/blog/what-is-a-query-planner
Raw markdown: raw/planetscale/2026-04-21-what-is-a-query-planner-79777b96.md