PlanetScale — Faster interpreters in Go: Catching up with C++¶
Summary¶
Vicent Martí (PlanetScale) retrospective on rewriting Vitess's SQL
expression evaluation engine — the component inside every
vtgate that evaluates scalar SQL sub-expressions (WHERE,
HAVING, GROUP BY) that cannot be pushed down to MySQL — from an
AST-based interpreter into a bytecode-less virtual machine
implemented in pure Go. The new VM catches up with MySQL's native
C++ expression engine on the same benchmarks (geomean −48.60%
sec/op vs the old AST interpreter, edging past MySQL's −43.58%
on the same baseline),
allocates zero memory for most expressions, and is easier to
maintain than its predecessor. The article is substantive on three
orthogonal axes: (1) a full taxonomy of interpreter designs
(AST / bytecode-VM / JIT) with explicit trade-offs; (2) a novel
VM design — "compile to a slice of function pointers, not to
bytecode" — that exploits Go's closure semantics as a
poor-man's JIT to capture instruction arguments at compile
time; (3) static type specialization driven by Vitess's
semantic analyzer, which derives operand types from the MySQL
information schema and emits already-specialized instructions
instead of relying on runtime rewriting (quickening). The design
retains the old AST interpreter as a deoptimization fallback
for SQL corner cases where static types are overturned at runtime
(e.g. -BIGINT_MIN promotes to DECIMAL) and as a fuzz-oracle
sibling for continuous differential testing.
Key takeaways¶
- SQL expression evaluation is a hot path in vtgate. Vitess's sharding proxy must evaluate scalar SQL expressions locally whenever a sub-expression operates on aggregation output (e.g. HAVING avg_price > 100 with AVG aggregated across shards in Go). These expressions are evaluated "once or even more than once for every returned row, so in order to not introduce additional overhead, evaluation needs to be as quick as possible." (Source: sources/2025-04-05-planetscale-faster-interpreters-in-go-catching-up-with-cpp)
- Three-way taxonomy of dynamic-language execution strategies canonicalised. In increasing performance + complexity: AST interpreter (recursively walk the concepts/abstract-syntax-tree), bytecode virtual machine (compile the AST to opcodes, simulate a CPU), JIT compiler (compile bytecode to native instructions). Ruby (MRI → YARV), Python, and every modern JavaScript engine transitioned AST → bytecode-VM for the instruction-dispatch win, even though bytecode doesn't help individual ops that are already high-level. "A lot of it boils down to instruction dispatching, which can be made very fast."
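The taxonomy's starting point can be sketched in a few lines of Go (node names Lit, Add, Mul are invented for illustration, not Vitess types): every evaluation is a recursive walk with one dynamic dispatch per node, per row — exactly the overhead the bytecode and closure designs attack.

```go
package main

import "fmt"

// Minimal AST-walking interpreter for integer arithmetic: each Eval
// call dispatches dynamically through the interface and recurses into
// its children. Node kinds are illustrative, not Vitess code.
type Node interface{ Eval() int64 }

type Lit struct{ V int64 }
type Add struct{ L, R Node }
type Mul struct{ L, R Node }

func (n Lit) Eval() int64 { return n.V }
func (n Add) Eval() int64 { return n.L.Eval() + n.R.Eval() }
func (n Mul) Eval() int64 { return n.L.Eval() * n.R.Eval() }

func main() {
	// (2 + 3) * 4
	expr := Mul{L: Add{L: Lit{2}, R: Lit{3}}, R: Lit{4}}
	fmt.Println(expr.Eval()) // 20
}
```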
- Quickening (runtime bytecode rewriting) rejected in favour of static type specialization. Brunthaler's Efficient Interpretation using Quickening rewrites generic opcodes (ADD) into type-specialized variants (ADD_INT64) at runtime once operand types stabilise. Vitess replaces this with a stronger invariant: "the semantic analysis we perform in Vitess is advanced enough that, through careful integration with the upstream MySQL server and its information schema, it can be used to statically type the AST of the SQL expressions we were executing." Types flow from the schema + query planner into every sub-expression at compile time, producing already-specialized bytecode with no runtime type switching.
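A sketch of what planner-driven specialization buys (the Type enum, stacks, and emitAdd are invented for illustration, not Vitess's real API): because semantic analysis fixes each operand's type before execution, the compiler selects the specialized instruction once, at compile time, and the emitted closure contains no type switch at all.

```go
package main

import "fmt"

// Hypothetical static type specialization: emitAdd runs at compile
// time, when the operand type is already known, so the returned
// instruction does no runtime type checking or boxing.
type Type int

const (
	TypeInt64 Type = iota
	TypeFloat64
)

type VM struct {
	istack []int64   // unboxed integer stack
	fstack []float64 // unboxed float stack
}

type Instr func(*VM)

func emitAdd(t Type) Instr {
	switch t {
	case TypeInt64:
		return func(vm *VM) { // ADD_INT64
			n := len(vm.istack)
			vm.istack[n-2] += vm.istack[n-1]
			vm.istack = vm.istack[:n-1]
		}
	default:
		return func(vm *VM) { // ADD_FLOAT64
			n := len(vm.fstack)
			vm.fstack[n-2] += vm.fstack[n-1]
			vm.fstack = vm.fstack[:n-1]
		}
	}
}

func main() {
	vm := &VM{istack: []int64{40, 2}}
	emitAdd(TypeInt64)(vm) // specialization chosen once, not per row
	fmt.Println(vm.istack) // [42]
}
```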
- Big-switch VM loop rejected — Go compiler makes it worse. The classic C/C++ VM implementation is one function with a giant switch over opcodes. LuaJIT author Mike Pall's 2011 lua-users post catalogues the problems (register spillage on every branch, hot-vs-cold-branch confusion, compiler struggles with large functions). Go makes them "much worse" — the Go compiler "has historically opted for" fast compile times over optimization, and switch jump-table optimization was implemented surprisingly late and remains fiddly. Branches are often dispatched by binary search rather than a jump table; verifying which you got is only possible by reading the generated assembly.
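For contrast, the rejected classic design reduced to a toy (opcodes invented for illustration): one loop, one switch. In C this typically compiles to a jump table; the Go compiler may instead emit a binary search over the case values, and the single large function strains its register allocator.

```go
package main

import "fmt"

// Toy version of the classic big-switch bytecode VM the post rejects
// for Go. Opcodes are invented for illustration.
type op byte

const (
	opPush1 op = iota // push the constant 1
	opAdd             // pop two values, push their sum
	opHalt            // stop; top of stack is the result
)

func run(code []op) int64 {
	var stack []int64
	for ip := 0; ; ip++ {
		switch code[ip] { // the dispatch point the Go compiler handles poorly
		case opPush1:
			stack = append(stack, 1)
		case opAdd:
			n := len(stack)
			stack[n-2] += stack[n-1]
			stack = stack[:n-1]
		case opHalt:
			return stack[len(stack)-1]
		}
	}
}

func main() {
	fmt.Println(run([]op{opPush1, opPush1, opAdd, opHalt})) // 2
}
```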
- Tail-call continuation loops work in C/C++ and Python 3.14 but not Go. The modern interpreter state of the art outside Go is "continuation-style evaluation loops" — each opcode is a free-standing function whose return value is a callback to the next step; LLVM's musttail attribute forces tail-call conversion so the runtime effectively jumps between opcode functions (Python 3.14's interpreter ships this design for up to 30% speedup). This design does not work in Go — the Go compiler can sometimes emit tail calls but "needs to be tickled in just the right way, and this implementation simply does not work in practice unless the tail-calls are guaranteed at compilation time."
- "Compile to a slice of function pointers" — the poor-man's JIT that does work in Go. Martí's key design move: don't emit bytecode at all. Instead, emit each instruction as a closure pushed onto a []func(*VirtualMachine) int slice. The VM loop becomes trivial:
func (vm *VirtualMachine) execute(p *Program) (eval, error) {
code := p.code
ip := 0
for ip < len(code) {
ip += code[ip](vm)
if vm.err != nil {
return nil, vm.err
}
}
// ...
}
Each instruction returns an offset: 1 for sequential,
positive/negative for control flow. Two independent wins:
the VM is a few lines of code with no switch; the
compiler is trivial because there is no bytecode encoding
to keep in sync with the VM — no opcode table, no
instruction format, no encode/decode layer. "Developing the
compiler means developing the VM simultaneously."
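Control flow falls out of the same offset convention. A conditional jump is just another closure with its relative target baked in at compile time — a sketch under simplified stand-in types (a bool stack instead of Vitess's eval values; emitJumpIfFalse is illustrative):

```go
package main

import "fmt"

// Sketch of control flow in the slice-of-closures VM. The returned
// int is the relative offset to the next instruction.
type VM struct {
	stack []bool
	sp    int
}

type Instr func(*VM) int

// emitJumpIfFalse bakes the relative jump target into the closure,
// the same way bytecode would encode it as an operand.
func emitJumpIfFalse(offset int) Instr {
	return func(vm *VM) int {
		vm.sp--
		if !vm.stack[vm.sp] {
			return offset // skip ahead, e.g. over a THEN branch
		}
		return 1 // fall through
	}
}

func execute(code []Instr, vm *VM) {
	for ip := 0; ip < len(code); {
		ip += code[ip](vm)
	}
}

func main() {
	var trace []string
	code := []Instr{
		func(vm *VM) int { vm.stack[vm.sp] = false; vm.sp++; return 1 }, // push condition
		emitJumpIfFalse(2), // condition is false: skip the next instruction
		func(vm *VM) int { trace = append(trace, "then"); return 1 },
		func(vm *VM) int { trace = append(trace, "end"); return 1 },
	}
	execute(code, &VM{stack: make([]bool, 8)})
	fmt.Println(trace) // [end]
}
```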
- Go closures capture instruction arguments for free. The traditional C objection — "without bytecode you can't have instructions with arguments" — evaporates in Go because closures close over their lexical scope. The compiler emits a closure containing a copy of the arguments; no encoding layer needed. This example pushes a TEXT SQL value from column offset with collation col, both statically baked into the generated instruction:
func (c *compiler) emitPushColumn_text(offset int, col collations.TypedCollation) {
c.emit(func(vm *VirtualMachine) int {
vm.stack[vm.sp] = newEvalText(vm.row[offset].Raw(), col)
vm.sp++
return 1
})
}
- Deoptimization-to-AST handles dynamic-type corner cases. A small set of SQL operations can promote types on values (not on declared types) — canonical example: negating BIGINT_MIN = -9223372036854775808 yields a DECIMAL, not a BIGINT, because |MIN_INT64| doesn't fit in a signed 64-bit integer. Rather than re-introduce runtime type switches, the VM bails out of the specialized instruction (vm.err = errDeoptimize) and falls back to the old AST interpreter, which has always type-switched at runtime. This is "very similar to what JIT compilers do when they detect that the runtime type of a value no longer matches the generated code they've emitted; they fall back from the native code to the virtual machine." Trade-off: the AST interpreter "can never be removed from Vitess."
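The bail-out can be sketched as a single specialized instruction (errDeoptimize and the VM fields are illustrative, patterned on the post's vm.err convention, not Vitess's real definitions):

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// Sketch of deoptimization inside a specialized instruction. NEG_INT64
// assumes the result stays a BIGINT; for the one input where that is
// false it sets an error telling the caller to re-run the expression
// on the runtime-typed AST interpreter.
var errDeoptimize = errors.New("deoptimize: fall back to AST interpreter")

type VM struct {
	stack []int64
	sp    int
	err   error
}

func negInt64(vm *VM) int {
	v := vm.stack[vm.sp-1]
	if v == math.MinInt64 {
		// -(-9223372036854775808) does not fit in int64: the true SQL
		// result is a DECIMAL, so bail out of the specialized path.
		vm.err = errDeoptimize
		return 1
	}
	vm.stack[vm.sp-1] = -v
	return 1
}

func main() {
	vm := &VM{stack: []int64{math.MinInt64}, sp: 1}
	negInt64(vm)
	fmt.Println(vm.err) // deoptimize: fall back to AST interpreter
}
```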
- AST interpreter retention pays for itself on two more axes. Beyond deoptimization: (a) single-shot evaluations (e.g. constant folding in the planner) skip the compile-plus-VM cost and run on the AST directly; (b) the AST becomes a fuzz-oracle sibling of the VM — fuzzing both against each other "has resulted in an invaluable tool for detecting bugs and corner cases." Vitess's test suite + fuzzer routinely find bugs in MySQL's own C++ evaluation engine that they upstream (collation bug PR 602, INSERT SQL function PR 517, substring-search PR 515).
- Benchmark result: Go VM catches up with C++. On five queries ranging from complex arithmetic to integer comparison, geometric mean improvement vs the old AST interpreter: statically typed AST = unstated but intermediate; VM = −48.60%; MySQL C++ = −43.58%. The VM is faster than MySQL on four of five benchmarks (comparison_i64, comparison_u64, comparison_dec, comparison_f) and essentially tied on complex_arith (50.77n vs 49.40n). The VM also allocates zero memory on four of five benchmarks (100.00% drops in B/op and allocs/op) thanks to fully specialized instructions — a nice side effect of static typing eliminating runtime boxing.
- JIT explicitly rejected as next step. Martí's framing: JIT wins when instruction dispatch dominates because each opcode is low-level (an ADD becomes a native ADD). For SQL expressions, most opcodes remain high-level ("match this JSON object with a path", "add two fixed-width decimals together") — instruction dispatch is <20% of runtime (benchmarked), "not the number you're targeting before you start messing around with raw assembly for a JIT." First canonical wiki statement of the dispatch-overhead-share threshold as the decision rule for when JIT is justified.
Architecture at a glance¶
SQL query text
│
▼
Parser → AST
│
▼
Semantic analyzer + information schema
(static types propagated through AST)
│
┌───────┴───────┐
▼ ▼
AST interpreter VM compiler
(one-shot / fallback) │
▼
[]func(*VM) int ← slice of closures
│
▼
VirtualMachine.execute
(trivial for-loop)
│
┌─────────────────┴─────────────────┐
▼ ▼
Normal path Deoptimization path
(specialized op) (fall back to AST)
Operational numbers¶
- Dispatch-overhead share: < 20% of VM runtime (measured).
- Benchmark geomean vs old AST: VM −48.60%, MySQL C++ −43.58%.
- Memory allocations, VM vs statically-typed AST (per op):
  - complex_arith — 96 B → 0 B, 9 allocs → 0
  - comparison_i64 — 16 B → 0 B, 1 alloc → 0
  - comparison_u64 — 16 B → 0 B, 1 alloc → 0
  - comparison_dec — 64 B → 40 B (−37.50%), 3 allocs → 2 (−33.33%)
  - comparison_f — 16 B → 0 B, 2 allocs → 0
- VM speedup over the original Vitess AST (old column):
  - complex_arith — 162.75n → 50.77n (−68.81%)
  - comparison_i64 — 30.30n → 16.95n (−44.08%)
  - comparison_u64 — 30.57n → 17.49n (−42.78%)
  - comparison_dec — 70.75n → 52.58n (−25.68%)
  - comparison_f — 53.05n → 25.65n (−51.64%)
- Upstream bug fixes found by the Vitess fuzzer in MySQL: 3 referenced (collation PR 602, INSERT function PR 517, substring PR 515).
Caveats¶
- Post is a retrospective, not a production-metrics disclosure. No production QPS, no per-tenant evaluation rate, no p99 impact at the vtgate level, no disclosure of query-mix composition by evaluation hotness. Benchmarks are micro-benchmarks with a sec/op geometric mean.
- MySQL baseline measurement is hand-instrumented. "These are not the total response times for a query, but the result of manual instrumentation in the mysqld server to ensure a fair comparison." Comparability is author-attested; no methodology attached.
- Deoptimization cost not quantified. How often deoptimization fires, the latency penalty when it does, AST-interpreter warm-up — all undisclosed.
- Fuzz-oracle sibling is a maintenance cost. The old AST interpreter "can never be removed" — two code paths must stay semantically consistent forever. Worth it for Martí given the fuzz-correctness payoff but an explicit trade-off.
- SQL expression sub-language is Turing-incomplete. No loops, minimal control flow. The VM's simplification — return an offset per opcode — works because flow is linear; a general-purpose language would require a more elaborate VM.
- Closure allocation not measured. Each emit* closure captures arguments; the zero-alloc claim is about evaluation, not compilation. Compile-time closure allocations amortise across all invocations of a query plan.
- Technique novelty is limited to Go. Martí notes the slice-of-closures approach appears "in the wild" in a rules-based authorization engine; what's new is applying it in Go, where (a) the compiler supports closures efficiently and (b) tail-call-based alternatives don't work.
- Go compiler gap is the load-bearing assumption. Every design choice reacts to Go's specific optimization limits (no reliable tail calls, no reliable switch jump tables, the large-function spill problem). The post is not a language-agnostic treatise — it's a "use Go's strengths, not fight them" retrospective. Conclusions may not transfer to a language with a better optimizer.
- Expected future trajectory (JIT) rejected but not permanently closed. Martí frames JIT as "needlessly complex dead optimization" for SQL specifically; the rejection is informed by the measured <20% dispatch share, not a general disavowal.
Source¶
- Original: https://planetscale.com/blog/faster-interpreters-in-go-catching-up-with-cpp
- Raw markdown:
raw/planetscale/2025-04-05-faster-interpreters-in-go-catching-up-with-c-0213a667.md
Related¶
- systems/vitess — containing system.
- systems/vitess-evalengine — the subject of the post.
- systems/mysql — reference implementation for correctness; the VM's fuzz oracle finds upstream MySQL bugs.
- systems/planetscale — PlanetScale MySQL is built on Vitess.
- concepts/abstract-syntax-tree — starting IR.
- concepts/bytecode-virtual-machine — the family the new design lives in.
- concepts/ast-interpreter — predecessor + deoptimization target.
- concepts/jit-compilation — rejected next step.
- concepts/static-type-specialization — load-bearing mechanism.
- concepts/quickening-runtime-bytecode-rewrite — rejected alternative.
- concepts/instruction-dispatch-cost — performance dimension.
- concepts/tail-call-continuation-interpreter — C/C++/Python alternative rejected for Go.
- concepts/callback-slice-interpreter — the design this post canonicalises.
- concepts/vm-deoptimization — fallback mechanism.
- concepts/go-compiler-optimization-gap — load-bearing constraint.
- concepts/jump-table-vs-binary-search-dispatch — Go switch codegen issue.
- patterns/callback-slice-vm-go — the new pattern.
- patterns/static-type-specialized-bytecode — planner-derived specialization.
- patterns/vm-ast-dual-interpreter-fallback — the deopt + oracle architecture.
- patterns/fuzz-ast-vs-vm-oracle — differential testing across interpreter generations.
- companies/planetscale