CONCEPT Cited by 4 sources
Abstract Syntax Tree¶
An Abstract Syntax Tree (AST) is an intermediate representation (IR) of a program, query, or expression, structured as a tree whose interior nodes are operators / productions and whose leaves are atomic tokens (literals, identifiers, filter terms). The tree abstracts the source — whitespace, punctuation, parse-stage-only scaffolding are dropped; only the semantic structure remains.
ASTs are the standard IR for any tool that must recurse over the input to emit something else: compilers, linters, refactoring tools, transpilers, and — the focus of this concept page — query languages that compile to a backend query document.
Why the AST is load-bearing for query-language rewrites¶
For a flat query language — e.g. assignee:@me label:support
new-project where every term is implicitly AND-joined — a list
is a sufficient IR. Each term maps independently to a backend
clause; the backend combines them linearly.
Once the language grows nesting (parentheses, arbitrary operators across fields), the list breaks. You need a structure that can represent a tree of operators with subtrees as operands. The AST is that structure.
The sequence GitHub Issues went through is the standard story (Source: sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean):
- Original: flat parser emits a list → linear map to Elasticsearch snippets → concatenate.
- First-gen upgrade (2021): comma-OR on
label:only, shoehorned onto the list representation — works for one field, can't generalize. - 2025 rewrite: parse to AST → recursive traversal → emit
nested Elasticsearch
boolquery. Both legacy flat queries and new nested queries become valid trees (legacy queries become left-leaning AND-spine trees) so the backward-compat story is "same IR, different depths" rather than "two code paths".
Worked example¶
Input:
AST (simplified shape — captures and strips parens):
root
└── and
├── filter_term(is, issue) # left
└── or # right
├── filter_term(author, deborah-digges)
└── filter_term(author, monalisa)
A recursive walker visiting this AST emits one bool-query clause
per node:
- and → {bool: {must: [<left>, <right>]}}
- or → {bool: {should: [<left>, <right>]}}
- leaf filter_term(K, V) → backend-specific clause
See patterns/ast-based-query-generation for the pattern.
AST vs parse tree¶
- Parse tree (concrete syntax tree, CST) contains every rule match the grammar makes, including punctuation nodes, whitespace nodes, list-of-one-element nodes, etc.
- AST discards parse-only scaffolding and keeps only the semantically meaningful nodes.
In parslet, the raw grammar emits a parse tree (nested Hash of
Slices); a Transform rewrites it to the AST. Some parsing
libraries collapse both steps into one.
When an AST is not warranted¶
- Truly flat DSLs (e.g., tag filters with no nesting, single-operator filters): a list/struct is enough.
- Single-use query builders where the query text maps 1:1 to backend calls by string substitution: skipping the AST saves a layer of indirection.
- Small vocabularies where the language can never grow — e.g.
a URL-parameter-driven search where every recognised parameter
is a top-level field. The moment a
(shows up in a feature request, this judgement has to be revisited.
The GitHub Issues 2021 comma-OR delivery is a case study in the second answer being correct at the time and incorrect four years later: the rewrite became inevitable when the syntax had to nest, but the interim delivery was not wasted.
Seen in¶
-
sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean — canonical query-language instance. GitHub's Issues-search rewrite replaces a flat-list IR with an AST emitted by a PEG grammar via systems/parslet, then recursively compiles the AST to a nested Elasticsearch
boolquery. -
sources/2026-04-09-meta-escaping-the-fork-webrtc-modernization — C++-codegen variant. Meta AST-parses libwebrtc headers (class declarations, struct layouts, enum values, method signatures) and uses the AST as input to a shim-boilerplate generator that emits adapter + converter + unit-test scaffolding for the WebRTC shim. Velocity jumped from 1 shim/day → 3–4 shims/day after codegen adoption. Distinct from the query-language case: here the AST is of library headers, consumed by a tool emitting more C++ code, not a query against a backend.
-
— SQL-expression-execution variant. The AST is the starting IR in Vitess's SQL expression evaluation engine; the post covers two interpretations of the same AST:
-
AST interpreter (original Vitess evalengine) — recursively walks the AST evaluating each node. Canonicalises the concepts/ast-interpreter concept as the simple starting point for dynamic-expression execution.
- Bytecode-less VM (Vitess 2025 rewrite) — compiles
the AST down to a
[]func(*VirtualMachine) intslice of Go closures (patterns/callback-slice-vm-go) driven by static type specialization derived from MySQL's information schema.
First wiki instance of the same AST consumed by two coexisting interpreters — Vitess retains the AST interpreter permanently as deoptimization fallback + fuzz-oracle sibling. Extends the AST-as-IR story beyond query-language rewrites and C++-codegen to expression-evaluation runtimes as a third distinct consumption pattern.
- — AST as substrate for both query generation and
query simplification. Arvind Murty's 2023 Vitess-
planner-fuzzer retrospective canonicalises two AST
consumption patterns at the same site: (1) random AST
generation — the fuzzer emits random SQL by walking a
grammar-backed template, generating random tables,
columns, predicates, and expressions with
semantic-role constraints (aggregations only in the
legal positions:
SELECT,GROUP BY,ORDER BY,HAVING); and (2) AST-based [[concepts/query- simplifier|query simplification]] — Andrés Taylor's simplifier walks the AST of a failing query and brute-force removes or modifies nodes, re-running the query to check if the error persists, to produce the minimal reproducer. Distinct from the query-language-DSL pattern (user-input → AST → backend-query) and the expression-runtime pattern (AST → interpreter): here the AST is the substrate for automated test-case generation and minimisation — the delta-debugging direction. Shipped as vitessio/vitess #13260 (fuzzer) + vitessio/vitess #13636 (simplifier).
Related¶
- concepts/peg-grammar — grammar class that emits ASTs well.
- concepts/ast-interpreter — simplest AST consumer; executes directly.
- concepts/bytecode-virtual-machine — AST compiled to bytecode for faster execution.
- concepts/callback-slice-interpreter — AST compiled to Go closures (Vitess).
- concepts/static-type-specialization — typed AST enables specialized opcodes.
- concepts/query-simplifier — AST-node-removal delta-debugger for SQL queries.
- concepts/sqlancer-logic-bug — DBMS logic-bug class that AST-based fuzzers target.
- patterns/ast-based-query-generation — the pattern that uses this IR in production query-language rewrites.
- patterns/ast-codegen-for-boilerplate-shim — C++ codegen consumer.
- patterns/callback-slice-vm-go — Vitess's Go interpreter consumer.
- patterns/static-type-specialized-bytecode — static typing on the AST.
- patterns/mysql-compatible-differential-fuzzing — AST-generated queries run against target + reference DBMS.
- systems/parslet — Ruby PEG parser that emits into transforms-to-AST.
- systems/github-issues — canonical user of this shape.
- systems/vitess-evalengine — canonical SQL-evaluation consumer.
- systems/sqlancer — DBMS logic-bug fuzzer that generates and mutates SQL ASTs.
Seen in — LLM-pipeline guidance role (new 2024-06)¶
- sources/2024-06-19-slack-ai-powered-conversion-from-enzyme-to-react-testing-library
— Slack's Enzyme-to-RTL codemod canonicalises a new role
for the AST: as both a conversion primitive and a
hallucination-
control primitive for LLM-based code-transformation
pipelines. The AST pass resolves the deterministic Enzyme→RTL
rewrites it has rules for (top-10 methods in Slack's codebase:
find13,244 calls,prop3,050,simulate2,755,text2,181,update2,147,instance1,549,props1,522,hostNodes1,477,exists1,174,first684 — plus 55 more), and for the cases it cannot handle leaves in-code annotation comments that shape the LLM's attention at the call site. Slack frames this explicitly: "By automating the conversion of simpler cases and providing annotations for all other instances through comments in the converted file, we successfully minimized hallucinations and nonsensical conversions from the LLM." Distinct from the fuzzing, interpreter, codegen, and query-rewrite AST roles already canonicalised: here the AST-produced artifact is input to an LLM, not input to a downstream code-generator or runtime. AST-only hit a ~45% conversion ceiling at Slack (context like the rendered DOM is invisible to the AST); composed with the LLM the pipeline reached ~80% on evaluation files. Ships as the open-source tool@slack/enzyme-to-rtl-codemod.
Seen in — LLM-autofixer role (new 2026-01)¶
- sources/2026-01-08-vercel-how-we-made-v0-an-effective-coding-agent
— Vercel v0's post-stream autofixer role for the AST:
parse the model's generated code to detect invariant
violations (e.g. "is
useQueryfrom@tanstack/react-querywrapped in aQueryClientProviderancestor?", "does every import in the generated code have a matching entry inpackage.json?") and feed the AST context to a small fine-tuned model that decides where to emit the fix. Deterministic fixes run in under 250 ms without a model call (e.g.package.jsoncompletion); judgment fixes compose the AST invariant-check with a placement model (patterns/deterministic-plus-model-autofixer). This is the cross-file / whole-artifact altitude of AST consumption — distinct from streaming-rewrite scope (per-token, local) and from the Slack Enzyme-to-RTL codemod which parses the source AST at conversion time. Here the AST parses the LLM's output after it has finished streaming, detects invariant violations objectively, and invokes the LLM only where judgment is required — a sixth AST-consumption role alongside query-language IR, codegen, expression interpretation, fuzzing, and LLM-attention-shaping.