Skip to content

CONCEPT Cited by 4 sources

Abstract Syntax Tree

An Abstract Syntax Tree (AST) is an intermediate representation (IR) of a program, query, or expression, structured as a tree whose interior nodes are operators / productions and whose leaves are atomic tokens (literals, identifiers, filter terms). The tree abstracts the source — whitespace, punctuation, parse-stage-only scaffolding are dropped; only the semantic structure remains.

ASTs are the standard IR for any tool that must recurse over the input to emit something else: compilers, linters, refactoring tools, transpilers, and — the focus of this concept page — query languages that compile to a backend query document.

Why the AST is load-bearing for query-language rewrites

For a flat query language — e.g. assignee:@me label:support new-project where every term is implicitly AND-joined — a list is a sufficient IR. Each term maps independently to a backend clause; the backend combines them linearly.

Once the language grows nesting (parentheses, arbitrary operators across fields), the list breaks. You need a structure that can represent a tree of operators with subtrees as operands. The AST is that structure.

The sequence GitHub Issues went through is the standard story (Source: sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean):

  1. Original: flat parser emits a list → linear map to Elasticsearch snippets → concatenate.
  2. First-gen upgrade (2021): comma-OR on label: only, shoehorned onto the list representation — works for one field, can't generalize.
  3. 2025 rewrite: parse to AST → recursive traversal → emit nested Elasticsearch bool query. Both legacy flat queries and new nested queries become valid trees (legacy queries become left-leaning AND-spine trees) so the backward-compat story is "same IR, different depths" rather than "two code paths".

Worked example

Input:

is:issue AND (author:deborah-digges OR author:monalisa)

AST (simplified shape — captures and strips parens):

root
 └── and
      ├── filter_term(is, issue)          # left
      └── or                                # right
           ├── filter_term(author, deborah-digges)
           └── filter_term(author, monalisa)

A recursive walker visiting this AST emits one bool-query clause per node: - and{bool: {must: [<left>, <right>]}} - or{bool: {should: [<left>, <right>]}} - leaf filter_term(K, V) → backend-specific clause

See patterns/ast-based-query-generation for the pattern.

AST vs parse tree

  • Parse tree (concrete syntax tree, CST) contains every rule match the grammar makes, including punctuation nodes, whitespace nodes, list-of-one-element nodes, etc.
  • AST discards parse-only scaffolding and keeps only the semantically meaningful nodes.

In parslet, the raw grammar emits a parse tree (nested Hash of Slices); a Transform rewrites it to the AST. Some parsing libraries collapse both steps into one.

When an AST is not warranted

  • Truly flat DSLs (e.g., tag filters with no nesting, single-operator filters): a list/struct is enough.
  • Single-use query builders where the query text maps 1:1 to backend calls by string substitution: skipping the AST saves a layer of indirection.
  • Small vocabularies where the language can never grow — e.g. a URL-parameter-driven search where every recognised parameter is a top-level field. The moment a ( shows up in a feature request, this judgement has to be revisited.

The GitHub Issues 2021 comma-OR delivery is a case study in the second answer being correct at the time and incorrect four years later: the rewrite became inevitable when the syntax had to nest, but the interim delivery was not wasted.

Seen in

First wiki instance of the same AST consumed by two coexisting interpreters — Vitess retains the AST interpreter permanently as deoptimization fallback + fuzz-oracle sibling. Extends the AST-as-IR story beyond query-language rewrites and C++-codegen to expression-evaluation runtimes as a third distinct consumption pattern.

  • AST as substrate for both query generation and query simplification. Arvind Murty's 2023 Vitess- planner-fuzzer retrospective canonicalises two AST consumption patterns at the same site: (1) random AST generation — the fuzzer emits random SQL by walking a grammar-backed template, generating random tables, columns, predicates, and expressions with semantic-role constraints (aggregations only in the legal positions: SELECT, GROUP BY, ORDER BY, HAVING); and (2) AST-based [[concepts/query- simplifier|query simplification]] — Andrés Taylor's simplifier walks the AST of a failing query and brute-force removes or modifies nodes, re-running the query to check if the error persists, to produce the minimal reproducer. Distinct from the query-language-DSL pattern (user-input → AST → backend-query) and the expression-runtime pattern (AST → interpreter): here the AST is the substrate for automated test-case generation and minimisation — the delta-debugging direction. Shipped as vitessio/vitess #13260 (fuzzer) + vitessio/vitess #13636 (simplifier).

Seen in — LLM-pipeline guidance role (new 2024-06)

  • sources/2024-06-19-slack-ai-powered-conversion-from-enzyme-to-react-testing-library — Slack's Enzyme-to-RTL codemod canonicalises a new role for the AST: as both a conversion primitive and a hallucination- control primitive for LLM-based code-transformation pipelines. The AST pass resolves the deterministic Enzyme→RTL rewrites it has rules for (top-10 methods in Slack's codebase: find 13,244 calls, prop 3,050, simulate 2,755, text 2,181, update 2,147, instance 1,549, props 1,522, hostNodes 1,477, exists 1,174, first 684 — plus 55 more), and for the cases it cannot handle leaves in-code annotation comments that shape the LLM's attention at the call site. Slack frames this explicitly: "By automating the conversion of simpler cases and providing annotations for all other instances through comments in the converted file, we successfully minimized hallucinations and nonsensical conversions from the LLM." Distinct from the fuzzing, interpreter, codegen, and query-rewrite AST roles already canonicalised: here the AST-produced artifact is input to an LLM, not input to a downstream code-generator or runtime. AST-only hit a ~45% conversion ceiling at Slack (context like the rendered DOM is invisible to the AST); composed with the LLM the pipeline reached ~80% on evaluation files. Ships as the open-source tool @slack/enzyme-to-rtl-codemod.

Seen in — LLM-autofixer role (new 2026-01)

  • sources/2026-01-08-vercel-how-we-made-v0-an-effective-coding-agent — Vercel v0's post-stream autofixer role for the AST: parse the model's generated code to detect invariant violations (e.g. "is useQuery from @tanstack/react-query wrapped in a QueryClientProvider ancestor?", "does every import in the generated code have a matching entry in package.json?") and feed the AST context to a small fine-tuned model that decides where to emit the fix. Deterministic fixes run in under 250 ms without a model call (e.g. package.json completion); judgment fixes compose the AST invariant-check with a placement model (patterns/deterministic-plus-model-autofixer). This is the cross-file / whole-artifact altitude of AST consumption — distinct from streaming-rewrite scope (per-token, local) and from the Slack Enzyme-to-RTL codemod which parses the source AST at conversion time. Here the AST parses the LLM's output after it has finished streaming, detects invariant violations objectively, and invokes the LLM only where judgment is required — a sixth AST-consumption role alongside query-language IR, codegen, expression interpretation, fuzzing, and LLM-attention-shaping.
Last updated · 542 distilled / 1,571 read