CONCEPT Cited by 1 source
PEG grammar (Parsing Expression Grammar)¶
A Parsing Expression Grammar (PEG) is a formal grammar class
(Ford 2004) defined by ordered choice (A | B means try A
first, fall back to B only if A fails), greedy-and-committed
parsing (no backtracking across the choice operator once one
alternative succeeds inside a sequence), and no separate lexer
(the grammar operates directly on characters). Because alternation
is ordered and parsing is deterministic, PEGs avoid the ambiguity
class that makes CFG-based parsers hard to reason about — there
is never "more than one valid parse" for a given input, and a
top-down recursive-descent implementation matches the grammar
directly.
PEGs are a natural fit for query languages and config DSLs — small grammars where the author wants exact control over precedence, grouping, and dispatch between legacy and new syntax.
Key properties¶
- Ordered choice (
/or|depending on notation):expr = A | BtriesAfirst, thenB. Unlike a CFG's unordered alternation, this gives the grammar author explicit control. - Unbounded lookahead via
&/!predicates (syntactic predicate operators):expr = A !Bmeans match A only if not followed by B, without consuming B. - No lexer phase: the grammar parses characters directly, sidestepping the lexer/parser impedance mismatch (handy for DSLs with fluid token boundaries).
- Deterministic + unambiguous: any grammar that type-checks produces exactly one parse per input (or fails).
- Memoization (packrat parsing) yields linear time at the cost of memory. Many implementations (including parslet) do not memoize by default — they rely on grammars that don't re-enter the same rule at the same position.
- Left-recursion is natively rejected. Traditional PEG implementations require rewriting left-recursive rules into right-recursive or iterative form. Some modern PEG tools (e.g., Medeiros' left-recursion extension) handle it; parslet does not.
Operator-precedence layering idiom¶
The canonical PEG idiom for encoding binary operator precedence is layered right-recursion: one rule per precedence level, each rule matching at least one higher-precedence expression, optionally followed by its operator and a recursive tail at its own level.
or_operation = and_operation (OR or_operation)?
and_operation = primary (AND and_operation)?
primary = "(" or_operation ")" | atom
Because or_operation is the root and recurses through
and_operation before consuming an OR, AND binds tighter than
OR. Parenthesised expressions re-enter at the lowest precedence
(the root), so grouping cleanly overrides precedence.
This is the exact skeleton the systems/parslet library ships
as its boolean_algebra.rb example, which GitHub's Issues-search
rewrite post quotes as the grammar core.
(Source: sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean)
Why PEG over CFG for production query languages¶
- Ambiguity doesn't silently creep in. LALR / earley parsers can produce parser tables that accept ambiguous grammars; PEG forbids it by construction.
- Grammar reads as code, not as a notation the team has to re-learn during on-call.
- Easy to add new syntax without breaking old syntax. The
GitHub Issues rewrite had to accept both the legacy flat query
form (
assignee:@me label:support new-project) and new nested form (author:A AND (label:bug OR label:epic)). PEG's ordered alternation lets the grammar have one entry rule that tries the nested form first and falls back to the flat form, with tests enforcing no legacy query has a nested-form parse that disagrees. - Error messages are grammar-literal. Failures point at exactly which rule failed where, which is useful for showing syntax errors in a search box.
Trade-offs¶
- Tooling is less mature than CFG ecosystems (yacc/bison derivatives, ANTLR). For very large grammars (full programming languages), tooling gaps show up.
- Left-recursion rewrite tax. Right-recursive alternatives (as above) are mechanical but can obscure intent vs a natural left-recursive spec.
- Performance without packrat can degenerate to exponential on pathological inputs. In practice, most query-DSL grammars are linear-time on human-written inputs without memoization, but adversarial input requires care.
Seen in¶
- sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean — canonical wiki instance. GitHub's Issues-search 2025 rewrite uses a PEG grammar (via systems/parslet) to parse user search strings into an AST. Grammar accepts both legacy flat queries and new nested boolean queries as distinct entry rules — backward compat as a grammar- design property.
Related¶
- concepts/abstract-syntax-tree — the IR PEG parsers typically emit.
- systems/parslet — Ruby PEG parser combinator library.
- patterns/ast-based-query-generation — the end-to-end shape PEG-based query rewrites fit into.