Skip to content

CONCEPT Cited by 1 source

PEG grammar (Parsing Expression Grammar)

A Parsing Expression Grammar (PEG) is a formal grammar class (Ford 2004) defined by ordered choice (A | B means try A first, fall back to B only if A fails), greedy-and-committed parsing (no backtracking across the choice operator once one alternative succeeds inside a sequence), and no separate lexer (the grammar operates directly on characters). Because alternation is ordered and parsing is deterministic, PEGs avoid the ambiguity class that makes CFG-based parsers hard to reason about — there is never "more than one valid parse" for a given input, and a top-down recursive-descent implementation matches the grammar directly.

PEGs are a natural fit for query languages and config DSLs — small grammars where the author wants exact control over precedence, grouping, and dispatch between legacy and new syntax.

Key properties

  • Ordered choice (/ or | depending on notation): expr = A | B tries A first, then B. Unlike a CFG's unordered alternation, this gives the grammar author explicit control.
  • Unbounded lookahead via & / ! predicates (syntactic predicate operators): expr = A !B means match A only if not followed by B, without consuming B.
  • No lexer phase: the grammar parses characters directly, sidestepping the lexer/parser impedance mismatch (handy for DSLs with fluid token boundaries).
  • Deterministic + unambiguous: any grammar that type-checks produces exactly one parse per input (or fails).
  • Memoization (packrat parsing) yields linear time at the cost of memory. Many implementations (including parslet) do not memoize by default — they rely on grammars that don't re-enter the same rule at the same position.
  • Left-recursion is natively rejected. Traditional PEG implementations require rewriting left-recursive rules into right-recursive or iterative form. Some modern PEG tools (e.g., Medeiros' left-recursion extension) handle it; parslet does not.

Operator-precedence layering idiom

The canonical PEG idiom for encoding binary operator precedence is layered right-recursion: one rule per precedence level, each rule matching at least one higher-precedence expression, optionally followed by its operator and a recursive tail at its own level.

or_operation  = and_operation (OR or_operation)?
and_operation = primary       (AND and_operation)?
primary       = "(" or_operation ")" | atom

Because or_operation is the root and recurses through and_operation before consuming an OR, AND binds tighter than OR. Parenthesised expressions re-enter at the lowest precedence (the root), so grouping cleanly overrides precedence.

This is the exact skeleton the systems/parslet library ships as its boolean_algebra.rb example, which GitHub's Issues-search rewrite post quotes as the grammar core. (Source: sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean)

Why PEG over CFG for production query languages

  • Ambiguity doesn't silently creep in. LALR / earley parsers can produce parser tables that accept ambiguous grammars; PEG forbids it by construction.
  • Grammar reads as code, not as a notation the team has to re-learn during on-call.
  • Easy to add new syntax without breaking old syntax. The GitHub Issues rewrite had to accept both the legacy flat query form (assignee:@me label:support new-project) and new nested form (author:A AND (label:bug OR label:epic)). PEG's ordered alternation lets the grammar have one entry rule that tries the nested form first and falls back to the flat form, with tests enforcing no legacy query has a nested-form parse that disagrees.
  • Error messages are grammar-literal. Failures point at exactly which rule failed where, which is useful for showing syntax errors in a search box.

Trade-offs

  • Tooling is less mature than CFG ecosystems (yacc/bison derivatives, ANTLR). For very large grammars (full programming languages), tooling gaps show up.
  • Left-recursion rewrite tax. Right-recursive alternatives (as above) are mechanical but can obscure intent vs a natural left-recursive spec.
  • Performance without packrat can degenerate to exponential on pathological inputs. In practice, most query-DSL grammars are linear-time on human-written inputs without memoization, but adversarial input requires care.

Seen in

Last updated · 200 distilled / 1,178 read