Skip to content

PATTERN

AST-normalized query fingerprinting

Problem. An observability system needs to aggregate metrics across a stream of SQL executions where every execution has different literal values (id = 1, id = 2, id = 3), different casing, different whitespace. Per- execution metrics don't scale; regex-based literal stripping misses too many equivalences.

Solution. Produce the grouping key by parsing the SQL to an AST, walking the tree to replace literals with placeholders and normalise surface-syntax differences, then hashing the canonical rendering. Two executions share a fingerprint iff their ASTs are equivalent after this normalisation pass.

(Source: .)

Mechanism

  1. Own a SQL parser — or borrow one. PlanetScale reuses Vitess's query-serving-layer parser so the observability substrate and the query-execution substrate agree on parse semantics.
  2. Walk the AST, replacing literal nodes with a generic placeholder (?). Typical literal types: numeric, string, boolean, date, NULL, array literals.
  3. Normalise surface-syntax during the walk:
  4. Keyword case (SELECTselect).
  5. Identifier case (dialect-specific).
  6. Redundant parentheses.
  7. Whitespace collapsing.
  8. Trailing semicolons.
  9. Serialise the normalised AST back to SQL.
  10. Hash the canonical string — the hash is the query pattern fingerprint (concepts/query-pattern-fingerprint).

Why not regex?

Regex-based literal stripping (the common shortcut) breaks on quoting, parens, nested expressions, and keyword casing. The AST-walk approach round-trips through the grammar so all of these collapse by construction rather than by hand- rolled special cases.

PlanetScale's post calls this out explicitly: "Beyond literal extraction, Vitess's AST normalization also helps eliminate surface-level syntactic differences, such as casing differences or the presence of redundant parentheses."

Structural consequence: the fingerprint is the

aggregation axis

Once fingerprinting is stable:

All of these become cheap once the fingerprint collapses the high-cardinality query stream to a low-cardinality pattern stream.

Sharing the parser is a load-bearing choice

PlanetScale's substrate — Vitess's query planner — is already a SQL parser in the request path. Reusing it for telemetry normalisation means:

  • The observability substrate and the execution substrate agree on SQL grammar by construction.
  • Bug-for-bug compatibility — new SQL shapes supported by the planner are supported by fingerprinting too.
  • No duplicate parser to keep in sync.

Systems without a first-party parser on the request path have to pick between a shared-library parser (risk of grammar drift) or building one just for observability (duplicate maintenance burden).

Seen in

  • — canonical wiki disclosure. Rafer Hazen, 2023-04-20, canonicalises Insights' reuse of the Vitess SQL parser for AST-walk literal replacement + surface-syntax normalisation. Worked example: select * from users where id = 123 normalises to select * from users where id = ?, which hashes to the fingerprint.
Last updated · 542 distilled / 1,571 read