PATTERN

AST-normalized query fingerprinting¶

Problem. An observability system needs to aggregate metrics across a stream of SQL executions where every execution has different literal values (id = 1, id = 2, id = 3), different casing, different whitespace. Per- execution metrics don't scale; regex-based literal stripping misses too many equivalences.

Solution. Produce the grouping key by parsing the SQL to an AST, walking the tree to replace literals with placeholders and normalise surface-syntax differences, then hashing the canonical rendering. Two executions share a fingerprint iff their ASTs are equivalent after this normalisation pass.

(Source: .)

Mechanism¶

Own a SQL parser — or borrow one. PlanetScale reuses Vitess's query-serving-layer parser so the observability substrate and the query-execution substrate agree on parse semantics.
Walk the AST, replacing literal nodes with a generic placeholder (?). Typical literal types: numeric, string, boolean, date, NULL, array literals.
Normalise surface-syntax during the walk:
Keyword case (SELECT ⇌ select).
Identifier case (dialect-specific).
Redundant parentheses.
Whitespace collapsing.
Trailing semicolons.
Serialise the normalised AST back to SQL.
Hash the canonical string — the hash is the query pattern fingerprint (concepts/query-pattern-fingerprint).

Why not regex?¶

Regex-based literal stripping (the common shortcut) breaks on quoting, parens, nested expressions, and keyword casing. The AST-walk approach round-trips through the grammar so all of these collapse by construction rather than by hand- rolled special cases.

PlanetScale's post calls this out explicitly: "Beyond literal extraction, Vitess's AST normalization also helps eliminate surface-level syntactic differences, such as casing differences or the presence of redundant parentheses."

Structural consequence: the fingerprint is the¶

aggregation axis

Once fingerprinting is stable:

Counts per fingerprint per interval — how often does this pattern run?
Cumulative time per fingerprint — where is the workload spending time? (patterns/digest-based-query-prioritization).
Latency sketch per fingerprint — p95 per pattern (concepts/ddsketch-error-bounded-percentile).
Rows-read / rows-returned ratio per fingerprint — index-opportunity signal.

All of these become cheap once the fingerprint collapses the high-cardinality query stream to a low-cardinality pattern stream.

PlanetScale's substrate — Vitess's query planner — is already a SQL parser in the request path. Reusing it for telemetry normalisation means:

The observability substrate and the execution substrate agree on SQL grammar by construction.
Bug-for-bug compatibility — new SQL shapes supported by the planner are supported by fingerprinting too.
No duplicate parser to keep in sync.

Systems without a first-party parser on the request path have to pick between a shared-library parser (risk of grammar drift) or building one just for observability (duplicate maintenance burden).

Seen in¶

— canonical wiki disclosure. Rafer Hazen, 2023-04-20, canonicalises Insights' reuse of the Vitess SQL parser for AST-walk literal replacement + surface-syntax normalisation. Worked example: select * from users where id = 123 normalises to select * from users where id = ?, which hashes to the fingerprint.

AST-normalized query fingerprinting¶

Mechanism¶

Why not regex?¶

Structural consequence: the fingerprint is the¶

Sharing the parser is a load-bearing choice¶

Seen in¶

Related¶