PlanetScale — Optimizing SQL with Query Statistics¶
Summary¶
David Graham's 2021-08-31 feature-launch post announces that
all PlanetScale database branches now track
statistics about every SQL query executed against them, with zero
overhead, powered by the Vitess layer that fronts
every PlanetScale MySQL. The post frames this as a
replacement for the traditional observability primitives —
tcpdump, SHOW PROCESSLIST, third-party agents — all of which either
add load to the running system or capture only a small, transient
subset of traffic. PlanetScale's approach is: "By using Vitess, which
powers PlanetScale databases, to track how many times a query runs,
how many rows it returned, and how long each query takes to complete,
we provide a complete view of query traffic on the database." This is
the 2021-era genesis of what later became
PlanetScale Insights — the post
defines a 100 ms slow-query threshold as the flag for optimization candidates, and
demonstrates the feature's value via a walk-through of a production
query PlanetScale found in their own fleet — a backup-deletion query
whose average runtime dropped from 719 ms to under 20 ms (a 98 %
reduction) after reordering the columns in a composite index so the
most selective predicate came first.
Key takeaways¶
- Query statistics is a Vitess-layer capability, not a MySQL one. "Starting today, all PlanetScale database branches now track statistics about each SQL query that has executed against it without any overhead. By using Vitess, which powers PlanetScale databases, to track how many times a query runs, how many rows it returned, and how long each query takes to complete, we provide a complete view of query traffic on the database." The statistics captured per query pattern are: execution count, rows returned, execution duration. Because Vitess sits on the query path between the client and MySQL as a proxy, it can observe every query without needing MySQL's performance-schema or slow-log infrastructure enabled. (Source: this article.)
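The proxy-side bookkeeping the post describes can be sketched in a few lines. This is illustrative only: the three fields (execution count, rows returned, duration) come from the post, but the keying and data shapes are assumptions, not Vitess internals.

```python
# Sketch of per-pattern query statistics collected on the query path.
# Fields follow the post (count, rows returned, duration); everything
# else is hypothetical.
from collections import defaultdict

stats = defaultdict(lambda: {"count": 0, "rows": 0, "total_ms": 0.0})

def record(pattern: str, rows: int, duration_ms: float) -> None:
    """Called by the proxy once per completed query."""
    s = stats[pattern]
    s["count"] += 1
    s["rows"] += rows
    s["total_ms"] += duration_ms

q = "SELECT * FROM backup WHERE deleted_at IS NULL AND expires_at <= ? ..."
record(q, rows=1000, duration_ms=719.0)
record(q, rows=1000, duration_ms=705.0)

s = stats[q]
print(s["count"], s["rows"], s["total_ms"] / s["count"])  # 2 2000 712.0
```

Because every query flows through the proxy, the aggregate is exhaustive rather than sampled, which is exactly the property the post claims over tcpdump-style capture.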
- Traditional tools are inadequate for a complete view of query traffic. "When you experience database issues, capturing queries on a running system is a hard problem. Existing solutions, such as tcpdump, SHOW PROCESSLIST; or third-party monitoring agents, can cause additional system load or gather only a small subset of queries." The three named alternatives each have a distinct failure mode: `tcpdump` captures everything but imposes packet-processing overhead and produces data that's awkward to aggregate; `SHOW PROCESSLIST` is instantaneous and point-in-time, missing any query that finishes before the next poll; third-party agents sample (not exhaustive) or require host access that PaaS databases don't provide. Vitess's on-path instrumentation avoids all three.
- 100 ms is the flag threshold for slow-query candidates. "As a starting point, we flag any query running for longer than 100 milliseconds as a candidate for optimization." This is the first published instance of PlanetScale's canonical slow-query threshold — later inherited and reused across PlanetScale Insights. It is a product-UX heuristic, not a performance SLO: queries slower than 100 ms may be fine (they run in a background job with no user waiting), but they are probably worth a look.
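The heuristic itself is a one-line filter. In this sketch, only the 100 ms threshold and the 719 ms figure come from the post; the query patterns and the second latency value are made up for illustration:

```python
# Flag any query pattern whose average latency exceeds the 100 ms
# threshold as an optimization candidate (the post's heuristic).
THRESHOLD_MS = 100

patterns = [
    ("SELECT * FROM backup WHERE deleted_at IS NULL ...", 719.0),  # from the post
    ("SELECT * FROM user WHERE id = ?", 0.4),                      # hypothetical
]

candidates = [p for p, avg_ms in patterns if avg_ms > THRESHOLD_MS]
print(candidates)  # ['SELECT * FROM backup WHERE deleted_at IS NULL ...']
```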
- The motivating case study: a backup-deletion query running at 719 ms average. PlanetScale soft-deletes daily backups: mark the row with `deleted_at = NOW()`, then a background job deletes the external files. The job runs this query to find deletable rows:
```sql
SELECT *
FROM backup
WHERE deleted_at IS NULL
AND expires_at <= CURRENT_TIMESTAMP
ORDER BY id ASC
LIMIT 1000;
```
Average runtime on the production fleet, as surfaced by the new query-statistics feature: 719 ms. Fortunately this query runs inside a background job (no user waiting), but "it does consume too much time in our job workers and in the database." (Source: this article.)
- The fix is column reordering in a composite index, not a new index. The table already had a composite index on `(expires_at, data_deleted_at, deleted_at)`. The EXPLAIN plan showed the planner rejected it and chose a full table scan instead: `type: ALL`, `key: NULL`, `rows: <100s of thousands>`. The reason the optimiser rejected it is the two-part data skew David diagnosed:
  - 97 % of rows are expired → filtering on `expires_at <= CURRENT_TIMESTAMP` matches almost every row, so the leading column of the index is not selective for this query.
  - 97.5 % of rows are soft-deleted → filtering on `deleted_at IS NULL` matches only 2.5 % of rows, so this predicate is highly selective but cannot drive the key lookup of an index whose leftmost column is `expires_at` (the leftmost-prefix rule).
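The skew argument reduces to simple arithmetic: whichever predicate leaves fewer rows to examine should lead the index. The row fractions below are the article's figures; the table size is an assumption (the post says only "100s of thousands" of rows):

```python
# Sketch: pick the leading composite-index column by predicate selectivity.
rows = 500_000             # assumed table size ("100s of thousands")
frac_expired = 0.97        # expires_at <= NOW() matches ~97 % of rows
frac_soft_deleted = 0.975  # deleted_at IS NOT NULL for ~97.5 % of rows

# Rows left to examine if each predicate drives the index lookup:
via_expires_at = round(rows * frac_expired)             # almost everything
via_deleted_at = round(rows * (1 - frac_soft_deleted))  # a small slice

leading = "deleted_at" if via_deleted_at < via_expires_at else "expires_at"
print(leading, via_deleted_at, via_expires_at)  # deleted_at 12500 485000
```

A 12,500-row index range versus a 485,000-row one is the whole case for the swap.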
The fix: swap the column order to `(deleted_at, expires_at, data_deleted_at)`. The new EXPLAIN plan: `type: range`, `key: index_backup_on_deleted_at_and_expires_at_and_data_deleted_at`, `rows: 1`, `filtered: 100.00`. Average runtime: under 20 ms — a 98 % reduction.
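In DDL terms the fix is a single index swap. The statement below is a sketch: the index names follow the EXPLAIN output quoted in the post, but the post never shows the ALTER itself, so the statement is an assumption.

```sql
-- Hypothetical DDL for the fix: replace the old composite index with one
-- whose leftmost column is the selective predicate.
ALTER TABLE backup
  DROP INDEX index_backup_on_expires_at_and_data_deleted_at_and_deleted_at,
  ADD INDEX index_backup_on_deleted_at_and_expires_at_and_data_deleted_at
      (deleted_at, expires_at, data_deleted_at);
```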
- The generalised lesson: put the most-selective column first, but the answer depends on the workload. "We often include `deleted_at` as a trailing key in composite indexes for this reason, but this isn't quite right for the backup table." Most of PlanetScale's tables follow the common pattern where `deleted_at` is rarely populated (most rows are live) — there, `deleted_at` belongs as a trailing column. But the `backup` table inverts the pattern: most rows are soft-deleted (because each daily backup replaces the previous day's), so for this workload `deleted_at IS NULL` is the query's highly selective predicate. The composite-index column ordering is a workload-dependent decision, not a universal rule — canonicalised as composite-index column ordering.
- Bad index design degrades further as the table grows. "Full table scan performance further degrades as the table size grows, so this query would have eventually consumed enough time inside the database to impact other queries that are servicing web requests. By optimizing this query, we've ensured that this process will never impact the user experience." The 719 ms was tolerable today; it would not have been tolerable with 10× the backup rows. Query statistics gave PlanetScale a way to see the problem coming before the scan cost became a production incident.
- The workflow is: query statistics → identify slow pattern → EXPLAIN → redesign index → validate with EXPLAIN. The article walks this loop end-to-end. Query statistics flags the 719 ms query; `EXPLAIN` diagnoses why the existing index isn't used (full table scan); the fix (reorder composite columns) is a hypothesis; `EXPLAIN` on the new index validates the hypothesis (`type: range`, `rows: 1`). This is the canonical EXPLAIN-driven index verification pattern applied to composite-index redesign.
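The diagnose and validate steps of the loop are plain `EXPLAIN` runs on the unchanged query. The post shows only the output, so the command below is a sketch of what produces it:

```sql
-- Run before and after the index change and compare the plans:
-- type: ALL / key: NULL before, type: range / the new index after.
-- \G produces the vertical layout shown in the post.
EXPLAIN
SELECT *
FROM backup
WHERE deleted_at IS NULL
AND expires_at <= CURRENT_TIMESTAMP
ORDER BY id ASC
LIMIT 1000\G
```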
Architectural numbers¶
| Metric | Value |
|---|---|
| Average runtime, old index | 719 ms |
| Average runtime, new index | < 20 ms |
| Runtime reduction | ~98 % |
| Slow-query flag threshold | 100 ms |
| `deleted_at IS NOT NULL` row fraction (backup table) | 97.5 % |
| `expires_at <= NOW()` row fraction (backup table) | 97 % |
| Old index estimated rows (full scan) | "100s of thousands" (every row) |
| Old index estimated filtered | 5.00 |
| New index estimated rows | 1 |
| New index estimated filtered | 100.00 |
| Old index access type | ALL (full table scan) |
| New index access type | range |
| Old index possible_keys | index_backup_on_expires_at_and_data_deleted_at_and_deleted_at |
| New index key_len | 17 bytes |
| Batch size (LIMIT) | 1000 |
Architecture artefacts in the post¶
Two literal MySQL EXPLAIN output blocks —
before (full table scan) and after (range scan on the new
composite index) — are shown, both rendered in the post as MySQL's
standard \G-formatted vertical-per-column layout. The before plan
has:
```
type: ALL
possible_keys: index_backup_on_expires_at_and_data_deleted_at_and_deleted_at
key: NULL
rows: <100s of thousands — every row in the table>
filtered: 5.00
Extra: Using where
```
The after plan has:
```
type: range
possible_keys: index_backup_on_deleted_at_and_expires_at_and_data_deleted_at
key: index_backup_on_deleted_at_and_expires_at_and_data_deleted_at
key_len: 17
rows: 1
filtered: 100.00
Extra: Using index condition; Using filesort
```
The shift from type: ALL with key: NULL to type: range with the
composite index chosen is the canonical visual signature of a query
going from full-scan to index-scan — and is the fingerprint the
EXPLAIN-driven index
verification pattern teaches engineers to look for.
Caveats and context¶
- 2021-era feature launch, not a retrospective. The post is the debut of per-query statistics in PlanetScale; later posts (2024-2026 era) describe the deeper PlanetScale Insights evolution — query digests, AI-powered index suggestions via HypoPG, SQLCommenter-based query tagging, and graceful degradation through Resource Budgets in Traffic Control. This post establishes the foundational primitive (Vitess-layer query telemetry) that all those later features build on.
- The selectivity asymmetry is workload-specific. PlanetScale's own general guidance is that `deleted_at` typically belongs as a trailing column in composite indexes because most rows are live. The `backup` table is an exception because daily backups replace their predecessors, creating an unusually high soft-deleted row fraction. Generalising "put `deleted_at` first" would be as wrong as "always put `deleted_at` last" — the right rule is to order by selectivity for this query's data distribution.
- `Extra: Using filesort` in the new plan reflects the `ORDER BY id ASC` clause — the index keys on `(deleted_at, expires_at, data_deleted_at)` cannot satisfy the ordering, so MySQL sorts the filtered result set. For 1 row this is cheap; for larger result sets this would matter and might justify a four-column index `(deleted_at, expires_at, data_deleted_at, id)`.
- No published internal architecture of the telemetry pipeline in
this post. The article describes what the feature surfaces, not
how Vitess collects, aggregates, or stores the statistics. That
architecture (Kafka publication, in-memory coalescing, sharded
MySQL rollup tables) is disclosed in later posts such as
2026-04-21-planetscale-storing-time-series-data-in-sharded-mysql-to-power-query-insights.
Source¶
- Original: https://planetscale.com/blog/optimizing-sql-with-query-statistics
- Raw markdown:
raw/planetscale/2026-04-21-optimizing-sql-with-query-statistics-abd6a7d7.md
Related¶
- systems/planetscale — the hosted database that ships this feature
- systems/vitess — the query-path component that captures the telemetry
- systems/planetscale-insights — the 2024+ evolution of this primitive
- systems/mysql, systems/innodb — the storage engine underneath
- concepts/query-statistics-telemetry — canonical concept surfaced by this article
- concepts/composite-index-column-order — the workload-dependent ordering rule
- concepts/100ms-slow-query-threshold — PlanetScale's flag threshold
- concepts/composite-index — the underlying data-structure concept
- concepts/secondary-index — the broader index-type concept
- concepts/index-selectivity — the theoretical underpinning of column ordering
- concepts/leftmost-prefix-rule — why the leading column matters
- concepts/mysql-explain — the tool used for diagnosis and verification
- concepts/mysql-access-type — `ALL` vs `range` distinction
- patterns/explain-driven-index-verification — the workflow applied end-to-end
- patterns/composite-index-for-and-predicate — the pattern composite indexes serve
- companies/planetscale