PlanetScale — Optimizing SQL with Query Statistics

Summary

David Graham's 2021-08-31 feature-launch post announces that all PlanetScale database branches now track statistics about every SQL query executed against them, with zero overhead, powered by the Vitess layer that fronts every PlanetScale MySQL database. The post frames this as a replacement for the traditional observability primitives (tcpdump, SHOW PROCESSLIST, third-party agents), all of which either add load to the running system or capture only a small, transient subset of traffic. PlanetScale's approach is: "By using Vitess, which powers PlanetScale databases, to track how many times a query runs, how many rows it returned, and how long each query takes to complete, we provide a complete view of query traffic on the database." This is the 2021-era genesis of what later became PlanetScale Insights: the post defines a 100 ms slow-query threshold as the flagging signal, and demonstrates the feature's value via a walk-through of a production query PlanetScale found in their own fleet, a backup-deletion query whose average runtime dropped from 719 ms to under 20 ms (a 98 % reduction) after the columns in a composite index were reordered so the most selective predicate came first.

Key takeaways

  1. Query statistics is a Vitess-layer capability, not a MySQL one. "Starting today, all PlanetScale database branches now track statistics about each SQL query that has executed against it without any overhead. By using Vitess, which powers PlanetScale databases, to track how many times a query runs, how many rows it returned, and how long each query takes to complete, we provide a complete view of query traffic on the database." The statistics captured per query pattern are: execution count, rows returned, execution duration. Because Vitess sits on the query path between the client and MySQL as a proxy, it can observe every query without needing MySQL's performance-schema or slow-log infrastructure enabled. (Source: this article.)

  2. Traditional tools are inadequate for a complete view of query traffic. "When you experience database issues, capturing queries on a running system is a hard problem. Existing solutions, such as tcpdump, SHOW PROCESSLIST; or third-party monitoring agents, can cause additional system load or gather only a small subset of queries." The three named alternatives each fail differently: tcpdump captures everything but imposes packet-processing overhead and produces data that is awkward to aggregate; SHOW PROCESSLIST is a point-in-time snapshot, missing any query that finishes between polls; third-party agents sample rather than capture exhaustively, or require host access that a managed database platform does not expose. Vitess's on-path instrumentation avoids all three failure modes.

  3. 100 ms is the flag threshold for slow-query candidates. "As a starting point, we flag any query running for longer than 100 milliseconds as a candidate for optimization." This is the first published instance of PlanetScale's canonical slow-query threshold, later inherited and reused across PlanetScale Insights. It is a product-UX heuristic, not a performance SLO: a query slower than 100 ms may be fine (for instance, when it runs in a background job with no user waiting), but it is probably worth a look.
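
The same three measures (count, rows returned, duration) and the 100 ms cut-off can be illustrated against MySQL's own statement-digest table. This is a sketch of the idea only, not how PlanetScale's feature works: the post is explicit that the statistics come from Vitess, not from performance_schema.

```sql
-- Illustrative only: flag query patterns averaging over 100 ms using MySQL's
-- built-in digest table. Timer columns are in picoseconds, so 100 ms is
-- 100 * 1e9 ps and dividing by 1e9 converts back to milliseconds.
SELECT DIGEST_TEXT            AS query_pattern,
       COUNT_STAR             AS executions,
       SUM_ROWS_SENT          AS rows_returned,
       AVG_TIMER_WAIT / 1e9   AS avg_ms
FROM performance_schema.events_statements_summary_by_digest
WHERE AVG_TIMER_WAIT > 100 * 1e9   -- the 100 ms candidate threshold
ORDER BY AVG_TIMER_WAIT DESC;
```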

  4. The motivating case study: a backup-deletion query averaging 719 ms. PlanetScale soft-deletes daily backups: a background job finds expired rows that have not yet been marked, deletes the external files, and sets deleted_at = NOW(). The job finds deletable rows with this query:

SELECT *
FROM backup
WHERE deleted_at IS NULL
  AND expires_at <= CURRENT_TIMESTAMP
ORDER BY id ASC
LIMIT 1000;

Average runtime on the production fleet, as surfaced by the new query-statistics feature: 719 ms. Fortunately this query runs inside a background job (no user waiting), but "it does consume too much time in our job workers and in the database." (Source: this article.)
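
The marking statement itself is not shown in the post; a hypothetical sketch of the soft-delete marker it describes:

```sql
-- Hypothetical: setting deleted_at is what makes a row "soft-deleted";
-- once set, the batch SELECT above stops returning it. The id is a placeholder.
UPDATE backup
SET deleted_at = NOW()
WHERE id = 1;
```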

  5. The fix is column reordering in a composite index, not a new index. The table already had a composite index on (expires_at, data_deleted_at, deleted_at), but the EXPLAIN plan showed the planner rejecting it in favour of a full table scan: type: ALL, key: NULL, rows: <100s of thousands>. The optimiser rejected it because of the two-part data skew David diagnosed:

     • 97 % of rows are expired, so expires_at <= CURRENT_TIMESTAMP matches almost every row: the index's leading column is not selective for this query.

     • 97.5 % of rows are soft-deleted, so deleted_at IS NULL matches only 2.5 % of rows. This predicate is highly selective, but under the leftmost-prefix rule it cannot drive the key lookup on an index whose leftmost column is expires_at.

The fix: swap the column order to (deleted_at, expires_at, data_deleted_at). The new EXPLAIN plan: type: range, key: index_backup_on_deleted_at_and_expires_at_and_data_deleted_at, rows: 1, filtered: 100.00. Average runtime: under 20 ms, a 98 % reduction.
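
The post shows only the before/after EXPLAIN output, not the schema change itself; a hypothetical DDL statement for the reorder (index names taken from the plans) would look like:

```sql
-- Hypothetical DDL sketch: the ALTER statement does not appear in the post.
ALTER TABLE backup
  DROP INDEX index_backup_on_expires_at_and_data_deleted_at_and_deleted_at,
  ADD INDEX index_backup_on_deleted_at_and_expires_at_and_data_deleted_at
      (deleted_at, expires_at, data_deleted_at);
```

On a production table this size, the change would normally go through an online schema-change workflow (PlanetScale deploy requests, gh-ost, or similar) rather than a blocking ALTER.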

  6. The generalised lesson: put the most selective column first, but the right order depends on the workload. "We often include deleted_at as a trailing key in composite indexes for this reason, but this isn't quite right for the backup table." Most of PlanetScale's tables follow the common pattern in which deleted_at is rarely populated (most rows are live), so deleted_at belongs as a trailing column. The backup table inverts the pattern: most rows are soft-deleted, because each daily backup replaces the previous day's, so for this workload deleted_at IS NULL is the query's highly selective predicate. Composite-index column ordering is a workload-dependent decision, not a universal rule.
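
The matching fractions that drive this decision can be measured directly; an illustrative query (not from the post) for estimating each predicate's selectivity:

```sql
-- Illustrative only: MySQL evaluates boolean expressions to 0/1, so AVG over
-- them yields the fraction of rows each predicate matches. The predicate with
-- the smallest fraction is the strongest candidate for the leading column.
SELECT AVG(deleted_at IS NULL)              AS frac_not_deleted,  -- ~0.025 on backup
       AVG(expires_at <= CURRENT_TIMESTAMP) AS frac_expired       -- ~0.97 on backup
FROM backup;
```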

  7. Full-table-scan cost grows with the table, so the problem compounds. "Full table scan performance further degrades as the table size grows, so this query would have eventually consumed enough time inside the database to impact other queries that are servicing web requests. By optimizing this query, we've ensured that this process will never impact the user experience." The 719 ms was tolerable at the time; it would not have been with 10× the backup rows. Query statistics gave PlanetScale a way to see the problem coming before the scan cost became a production incident.

  8. The workflow is: query statistics → identify slow pattern → EXPLAIN → redesign index → validate with EXPLAIN. The article walks this loop end-to-end. Query statistics flag the 719 ms query; EXPLAIN diagnoses why the existing index isn't used (full table scan); the fix (reorder composite columns) is a hypothesis; EXPLAIN on the new index validates the hypothesis (type: range, rows: 1). This is the canonical EXPLAIN-driven index-verification pattern applied to composite-index redesign.
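
Condensed to statements, the loop is two EXPLAIN invocations around the schema change (a sketch; the full vertical-format plans appear under "Architecture artefacts" below):

```sql
-- Query statistics flag the 719 ms pattern; diagnose it with EXPLAIN:
EXPLAIN
SELECT *
FROM backup
WHERE deleted_at IS NULL
  AND expires_at <= CURRENT_TIMESTAMP
ORDER BY id ASC
LIMIT 1000;
-- before the index reorder: type: ALL, key: NULL (full table scan)

-- ...reorder the composite index, then re-run the same EXPLAIN:
-- after: type: range, rows: 1 (the hypothesis is validated)
```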

Architectural numbers

Metric                                               Value
Average runtime, old index                           719 ms
Average runtime, new index                           < 20 ms
Runtime reduction                                    ~98 %
Slow-query flag threshold                            100 ms
deleted_at IS NOT NULL row fraction (backup table)   97.5 %
expires_at <= NOW() row fraction (backup table)      97 %
Estimated rows, old plan (full scan)                 "100s of thousands" (every row)
Estimated filtered, old plan                         5.00
Estimated rows, new plan                             1
Estimated filtered, new plan                         100.00
Access type, old plan                                ALL (full table scan)
Access type, new plan                                range
possible_keys, old plan                              index_backup_on_expires_at_and_data_deleted_at_and_deleted_at
key_len, new plan                                    17 bytes
Batch size (LIMIT)                                   1000

Architecture artefacts in the post

Two literal MySQL EXPLAIN output blocks — before (full table scan) and after (range scan on the new composite index) — are shown, both rendered in the post as MySQL's standard \G-formatted vertical-per-column layout. The before plan has:

         type: ALL
possible_keys: index_backup_on_expires_at_and_data_deleted_at_and_deleted_at
          key: NULL
         rows: <100s of thousands—every row in the table>
     filtered: 5.00
        Extra: Using where

The after plan has:

         type: range
possible_keys: index_backup_on_deleted_at_and_expires_at_and_data_deleted_at
          key: index_backup_on_deleted_at_and_expires_at_and_data_deleted_at
      key_len: 17
         rows: 1
     filtered: 100.00
        Extra: Using index condition; Using filesort

The shift from type: ALL with key: NULL to type: range with the composite index chosen is the canonical visual signature of a query going from full-scan to index-scan — and is the fingerprint the EXPLAIN-driven index verification pattern teaches engineers to look for.

Caveats and context

  • 2021-era feature launch, not a retrospective. The post is the debut of per-query statistics in PlanetScale; later posts (2024-2026 era) describe the deeper PlanetScale Insights evolution — query digests, AI-powered index suggestions via HypoPG, SQLCommenter-based query tagging, and graceful degradation through Resource Budgets in Traffic Control. This post establishes the foundational primitive (Vitess-layer query telemetry) that all those later features build on.
  • The selectivity asymmetry is workload-specific. PlanetScale's own general guidance is that deleted_at typically belongs as a trailing column in composite indexes because most rows are live. The backup table is an exception because daily backups replace their predecessors, creating an unusually high soft-deleted row fraction. Generalising "put deleted_at first" would be as wrong as "always put deleted_at last" — the right rule is order by selectivity for this query's data distribution.
  • Extra: Using filesort in the new plan reflects the ORDER BY id ASC clause: the index keys on (deleted_at, expires_at, data_deleted_at) cannot satisfy the ordering, so MySQL sorts the filtered result set. For 1 row this is cheap; for larger result sets it would matter. A four-column index (deleted_at, expires_at, data_deleted_at, id) is the obvious candidate, but MySQL can read rows in id order from such an index only when the key parts before id are bound by equality, so the range predicate on expires_at would keep the filesort in play.
  • No published internal architecture of the telemetry pipeline in this post. The article describes what the feature surfaces, not how Vitess collects, aggregates, or stores the statistics. That architecture (Kafka publication, in-memory coalescing, sharded MySQL rollup tables) is disclosed in later posts such as 2026-04-21-planetscale-storing-time-series-data-in-sharded-mysql-to-power-query-insights.
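
If the filesort noted in the caveats above ever became costly, the four-column variant could be sketched as hypothetical DDL, with an important limitation:

```sql
-- Hypothetical: the four-column index discussed in the filesort caveat.
-- MySQL can return rows in id order from this index only when all key parts
-- before id are fixed by equality; with a range predicate on expires_at the
-- filesort would likely remain, so this is a judgment call, not a sure win.
ALTER TABLE backup
  ADD INDEX index_backup_on_deleted_at_expires_at_data_deleted_at_id
      (deleted_at, expires_at, data_deleted_at, id);
```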
