PLANETSCALE 2024-03-29 Tier 3

PlanetScale — Identifying and profiling problematic MySQL queries

Summary

Ben Dicken (PlanetScale, 2024-03-29) publishes a pedagogical field manual for native MySQL query diagnosis: how to use performance_schema + sys tables to identify which queries to fix, and EXPLAIN ANALYZE + stage-timing profiling via setup_instruments / setup_consumers / events_stages_history_long to drill into individual query execution. The post closes by positioning PlanetScale Insights as the product that replaces the manual poking-around workflow with a visualisation + anomaly-detection dashboard over the same underlying data.

Key takeaways

  • performance_schema is an in-memory storage engine ("all of the information it tracks is stored in an in-memory PERFORMANCE_SCHEMA storage engine") containing ~113 tables on recent MySQL versions. It's on by default but can be disabled on memory-constrained hosts. (Source: this post.)
  • events_statements_summary_by_digest is the entry point for "which queries are expensive?" — keyed by digest (normalised SQL with literals stripped), it exposes COUNT_STAR, SUM_TIMER_WAIT, AVG_TIMER_WAIT, MAX_TIMER_WAIT, SUM_LOCK_TIME. Timer values are in picoseconds (divide by 1e12 for seconds).
  • sys schema is the ergonomic front-end over performance_schema. Canonical diagnostic tables named: statements_with_sorting, statements_with_runtimes_in_95th_percentile, statements_with_full_table_scans.
  • table_io_waits_summary_by_index_usage surfaces per-index hit counts including reads that bypassed every index (INDEX_NAME IS NULL row). Worked example: 2.57 billion unindexed reads on game.message vs ~164k for the one used index (from_id) — diagnostic signature of missing indexes.
  • Stage-level profiling via events_stages_history_long requires three config toggles: setup_instruments.ENABLED = 'YES', setup_consumers.ENABLED = 'YES', and setup_actors.HISTORY = 'YES'. Workflow: capture thread_id from performance_schema.threads, run the query, look up the statement in events_statements_history_long, bracket with its event_id / end_event_id, then query events_stages_history_long between those bounds to see per-stage milliseconds (stage/sql/executing, stage/sql/optimizing, stage/sql/statistics, stage/sql/Opening tables, stage/sql/waiting for handler commit, ...).
  • Worked stage-profile datum: on the post's problematic query, stage/sql/executing consumed 735.3 ms out of a ~736 ms total — execution-bound, not lock-bound or optimising-bound. The value of stage timing is that it would flag lock-wait or optimiser time if those were the bottleneck.
  • Selective profiling via setup_actors: flip global defaults off with UPDATE setup_actors SET ENABLED='NO', HISTORY='NO' WHERE HOST='%' AND USER='%', then insert a specific (HOST, USER) row to scope instrumentation to a single test principal — avoids the performance tax of fleet-wide history tracking.
  • EXPLAIN ANALYZE reports actual per-iterator costs alongside planner estimates — the worked query shows a nested-loop join with a full 1M-row table scan on message followed by two single-row PK lookups per row, actual 320 ms on 345,454 rows.
  • PlanetScale Insights is the vendor's answer to the tedium: "gleaning this information can be tedious. Getting exactly what you want requires significant poking around and digging through tables in performance_schema and sys." Insights provides visualisations + sort-by-rows-read + automatic anomaly detection over the same underlying digest data.
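The digest-table lookup above can be sketched as a single ranking query — a minimal sketch, not the post's verbatim SQL; the column names are real performance_schema columns, while the ordering and LIMIT are illustrative choices:

```sql
-- Rank statement digests by total accumulated wait time.
-- Timer columns are picoseconds; dividing by 1e12 yields seconds.
SELECT DIGEST_TEXT,
       COUNT_STAR,
       SUM_TIMER_WAIT / 1e12 AS total_sec,
       AVG_TIMER_WAIT / 1e12 AS avg_sec,
       MAX_TIMER_WAIT / 1e12 AS max_sec,
       SUM_LOCK_TIME  / 1e12 AS lock_sec
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```

Sorting by SUM_TIMER_WAIT surfaces aggregate cost; swapping in MAX_TIMER_WAIT or COUNT_STAR answers the "worst single run" and "hottest query" variants of the same question.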
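The per-index read counts behind the 2.57-billion-unindexed-reads datum can be pulled with a query along these lines (a sketch using the post's game.message example; the table and column names in the WHERE clause come from the worked example, the query shape is an assumption):

```sql
-- Reads that bypassed every index appear as the INDEX_NAME IS NULL row.
SELECT OBJECT_SCHEMA, OBJECT_NAME, INDEX_NAME, COUNT_READ
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE OBJECT_SCHEMA = 'game'
  AND OBJECT_NAME   = 'message'
ORDER BY COUNT_READ DESC;
```

A large NULL-index row dwarfing every named index is the missing-index signature the bullet describes.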
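The stage-profiling workflow and the setup_actors scoping can be sketched together as follows — a hedged sketch: the session variables (@thread_id, @event_id, @end_event_id) are stand-ins for values captured in the preceding steps, 'profiling_user' is a hypothetical principal, and the LIKE patterns are one plausible way to scope the toggles:

```sql
-- Scope instrumentation to one test principal instead of the whole fleet.
UPDATE performance_schema.setup_actors
   SET ENABLED = 'NO', HISTORY = 'NO'
 WHERE HOST = '%' AND USER = '%';
INSERT INTO performance_schema.setup_actors
       (HOST, USER, ROLE, ENABLED, HISTORY)
VALUES ('localhost', 'profiling_user', '%', 'YES', 'YES');

-- Enable stage instruments and their history consumers.
UPDATE performance_schema.setup_instruments
   SET ENABLED = 'YES', TIMED = 'YES'
 WHERE NAME LIKE 'stage/%';
UPDATE performance_schema.setup_consumers
   SET ENABLED = 'YES'
 WHERE NAME LIKE '%stages%';

-- Capture this connection's thread id, run the query under test, then
-- find its statement row and bracket the stages by event id.
SELECT THREAD_ID FROM performance_schema.threads
 WHERE PROCESSLIST_ID = CONNECTION_ID();

SELECT EVENT_ID, END_EVENT_ID, SQL_TEXT
  FROM performance_schema.events_statements_history_long
 WHERE THREAD_ID = @thread_id;          -- @thread_id from the step above

SELECT EVENT_NAME,
       (TIMER_END - TIMER_START) / 1e9 AS ms  -- picoseconds -> milliseconds
  FROM performance_schema.events_stages_history_long
 WHERE THREAD_ID = @thread_id
   AND EVENT_ID BETWEEN @event_id AND @end_event_id
 ORDER BY EVENT_ID;
```

The final result set is where a lock-bound or optimiser-bound query would betray itself: most wall-clock time landing in a stage other than stage/sql/executing.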

Systems extracted

  • MySQL — target engine; performance_schema + sys are shipped with the server.
  • InnoDB — storage-engine context for the index-usage table (PRIMARY / secondary keys).
  • PlanetScale Insights — productised observability over the same data; Postgres + MySQL surface.
  • PlanetScale — publisher; positions Insights as the managed-service alternative to manual poking.

Concepts extracted

Patterns extracted

Operational numbers

  • ~113 performance_schema tables on a recent MySQL.
  • Timer unit: picoseconds (divide by 1 trillion for seconds).
  • Worked datum: 735.3 ms in stage/sql/executing out of ~736 ms total wall-clock.
  • Worked datum: 2.57 × 10⁹ unindexed reads vs 164,500 indexed reads on a single table — ratio as diagnostic signal.
  • Worked datum: full-table-scan query running 6,742 times accumulating 6.45 min of latency.
  • EXPLAIN ANALYZE worked datum: nested-loop join with table scan on m (1M rows) + single-row PK lookup on p1 + single-row PK lookup on p2, actual 320 ms for 345,454 rows.
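The EXPLAIN ANALYZE datum above maps onto a query of roughly this shape — a hypothetical reconstruction, not the post's actual SQL: from_id appears in the post, but the player table and the content/name/to_id columns are guesses at the schema:

```sql
-- Nested-loop join: full scan of message (m), then two single-row
-- PK lookups per row (p1, p2). EXPLAIN ANALYZE executes the statement,
-- so avoid running it on expensive queries or DML in production.
EXPLAIN ANALYZE
SELECT m.content, p1.name, p2.name
  FROM message m
  JOIN player p1 ON p1.id = m.from_id
  JOIN player p2 ON p2.id = m.to_id;
```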

Caveats

  • Instrumentation has overhead — Dicken explicitly flags "(small) adverse effect on the overall performance of your system" from enabling history for all users; selective actor scoping is the mitigation.
  • performance_schema is in-memory — data is lost on restart; no retention beyond memory size; no long-term trend analysis without an exporter.
  • EXPLAIN ANALYZE actually runs the query — unsafe for expensive queries on production and for UPDATE / DELETE / INSERT side effects (already canonicalised on the EXPLAIN ANALYZE wiki page).
  • Timer-unit surprise — picoseconds is easy to misread as nanoseconds; every derived number is 1000× off if the mistake is made.
  • Digest grouping is literal-stripped — queries with different parameters are grouped; non-parameter differences (different table names, different column lists) are separate digests.
  • No production deployment numbers — post is pedagogy; PlanetScale doesn't disclose how Insights' own ingestion of this data scales, nor the retention policy or sampling strategy behind the visualisations.
  • Tier-3 PlanetScale pedagogical voice — Ben Dicken's fifth-plus wiki ingest (canonical database-internals educator); default-include per companies/planetscale skip rules.

Source
