PlanetScale — Introducing Insights Anomalies¶
Summary¶
Rafer Hazen (PlanetScale, 2023-11-28) launches Insights Anomalies, a new surface under the Insights tab that automatically flags periods of database unhealthiness. The central architectural choice: define database health as the percentage of queries that ran unusually slowly compared to the last week's baseline for their own fingerprint, not against any global latency threshold. Per-pattern baselines are stored as probabilistic percentile sketches in MySQL; a query counts as unusually slow if its execution time exceeds 2σ (97.7th percentile) of the rolling-week distribution for its fingerprint. When an anomaly fires, Insights also surfaces correlated queries — query patterns whose execution-rate time series has the highest Pearson correlation coefficient against the overall database health metric during a window bracketing the incident. This post canonicalises the per-pattern-baseline discipline, percentile-sketch storage, two-sigma outlier framing, and correlation-based root-cause surfacing as a complete anomaly-detection + triage pipeline.
Key takeaways¶
- Database health = per-pattern outlier percentage, not a global latency threshold. "The anomaly graph shows the percentage of queries that were unusually slow. An 'unusually slow' query is defined as having an execution time exceeding two standard deviations above the mean (also known as 2σ1 or the 97.7th percentile), for queries with the same pattern over the last week." Load-bearing framing: queries that have always been slow stay invisible unless they get materially worse; shifting workload mix does not cause false positives. (Source: this post.)
- Per-query-pattern baseline beats global latency percentile. "If the outlier percentage is elevated, we know that the same query patterns are now taking longer than they did over the last week. This provides a strong signal that the database is encountering a resource bottleneck, and does not result in false positives due to shifting database workloads." Canonicalised as concepts/per-query-pattern-baseline.
- Three-step per-pattern threshold pipeline. (1) Aggregate query response-time distribution by SQL fingerprint — the same literal-stripped shape used by digest-based aggregation; (2) store a probabilistic sketch of that distribution in MySQL; (3) compute the per-fingerprint 2σ threshold from the rolling one-week window and count incoming queries that exceed it. Canonicalised as patterns/per-pattern-outlier-anomaly-detection.
- Anomaly detail page surfaces three drill-down axes. High-level query metrics (rows read, rows written per second), database resource utilization (CPU, IOPS), and a list of backups and deploy requests running during the anomaly window. Worked example: "an approximately 300% increase in rows written per second during the anomaly" made the write-heavy root cause immediately visible side-by-side with the health graph.
- Correlation-based root-cause: Pearson coefficient of per-pattern execution rate vs. overall health metric. "Because Insights records and stores exact query counts for 100% of your database's query patterns, we can compare the execution rate of every query pattern with the overall database health metrics and identify highly correlated queries… we calculate the Pearson correlation coefficient between the execution rate for each query pattern and the overall database health metric during the anomaly plus a fixed window before and after. We then return the queries with the highest correlation coefficient." Canonicalised as patterns/correlation-based-root-cause-surfacing and concepts/pearson-correlation-root-cause.
- Correlation filtering avoids confident-but-wrong answers. "Not all anomalies have correlated queries, for example, those caused by running a backup on an under-provisioned database cluster, so we exclude results with a correlation coefficient below a certain threshold." Canonical discipline: if no query pattern correlates above threshold, the UI shows nothing rather than surfacing low-confidence garbage.
- 2σ choice is admitted as conventional, not magical. Footnote 1: "There's nothing magical about 2σ or the 97.7th percentile. This value is used because it's a fairly common choice for defining a statistical outlier." Footnote 2 cites the SREcon22 Americas talk by two Google engineers as the external motivation for the 2σ-outlier approach as a health signal.
- Anomaly detection + correlated queries compress triage time. "When correlated queries are present, it can shave hours off the time it takes to find the root cause." The UX target is the worker-tier engineer who already asks "Hey, is something up with the database?" — a reliable answer with a specific-query-pointer replaces battle-tested dashboard scrolling and ad-hoc performance schema queries.
- Storage substrate disclosed: MySQL itself. The percentile sketches are stored "in MySQL" — Insights uses the same engine it is observing as the telemetry store, consistent with PlanetScale's broader pattern of using MySQL + Vitess as the substrate for its observability-side persistence (the 2023-06-23 sources/2026-04-21-planetscale-enhanced-tagging-in-postgres-query-insights post later canonicalised Insights' dual-stream Kafka + ClickHouse pipeline on the Postgres side — this earlier 2023-11-28 post describes the MySQL-side storage of the sketches themselves, a complementary altitude).
- Rollout posture: shipped at GA, tuning-via-feedback loop. "All PlanetScale databases have access to the Anomalies tab in Insights today. User feedback helps us tune the system to improve accuracy." No private-preview gate, no opt-in toggle named in the post.
Systems extracted¶
- PlanetScale Insights — hosting surface for the new Anomalies tab; per-query telemetry feeds per-pattern sketches; same SQL-fingerprint primitive used for digest-based performance analysis is re-used as the aggregation key.
- MySQL — named storage substrate for the percentile sketches; the observed engine doubles as the telemetry store.
- PlanetScale — publisher; feature is shipped across the PlanetScale MySQL fleet at GA.
Concepts extracted¶
- concepts/two-sigma-outlier-threshold — new canonical page: 2σ / 97.7th-percentile as a convention for statistical-outlier classification, cited in the post as "a fairly common choice for defining a statistical outlier." External reference: SREcon22 talk by two Google engineers.
- concepts/percentile-sketch — new canonical page: probabilistic data structure for per-group response-time distributions, stored in MySQL by Insights, enabling quantile queries (here: the 97.7th-percentile threshold) without retaining individual observations.
- concepts/per-query-pattern-baseline — new canonical page: the architectural discipline of comparing each query instance to a baseline for its own shape, not against a global latency percentile. Apples-to-apples semantics; robust to workload mix shifts.
- concepts/database-health-metric — new canonical page: the overall "% of unusually slow queries" signal, built from per-pattern outlier counts, used as the primary signal on the Anomalies graph.
- concepts/pearson-correlation-root-cause — new canonical page: Pearson correlation coefficient between a per-query-pattern execution-rate time series and the overall database health metric, computed over the anomaly window plus a fixed bracket, used to surface likely-causal queries.
- concepts/query-digest — already on wiki; re-used as
the same SQL fingerprint primitive — the
literal-stripped shape is both the aggregation key for
the
sysschema diagnostic views and the grouping key for the Insights sketches. - concepts/sqlcommenter-query-tagging — already on wiki; complement to the anomaly-detection surface (tags attribute per-query-pattern rate to the caller; the Anomalies feature operates on the shape-keyed stream, but the same Insights pipeline captures both).
- concepts/observability — extended with anomaly-detection axis (per-pattern outlier rate, not per-metric threshold alarm).
Patterns extracted¶
- patterns/per-pattern-outlier-anomaly-detection — new canonical pattern: aggregate by fingerprint → store rolling-week percentile sketch → derive per-pattern 2σ threshold → count outliers as the database-health signal.
- patterns/correlation-based-root-cause-surfacing — new canonical pattern: rank candidate causes by Pearson correlation between each candidate's time series and the alarm metric's time series during a bracketed anomaly window; suppress results below a minimum-correlation threshold to avoid confident-but-wrong suggestions.
Operational numbers¶
- Outlier threshold: 2σ above mean = 97.7th percentile, per-fingerprint.
- Baseline window: last 7 days of same-fingerprint executions.
- Drill-down axes on anomaly page: 3 — high-level query metrics (rows r/w per second), resource utilisation (CPU, IOPS), concurrent ops (backups + deploy requests).
- Worked example: ~300% increase in rows-written-per- second during the anomaly pinpoints the workload origin from the drill-down metrics alone.
- Correlation-coefficient gating: unspecified numeric floor; post only says "below a certain threshold" results are suppressed.
- Triage-time reduction: "shave hours off" when a correlated query is surfaced (qualitative, no measured distribution).
- Storage substrate: percentile sketches stored in MySQL (same engine being observed).
- Query-pattern coverage: 100% ("Insights records and stores exact query counts for 100% of your database's query patterns").
Caveats¶
- 2σ is a convention, not a derived optimum. Post admits the choice; workloads where the slow-query distribution is heavy-tailed or bimodal may need a different threshold. No per-fingerprint adaptive- threshold story disclosed.
- One-week baseline has scale-time semantics. A workload that organically slowed over the past week normalises the new normal into the baseline; a regression introduced and held for more than 7 days becomes invisible. Canonical gap named by construction of the rolling-window semantics but not addressed in the post.
- Percentile-sketch error bounds not disclosed. The post links to the time-series-data-in-sharded-MySQL piece for the sketch mechanism but doesn't name the algorithm (t-digest? KLL? HDR histogram?), error/accuracy bounds, or memory cost per fingerprint.
- Sketch build + storage cost not quantified. How many fingerprints per DB at steady state? How does sketch storage scale with fingerprint cardinality? Not disclosed.
- Correlation is not causation. Post names the filtering-via-correlation-floor discipline but does not address spurious-correlation risk: a query whose rate happens to track the health metric via a shared hidden cause (e.g. both depend on user traffic) will surface even though fixing it does nothing.
- Correlation window size not disclosed. "A fixed window before and after" is the only framing — the window width is a free parameter with real consequences (too narrow: noisy; too wide: drags in unrelated traffic).
- Backup / deploy-request correlation is UI-only. The post lists "backups and deploy requests running when the anomaly occurred" as a drill-down axis but does not quantify correlation against those; the operator has to eyeball the temporal overlap.
- Single-cluster scope. Post describes per-database anomaly detection; multi-region / replica-topology / failover semantics not addressed.
- No false-positive / false-negative rate disclosed. "We've found it to be a reliable indicator" is the only quantitative framing. No precision/recall numbers.
- Tier-3 PlanetScale vendor post. Rafer Hazen's fourth wiki ingest (after Insights launch, debugging-errors, enhanced-tagging-in-insights). Default-include per companies/planetscale skip rules as a database-internals / architecturally substantive voice even within product-launch framing.
Source¶
- Original: https://planetscale.com/blog/introducing-insights-anomalies
- Raw markdown:
raw/planetscale/2026-04-21-introducing-insights-anomalies-bf996a9b.md
Related¶
- systems/planetscale-insights — hosting surface; same fingerprint primitive is reused as the sketch key.
- systems/mysql — named storage substrate for the sketches.
- systems/planetscale — publisher; feature ships at GA across MySQL fleet.
- concepts/two-sigma-outlier-threshold — new canonical page.
- concepts/percentile-sketch — new canonical page.
- concepts/per-query-pattern-baseline — new canonical page.
- concepts/database-health-metric — new canonical page.
- concepts/pearson-correlation-root-cause — new canonical page.
- concepts/query-digest — fingerprint primitive re-used.
- concepts/sqlcommenter-query-tagging — complementary tag-attribution stream under same Insights pipeline.
- concepts/observability — extended with per-pattern outlier anomaly axis.
- patterns/per-pattern-outlier-anomaly-detection — new canonical pattern.
- patterns/correlation-based-root-cause-surfacing — new canonical pattern.
- companies/planetscale — publisher; Recent-articles extended.