PlanetScale — Storing time series data in sharded MySQL to power Query Insights¶
Summary¶
Rafer Hazen (PlanetScale, 2023-08-10) discloses the end-to-end pipeline architecture behind PlanetScale Insights: a hybrid telemetry store where low-cardinality database-level metrics live in Prometheus but high-cardinality per-query-pattern metrics live in a sharded MySQL cluster fronted by Vitess. VTGate instances emit two Kafka topics — an aggregate summary per fingerprint every 15 seconds, and slow-query events immediately — both buffered to a bounded 5 MB in-memory queue and flushed asynchronously so query serving never blocks on Kafka. Kafka keys are deterministic hashes of (database_id, fingerprint), landing all messages for a given pattern on the same partition, which lets consumers coalesce aggregates in memory per batch — reducing DB writes by 30–40%.
Aggregates land in per-hour and per-minute rollup tables; slow queries land 1:1. Both tables track Kafka (offset, partition) with uniqueness constraints to dedup consumer retries. The Insights cluster is sharded by database_id, runs on small machines (2 vCPU / 2 GB), and was grown from 4 → 8 shards not by Vitess live reshard but by provisioning a new, larger cluster that received dual writes for 8 days (Insights' retention window) before reads cut over. DDSketches run at α = 0.01 (≤1% relative error) and are stored in MySQL as binary BLOB columns, read and merged via custom loadable C++ MySQL functions so percentile calculation happens in SQL rather than in the application layer.
Key takeaways¶
-
Hybrid telemetry store: Prometheus for low-cardinality, sharded MySQL for high-cardinality. "The database-level data fits well into a time series database like Prometheus, but we run into several issues when trying to store the query-pattern level data in a time series database. On any given day, there are 10s of millions of unique query patterns being used across PlanetScale, and we anticipate that this number will continue growing. Most time series databases start having issues when dimension cardinality is that high. We evaluated the cost of storing this data in Prometheus and found that it would be expensive enough to be fairly unattractive." Canonicalises patterns/hybrid-timeseries-store-prometheus-plus-sharded-mysql — route each workload to the store whose cost curve matches its cardinality profile. (Source: sources/2026-04-21-planetscale-storing-time-series-data-in-sharded-mysql-to-power-query-insights.)
-
Two Kafka topics: aggregates every 15s, slow-query events immediately. "Sends an aggregate summary for each query fingerprint … to Kafka every 15 seconds. 15 seconds is a good balance between keeping the number of messages manageable and providing a near real-time experience. Sends slow query events to a Kafka topic immediately." The 15-second cadence is the canonical per-pattern flush tick. Aggregate-summary topic drives the per-pattern time-series (concepts/per-pattern-time-series); slow-query-events topic drives the slow-query-log UI one-to-one with MySQL rows.
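As a sketch of that cadence: a minimal per-VTGate accumulator that rolls queries up per (database_id, fingerprint) and flushes one summary message per fingerprint on a 15-second tick. The `publish` callback, aggregate field names, and clock injection are illustrative assumptions, not PlanetScale's API.

```python
import time
from collections import defaultdict

FLUSH_INTERVAL_S = 15  # the post's per-fingerprint aggregate cadence

class AggregateFlusher:
    """Per-VTGate aggregate accumulator: each query updates an in-memory
    summary keyed by (database_id, fingerprint); the whole map is flushed
    as one Kafka message per fingerprint every 15 seconds."""

    def __init__(self, publish, clock=time.monotonic):
        self.publish = publish          # stands in for the Kafka producer
        self.clock = clock
        self.last_flush = clock()
        self.aggregates = defaultdict(lambda: {"count": 0, "total_ms": 0.0})

    def record(self, database_id, fingerprint, latency_ms):
        agg = self.aggregates[(database_id, fingerprint)]
        agg["count"] += 1
        agg["total_ms"] += latency_ms
        self.maybe_flush()

    def maybe_flush(self):
        # Flush everything accumulated once the 15 s interval has elapsed.
        if self.clock() - self.last_flush >= FLUSH_INTERVAL_S:
            for key, agg in self.aggregates.items():
                self.publish(key, agg)  # one summary message per fingerprint
            self.aggregates.clear()
            self.last_flush = self.clock()
```

The injected clock makes the flush tick testable without sleeping; in a real VTGate the flush would run on a timer rather than piggybacking on `record`.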
-
Instrumentation-site safety limits: per-interval pattern cap + token-bucket slow-query limiter + bounded memory buffer. "A primary design goal for Insights is that the instrumentation should never slow your database down or cause unavailability. We impose several limits at the instrumentation site … We set a limit for the number of unique query patterns per interval … We limit the number of recorded slow query log events using a continuously refilled token bucket rate limiter with a generous initial capacity." And: "Data submitted in VTGate is published to a bounded memory buffer and flushed to Kafka asynchronously. Asynchronous publication minimizes per-query overhead and ensures we continue to serve queries even during a Kafka outage. We guard against a temporary Kafka unavailability by buffering up to 5MB, which will be sent when Kafka becomes available again." Three orthogonal safety primitives canonicalised: token-bucket slow-query limiter (captures bursts but caps throughput), async publication with bounded buffer (5 MB failure-mode budget), and per-interval unique-pattern cap (bounds VTGate memory against cardinality explosion).
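The token-bucket and bounded-buffer primitives can be sketched as follows. Capacities, refill rates, and class names are illustrative assumptions; the post discloses only the 5 MB buffer bound.

```python
import time

class TokenBucket:
    """Continuously refilled token bucket, as described for the slow-query
    limiter: a generous initial capacity absorbs bursts, the refill rate
    caps sustained throughput. Parameters here are illustrative."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.tokens = float(capacity)   # start full: generous initial burst
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill continuously from elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

class BoundedBuffer:
    """Bounded in-memory byte budget (the post cites 5 MB per VTGate):
    drop new events rather than block query serving when Kafka is down."""

    def __init__(self, max_bytes=5 * 1024 * 1024):
        self.max_bytes = max_bytes
        self.used = 0
        self.items = []

    def offer(self, payload: bytes) -> bool:
        if self.used + len(payload) > self.max_bytes:
            return False  # dropped: instrumentation must never block serving
        self.items.append(payload)
        self.used += len(payload)
        return True
```

Dropping (rather than blocking) on a full buffer is what makes the 5 MB figure a failure-mode budget instead of a backpressure point.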
-
Deterministic-hash Kafka key enables in-memory coalescing (30–40% write reduction). "Aggregate query data is mapped to Kafka partitions by setting the Kafka key to a deterministic hash of the database identifier and the query fingerprint. Because of this, all messages for a given database/query pattern will arrive in the same partition and we can merge aggregate Kafka messages in memory for each consumer batch to avoid unnecessary database writes. In practice, we've found that in-memory coalescing decreases database writes by about 30%–40%. Larger batches yield better write coalescing but require more memory in the consumer and increase end-to-end latency. Under normal operations, the average batch size is around 200 messages but can go as high as 1,000 if there is a load spike or we're working through a Kafka backlog. The higher coalesce rate in larger batches helps us quickly burn down message backlogs when they occur." Canonicalises patterns/deterministic-key-hash-for-partition-affinity and concepts/in-memory-coalescing-by-kafka-key: a load-bearing composition where Kafka's same-key-same-partition guarantee is exploited not for ordering but for aggregation locality. Secondary property: larger coalesce batches accelerate backlog burn-down — the write-rate-shedding benefit is self-reinforcing under load.
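A minimal sketch of the keying-plus-coalescing mechanism, assuming a SHA-1 key and additive aggregate fields (both illustrative; the post does not name the hash function or the message schema):

```python
import hashlib
from collections import OrderedDict

def kafka_key(database_id: str, fingerprint: str) -> bytes:
    """Deterministic hash of (database_id, fingerprint). Under Kafka's
    hash partitioner, equal keys always land on the same partition, so
    one consumer sees every message for a given query pattern."""
    return hashlib.sha1(f"{database_id}:{fingerprint}".encode()).digest()

def coalesce(batch):
    """Merge all aggregate messages in one consumer batch that share a
    key, so a single DB write covers the whole group. `batch` is a list
    of (key, aggregate) pairs with additive fields (count, latency sums)."""
    merged = OrderedDict()
    for key, agg in batch:
        if key in merged:
            acc = merged[key]
            for field, value in agg.items():
                acc[field] = acc.get(field, 0) + value
        else:
            merged[key] = dict(agg)
    return list(merged.items())
```

The write reduction falls out directly: a batch of N messages over K distinct keys becomes K writes, which is why larger batches (more repeats per key) coalesce better.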
-
Per-hour and per-minute rollup tables for dual-granularity reads. "We store the query pattern data in both per-hour and per-minute roll-up tables to serve large time periods efficiently and small periods with high granularity." Canonicalises patterns/dual-granularity-rollup-tables + concepts/per-hour-per-minute-rollup-tables. The time-series UI's zoom-out queries hit per-hour rollups (cheap scan, few rows); zoom-in queries hit per-minute rollups (higher resolution, bounded row count because the window is narrow). Classical time-series-database design pattern applied inside MySQL.
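The dual-granularity read path reduces to a routing decision like the following. The table names and the 48-hour cutoff are assumptions for illustration; the post states only that both rollups exist.

```python
from datetime import timedelta

# Illustrative cutoff: windows longer than this read the per-hour rollup.
HOURLY_CUTOFF = timedelta(hours=48)

def rollup_table(window: timedelta) -> str:
    """Route a read to the per-minute or per-hour rollup table based on
    the requested time window (hypothetical table names)."""
    if window > HOURLY_CUTOFF:
        return "query_stats_by_hour"    # zoom-out: few rows, cheap scan
    return "query_stats_by_minute"      # zoom-in: high resolution, but the
                                        # narrow window bounds the row count
```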
-
Kafka (offset, partition) uniqueness constraints dedup consumer retries. "For both the aggregate and slow query topics, we track the offset and partition from the underlying Kafka messages in the MySQL tables and use uniqueness constraints to avoid duplicating data if Kafka consumers retry batches following a failure." Canonicalises concepts/kafka-offset-partition-uniqueness-constraint — a general Kafka-consumer idempotence pattern (not Insights-specific) where the (topic, partition, offset) tuple is a natural unique identifier suitable for DB-side deduplication.
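The dedup mechanism can be demonstrated with SQLite standing in for MySQL (same idea, different idempotent-insert syntax; MySQL would use INSERT IGNORE or ON DUPLICATE KEY). Column names are illustrative.

```python
import sqlite3

# In-memory stand-in for an Insights table; the UNIQUE constraint on
# (partition, offset) is the mechanism the post describes.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE query_aggregates (
        kafka_partition INTEGER NOT NULL,
        kafka_offset    INTEGER NOT NULL,
        fingerprint     TEXT NOT NULL,
        query_count     INTEGER NOT NULL,
        UNIQUE (kafka_partition, kafka_offset)
    )
""")

def consume(messages):
    """Idempotent batch insert: a retried batch re-sends the same
    (partition, offset) pairs and the constraint silently drops them."""
    conn.executemany(
        "INSERT OR IGNORE INTO query_aggregates VALUES (?, ?, ?, ?)",
        messages,
    )
    conn.commit()

batch = [(0, 100, "select ?", 5), (0, 101, "insert ?", 2)]
consume(batch)
consume(batch)  # consumer retry after a failure: no duplicate rows
```

This works because a Kafka message's (partition, offset) pair is immutable and unique within a topic, making it a natural DB-side idempotence token.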
-
DDSketch at α = 0.01, binary-format BLOB column, loadable C++ MySQL functions for in-SQL percentile calculation. "We're using α=0.01 which is sufficiently accurate (estimates can be off by at most 1%) and yields suitably small sketches. Each VTGate instance records a sketch of the latencies for each query pattern and sends it along with the other aggregates every 15 seconds. The sketches are read from Kafka and written to MySQL in a custom binary format. We've implemented a small library of loadable C++ MySQL functions that know how to read and write the binary format, allowing us to aggregate sketches and compute arbitrary percentiles in MySQL. Performing these functions in MySQL allows us to calculate percentiles without needing to pull the underlying sketches into our application. It also lets us use the full expressive power of SQL to get the data we need. We plan to open source the MySQL function library in the near future." Three canonical wiki datums: α = 0.01 is the production-canonical DDSketch relative-error bound (previously referenced by arXiv-number only in the 2023-04-20 sister post); sketches persist as MySQL BLOB columns (concepts/sketch-as-mysql-binary-column); percentile aggregation happens server-side via MySQL loadable C++ functions — query percentiles in SQL, not application code. The architectural property this enables: operators can use "the full expressive power of SQL" — joins against the slow-query table, tag filters, arbitrary predicates — on the same row as the latency sketch.
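A minimal, mergeable DDSketch in the spirit of what the C++ functions do over the BLOBs. This is a teaching sketch, not PlanetScale's binary format or function API: it keeps only the logarithmic bucket map, and merging is bucket-count addition.

```python
import math

class DDSketch:
    """Minimal DDSketch with relative-error bound alpha (Masson et al.).
    With alpha = 0.01 the bucket growth factor is
    gamma = (1 + alpha) / (1 - alpha) ≈ 1.0202."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.gamma = (1 + alpha) / (1 - alpha)
        self.buckets = {}   # bucket index -> count
        self.count = 0

    def add(self, value):
        # Bucket i covers (gamma^(i-1), gamma^i]; any value in it can be
        # reported as the bucket midpoint with <= alpha relative error.
        assert value > 0, "basic DDSketch handles positive values only"
        i = math.ceil(math.log(value, self.gamma))
        self.buckets[i] = self.buckets.get(i, 0) + 1
        self.count += 1

    def merge(self, other):
        # Sketches merge losslessly by summing bucket counts; this is the
        # operation the loadable MySQL functions perform over BLOB columns.
        for i, c in other.buckets.items():
            self.buckets[i] = self.buckets.get(i, 0) + c
        self.count += other.count

    def quantile(self, q):
        rank = q * (self.count - 1)
        seen = 0
        for i in sorted(self.buckets):
            seen += self.buckets[i]
            if seen > rank:
                # midpoint estimate for bucket (gamma^(i-1), gamma^i]
                return 2 * self.gamma ** i / (self.gamma + 1)
        raise ValueError("empty sketch")
```

Because merge is just bucket addition, per-VTGate sketches can be combined across hosts and time windows and still answer any percentile within the α bound, which is what makes in-SQL aggregation over stored sketches viable.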
-
Shard by database_id; grew 4 → 8 shards by dual-writing to a new, larger cluster for 8 days. "The Kafka consumers issue about 5k writes per second to the MySQL database, and we need to be ready to scale this out as PlanetScale grows. To ensure that the database doesn't become a bottleneck, we sharded the Insights database cluster based on the customer database ID … Database ID works well as a shard key because we never need to join data across customer databases, and it results in a fairly even distribution of data across shards. Insights originally shipped with four shards, but we increased this to eight earlier this year to keep up with increased write volume and to build operation experience resharding. Vitess can re-shard an actively used database, but we opted to provision a new, larger, PlanetScale database when we needed to increase the number of shards. Since Insights currently stores eight days of data, we provisioned a new set of consumers, let the new branch receive duplicate writes for eight days, and then cut the application over to read from the new database." Canonicalises concepts/shard-key-database-id and patterns/dual-write-branch-cutover-via-new-cluster: when retention is short, dual-writing into a fresh cluster for the retention window is simpler than live Vitess reshard because the operator ends with a clean post-retention-aged cluster they have independently validated. Canonical framing: "This method allowed us to test and gain confidence in the new cluster before placing it in the critical path. Based on load tests and resource utilization metrics in production, we've found that our maximum write throughput has so far scaled linearly with the number of shards."
-
Many small shards (2 vCPU / 2 GB) beat few large ones for backups, schema changes, and headroom. "We've successfully run the Insights database cluster on fairly small machines (2 vCPUs and 2GB memory). A larger number of smaller shards keeps backups and schema changes fast, gives us the option of quickly scaling up to larger hardware if we encounter an unexpected throughput increase, and gives us breathing room to backfill a new cluster with more shards when necessary." Canonicalises patterns/small-shards-wide-fleet: small per-shard resource footprint preserves two vertical escape hatches (scale-up each shard or backfill-to-more-shards) that a wide-but-large fleet cannot exercise without outgrowing cloud instance types.
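Shard routing on database_id can be sketched with plain hash-mod. Vitess actually maps a keyspace-ID hash onto shard key ranges, so this is an analogy for the routing property (all rows for one customer database on one shard, even spread across shards), not the implementation.

```python
import hashlib

NUM_SHARDS = 8  # Insights grew from 4 to 8 shards

def shard_for(database_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Deterministic shard routing on the customer database ID.
    Hash-mod is an illustrative simplification of Vitess's
    keyspace-ID range mapping; the invariants are the same:
    every row for one database lands on exactly one shard, and a
    uniform hash spreads databases evenly across shards."""
    digest = hashlib.sha256(database_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Note the caveat the post's migration story illustrates: with plain hash-mod, changing `num_shards` remaps almost every key, which is one reason the dual-write-to-a-new-cluster cutover is attractive when retention is short.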
Systems / concepts / patterns extracted¶
Systems — systems/planetscale-insights (this article discloses Insights' pipeline architecture beneath the fingerprint + UI layer); systems/vitess + systems/vtgate (VTGate is the instrumentation site producing Kafka messages); systems/kafka (dual-topic async transport); systems/mysql (per-pattern storage layer; canonicalised here as a time-series database when the cardinality axis matters more than built-in TSDB functions); systems/planetscale (product — the Insights database cluster is a PlanetScale database, dogfooded); systems/prometheus (low-cardinality half of the hybrid stack).
Concepts (new) — concepts/async-kafka-publication-for-telemetry; concepts/in-memory-coalescing-by-kafka-key; concepts/per-hour-per-minute-rollup-tables; concepts/token-bucket-slow-query-limiter; concepts/shard-key-database-id; concepts/kafka-offset-partition-uniqueness-constraint; concepts/loadable-mysql-function; concepts/sketch-as-mysql-binary-column; concepts/dual-write-branch-migration.
Concepts (extended) — concepts/ddsketch-error-bounded-percentile (α = 0.01 production value + MySQL-binary-BLOB persistence + in-SQL percentile calculation); concepts/query-pattern-fingerprint (15-second flush cadence + Kafka-keying by (database_id, fingerprint)); concepts/per-pattern-time-series (per-hour + per-minute rollup table shape; dual-granularity read path); concepts/horizontal-sharding (database_id as "natural" shard key: no cross-customer joins, even distribution; small-shards-wide-fleet posture).
Patterns (new) — patterns/hybrid-timeseries-store-prometheus-plus-sharded-mysql (route by cardinality profile); patterns/deterministic-key-hash-for-partition-affinity (exploit same-key-same-partition for aggregation locality); patterns/dual-granularity-rollup-tables (per-hour + per-minute for zoom-in / zoom-out reads); patterns/dual-write-branch-cutover-via-new-cluster (provision-and-dual-write for retention window, cut over after validation); patterns/small-shards-wide-fleet (2-vCPU/2-GB shards over few large ones).
Patterns (extended) — patterns/ast-normalized-query-fingerprinting (this post re-confirms the fingerprint is produced at the VTGate layer and is the Kafka key; ties into the partition-affinity mechanism).
Operational numbers¶
- Fleet traffic baseline: "more than 10 billion of our customers' queries" processed per day across PlanetScale.
- Per-pattern daily cardinality: "10s of millions of unique query patterns … on any given day" — the datum that falsified Prometheus for this axis.
- Aggregate-flush cadence: 15 seconds per fingerprint, per VTGate.
- Async-buffer failure-mode budget: 5 MB in-memory per VTGate before dropping.
- Consumer write load: "about 5k writes per second" to the Insights MySQL cluster.
- In-memory coalesce rate: 30–40% DB write reduction via aggregate merging.
- Consumer batch size: ~200 messages average, up to 1,000 under load spike / backlog recovery.
- DDSketch α: 0.01 (≤1% relative error on percentiles).
- Shard count evolution: 4 → 8 shards earlier in 2023.
- Retention window: 8 days of Insights data.
- Dual-write duration for reshard: 8 days (matched to retention; reads cut over after the new cluster was naturally filled and validated).
- Per-shard hardware: 2 vCPUs, 2 GB memory.
- Scaling observation: "maximum write throughput has so far scaled linearly with the number of shards."
Caveats¶
- "Sharded PlanetScale database" elides its own internals. The post treats "a sharded PlanetScale database, backed by MySQL and Vitess" as a given substrate. VTGate's routing, VTTablet's connection pooling, topo-server's role state are all implicit — covered elsewhere on the wiki (systems/vtgate, systems/vttablet, concepts/vitess-topo-server). The author's own product.
- Pipeline pre-dates the later Kafka → ClickHouse migration. The 2026-03-24 enhanced-tagging post (sources/2026-04-21-planetscale-enhanced-tagging-in-postgres-query-insights) canonicalises ClickHouse as the OLAP-side store ("Both feed ClickHouse"). This 2023-08-10 post canonicalises the sharded-MySQL era beneath it — ~2.5 years earlier. The post explicitly considered and rejected OLAP stores: "A wide variety of OLAP databases could also serve our needs here, but all of them involve significant operational overhead and a steep learning curve." Whatever drove the eventual ClickHouse migration is not disclosed in this post. Historical-architecture datum: ClickHouse earned its way into the Insights pipeline by displacing a working sharded-MySQL system, not by being the initial choice.
- Custom loadable C++ MySQL functions are not named, not benchmarked, not open-sourced (as of 2023-08-10). "We plan to open source the MySQL function library in the near future." The post does not disclose the function API (sketch_merge(BLOB) → BLOB? sketch_quantile(BLOB, double) → double? both?), the binary layout, or overhead vs application-side merge.
- Worked example is architecture, not war story. No production incident, no cardinality-explosion retrospective, no Kafka-outage recovery story. Pure architecture-disclosure voice — Hazen walks through the mechanism without reporting what went wrong.
- Per-interval unique-pattern cap named but not quantified. "We monitor VTGates to ensure that even our largest customers aren't regularly exceeding this threshold." The per-15-second-interval pattern cap is disclosed as a category of safety but no number is given — so the wiki can't quantify when VTGate-side cardinality truncation kicks in.
- DDSketch α = 0.01 is the only disclosed sketch parameter. Max bucket count (which controls absolute memory footprint) elided. Bucket-width growth factor γ derived from α (γ = (1+α)/(1−α) = 1.0202) is implicit via DDSketch math but the post doesn't state it.
- Resharding experience claim is the only "why not Vitess reshard" rationale. "Vitess can re-shard an actively used database, but we opted to provision a new, larger, PlanetScale database … to build operation experience resharding." The word "experience" elides what concretely motivated the operator preference — risk / validation / blast-radius / team confidence — and what the longer-term plan is (is Insights destined to always reshard via new-cluster-dual-write? or was this a transitional choice?).
- Hazen byline — fifth wiki Insights ingest, pipeline-layer altitude. Rafer Hazen is PlanetScale's canonical Insights voice (prior wiki ingests: 2022-09-27 debugging-errors, 2023-04-20 query-performance-analysis, 2023-11-28 introducing-anomalies, 2026-03-24 enhanced-tagging + AI-index-suggestions). This 2023-08-10 post is the pipeline-layer disclosure between the 2023-04-20 fingerprinting substrate and the 2026-03-24 dual-stream ClickHouse canonicalisation. It is the earliest wiki citation of Insights as a Kafka-plus-sharded-MySQL system.
- Short post (~1,600 words), dense architectural body. Every paragraph advances a specific primitive. Architecture density ~95% — no marketing filler.
Source¶
- Original: https://planetscale.com/blog/storing-time-series-data-in-sharded-mysql
- Raw markdown: raw/planetscale/2026-04-21-storing-time-series-data-in-sharded-mysql-to-power-query-ins-69bfff9b.md
Related¶
- systems/planetscale-insights
- systems/vitess
- systems/vtgate
- systems/kafka
- systems/mysql
- systems/planetscale
- systems/prometheus
- concepts/query-pattern-fingerprint
- concepts/ddsketch-error-bounded-percentile
- concepts/per-pattern-time-series
- concepts/horizontal-sharding
- concepts/async-kafka-publication-for-telemetry
- concepts/in-memory-coalescing-by-kafka-key
- concepts/per-hour-per-minute-rollup-tables
- concepts/token-bucket-slow-query-limiter
- concepts/shard-key-database-id
- concepts/kafka-offset-partition-uniqueness-constraint
- concepts/loadable-mysql-function
- concepts/sketch-as-mysql-binary-column
- concepts/dual-write-branch-migration
- patterns/hybrid-timeseries-store-prometheus-plus-sharded-mysql
- patterns/deterministic-key-hash-for-partition-affinity
- patterns/dual-granularity-rollup-tables
- patterns/dual-write-branch-cutover-via-new-cluster
- patterns/small-shards-wide-fleet
- patterns/ast-normalized-query-fingerprinting
- companies/planetscale