
PLANETSCALE 2026-03-24


PlanetScale — Enhanced tagging in Postgres Query Insights

Summary

Rafer Hazen (PlanetScale) announces enhanced query tagging in PlanetScale Insights for Postgres, shipped alongside the Traffic Control launch. Insights has supported SQLCommenter-style query tags for some time, but tags were previously only attached to the individual-query stream (notable queries — any query that took > 1 second, read > 10k rows, or errored). This release extends tagging to the aggregate summary stream (per-query-pattern statistics emitted every 15s), enabling three new surfaces: (1) tag distributions per query pattern, (2) database-wide statistics broken down by tag, (3) tag-filtered query-pattern lists via tag:MY_TAG:MY_VALUE. The architectural core is two cardinality-reduction strategies that keep the aggregate stream from exploding when tags are high-cardinality (like request_id): a per-query-pattern key-cardinality tracker (> 20 distinct values for a tag key on a pattern ⇒ collapse to * for 1 hour) and a per-interval tag-combination limit (> 50 unique combinations in one 15-second interval ⇒ greedily collapse the highest-cardinality tag until under threshold). Collapsed status is recorded on every aggregate so the UI can show "the percentage of tag values where the value is unknown."

Key takeaways

  1. Two query-data streams power Insights — per-query messages for the notable-query tail (> 1s duration OR > 10k rows read OR error) and per-query-pattern aggregate summaries emitted every 15s. Before this release, only the notable-query stream carried tags; aggregates were tag-blind. (Source: sources/2026-04-21-planetscale-enhanced-tagging-in-postgres-query-insights)
  2. Aggregate-stream tagging unlocks three question classes that notable-query tagging alone couldn't answer: "What queries has this user executed?" (including cheap/fast ones not in the notable tail), "What percentage of my total query run time is coming from this controller?" (requires attributing aggregate time per tag value), "Which background jobs are executing this query?" (mapping query pattern → set of callers).
  3. Per-unique-tag-combination messaging is the chosen aggregate shape — emit one aggregate message per distinct combination of tag key-value pairs observed for a query pattern in the 15s window. The rejected alternative — a single aggregate message with a tags: {"controller=users": 1, "controller=sessions": 2} counter map — cannot attribute aggregate statistics to any individual tag value: "the summary data for one tag is permanently combined with the data from all tags."
  4. Per-unique-combination messaging doesn't scale without cardinality control. Worst case: customer tags every query with a unique request_id → every execution becomes its own aggregate message, destroying the aggregation benefit. "This would be prohibitively expensive to process and store, and would consume considerable resources on the database host where telemetry data is emitted."
  5. Per-query-pattern key-cardinality tracker (first collapse mechanism): count distinct values per tag key per query pattern; if count exceeds 20, mark that key as collapsed for that pattern for 1 hour; all queries for that pattern during the window carry tag_key=* instead of the specific value. Catches inherently high-cardinality keys like request_id or user_id.
  6. Per-pattern scoping is load-bearing: tracking cardinality globally (across all query patterns) would unnecessarily collapse tags that are high-cardinality overall but tightly correlated with query pattern. source_location (file + line number of the query call site) is the canonical example: high-cardinality globally, but typically only 1-2 values per query pattern, so it doesn't inflate the aggregate message count. Tracking per pattern lets it pass through uncollapsed.
  7. Per-interval tag-combination limit (second collapse mechanism): even individually low-cardinality tags can combine explosively — 6 tag keys × 10 values each = 10⁶ possible combinations per query pattern. Within each 15-second interval, track aggregates keyed by the full tag-combination set; if any pattern exceeds 50 combinations, iteratively find the highest-cardinality tag among those aggregates and collapse it (replace all values with *), merging the now-identical aggregates until under the threshold.
  8. Collapse is recorded on the wire: emitted aggregate messages carry a flag indicating that a specific key was collapsed; the UI surfaces "the percentage of tag values where the value is unknown" so users know when tag-based breakdowns are incomplete. Collapse is lossy-by-design but observable.
  9. Three user-facing surfaces (all enabled by aggregate-stream tagging): (a) Query-pattern details page shows tag distribution for one pattern — list of tag values with percentage of queries that included each value; (b) Tags section in Insights sidebar shows database-level aggregate statistics broken down by tag (or set of tags); (c) Query-pattern search via tag:MY_TAG:MY_VALUE returns query patterns filtered to queries carrying that tag pair.
  10. Automatic tags added by Insights (emitted by the extension itself, not the application): application_name (Postgres driver-set), username (Postgres user), remote_address (remote IP). These join application-set tags (controller, action, job, source_location) in the same aggregation pipeline.
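The first collapse mechanism (takeaway 5) and its per-pattern scoping (takeaway 6) can be sketched roughly as follows. This is a minimal illustration, not PlanetScale's implementation — the real logic lives in a C extension inside Postgres, and the class and method names here are invented:

```python
import time

CARDINALITY_LIMIT = 20    # distinct values per tag key per query pattern (from the post)
COLLAPSE_WINDOW_S = 3600  # collapse lasts 1 hour (from the post)

class KeyCardinalityTracker:
    """Tracks distinct tag values per (query pattern, tag key) and collapses
    a key to '*' for one hour once it exceeds the cardinality limit.
    Scoping by pattern lets globally high-cardinality but pattern-correlated
    keys (e.g. source_location) pass through uncollapsed."""

    def __init__(self, limit=CARDINALITY_LIMIT, window=COLLAPSE_WINDOW_S,
                 clock=time.monotonic):
        self._seen = {}        # (pattern, key) -> set of distinct values observed
        self._collapsed = {}   # (pattern, key) -> collapse-window expiry timestamp
        self._limit = limit
        self._window = window
        self._clock = clock

    def observe(self, pattern, key, value):
        """Return the tag value to emit on the aggregate: the real value,
        or '*' while this key is collapsed for this pattern."""
        now = self._clock()
        slot = (pattern, key)
        expiry = self._collapsed.get(slot)
        if expiry is not None:
            if now < expiry:
                return "*"
            # Collapse window ended: forget old state and start counting again.
            del self._collapsed[slot]
            self._seen.pop(slot, None)
        values = self._seen.setdefault(slot, set())
        values.add(value)
        if len(values) > self._limit:
            self._collapsed[slot] = now + self._window
            return "*"
        return value
```

A request_id tag crosses the 20-value limit on a busy pattern and collapses for an hour, while the same tracker leaves a two-value source_location tag on another pattern untouched.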

Systems / concepts / patterns extracted

Systems

  • systems/planetscale-insights — the Postgres extension, the canonical consumer of these tags; gains aggregate-stream tag surfaces as part of this release.
  • systems/planetscale-traffic-control — co-launched feature; uses the same SQLCommenter tag substrate for workload-class classification; aggregate tagging enables per-tag observability of Traffic Control's workload classes.
  • systems/postgresql — the host database; Insights extension runs inside the Postgres binary.
  • systems/kafka — the two-topic pipeline backbone (individual queries → Kafka topic 1; aggregate summaries → Kafka topic 2).
  • systems/clickhouse — the aggregate storage engine Kafka feeds into; powers query tables, anomaly detection, and query-related graphs.

Concepts (new)

Concepts (extended)

  • concepts/query-tag-filter — the same tag:key:value / tag:key filter syntax now operates across aggregate-stream data (previously only per-error and per-notable-query).
  • concepts/actor-tagged-error — sibling surface using the same tags on error occurrences; pre-existing before this release.
  • concepts/observability — general concept; aggregate-stream tagging is a canonical new instance of trading storage for attribution.

Patterns (new)

Patterns (extended)

  • patterns/actor-tagged-query-observability — pre-existing pattern; aggregate-stream tagging extends its reach from the notable-query tail + error stream to all queries in aggregate, substantially widening the attribution surface.

Operational numbers

  • Notable-query capture threshold (pre-existing, restated): query read > 10,000 rows OR duration > 1 second OR resulted in an error → individual-query message emitted.
  • Aggregate-summary cadence: 1 aggregate message per query pattern (pre-this-release) / per unique tag combination per query pattern (post-this-release), every 15 seconds.
  • Per-query-pattern key-cardinality threshold: 20 distinct values per tag key per query pattern. Exceeding this collapses the key to * for 1 hour on that pattern.
  • Per-interval tag-combination threshold: 50 unique combinations per query pattern per 15s interval. Exceeding this triggers greedy collapse of the highest-cardinality tag until under threshold.
  • Worked example (from the post): five executions of one pattern with 2 distinct controller values and 4 distinct host values = 5 combinations. Collapsing host (highest cardinality, 4 values) leaves 2 aggregates (one per controller value).
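The worked example above can be reproduced with a rough sketch of the second mechanism, the greedy per-interval collapse. The function name and data shapes are invented for illustration, and the limit is set to 3 (rather than the real 50) so the 5-combination example actually triggers a collapse:

```python
from collections import Counter

def collapse_interval(aggregates, limit):
    """Greedily collapse the highest-cardinality tag key until a query
    pattern's distinct tag combinations number <= limit.

    aggregates: dict mapping frozenset of (key, value) pairs -> execution count.
    Returns (reduced aggregates, set of collapsed keys)."""
    collapsed_keys = set()
    while len(aggregates) > limit:
        # Count distinct values per tag key across the current aggregates.
        values_per_key = {}
        for combo in aggregates:
            for k, v in combo:
                values_per_key.setdefault(k, set()).add(v)
        # Collapse the key with the most distinct values to '*' ...
        key = max(values_per_key, key=lambda k: len(values_per_key[k]))
        collapsed_keys.add(key)
        # ... and merge aggregates that have become identical.
        merged = Counter()
        for combo, count in aggregates.items():
            new_combo = frozenset((k, "*" if k == key else v) for k, v in combo)
            merged[new_combo] += count
        aggregates = dict(merged)
    return aggregates, collapsed_keys

# Worked example from the post: 5 executions, 2 controller values, 4 host
# values -> 5 distinct combinations. host is the highest-cardinality key,
# so it collapses first, leaving 2 aggregates (one per controller value).
executions = [
    {"controller": "users", "host": "h1"},
    {"controller": "users", "host": "h2"},
    {"controller": "users", "host": "h3"},
    {"controller": "sessions", "host": "h4"},
    {"controller": "sessions", "host": "h1"},
]
interval = Counter(frozenset(e.items()) for e in executions)
reduced, collapsed = collapse_interval(dict(interval), limit=3)
```

The loop re-measures cardinality after each collapse, so individually low-cardinality keys that only explode in combination (takeaway 7's 6 keys × 10 values case) are handled the same way.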

Architecture

Postgres extension (Insights)
├── Individual queries topic (Kafka)      [pre-existing: tags carried here]
│     └── queries read >10k rows OR >1s OR errored
│                                          → Notable queries UI
│                                          → Error-occurrences UI
└── Aggregate summaries topic (Kafka)     [this release: tags now carried here]
      └── per-query-pattern, per-unique-tag-combination, every 15s
         ├── Cardinality-reduction layer
         │    ├── Per-pattern key-cardinality tracker (>20 values → collapse key for 1h)
         │    └── Per-interval combination limit (>50 combos → greedy highest-card collapse)
         └── Collapse status flag on every message
                                           → ClickHouse
                                           → Query table + anomalies + graphs
                                           → Tag distribution on pattern details page
                                           → Tags section in sidebar (DB-wide stats by tag)
                                           → Search: tag:MY_TAG:MY_VALUE

Aggregate stream powers "the majority of Insights including the query table, anomalies, and all query-related graphs." Individual stream powers the notable-query tail + error occurrences.
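For context on the tag substrate itself: application-set tags ride along as SQLCommenter-style SQL comments appended to each query. A minimal, illustrative parser for that format — not PlanetScale's implementation, and simplified relative to the full SQLCommenter spec (which also URL-encodes keys and sorts them):

```python
import re
from urllib.parse import unquote

def parse_sqlcommenter_tags(sql):
    """Extract key/value tags from a trailing SQLCommenter-style comment,
    e.g.  SELECT * FROM users WHERE id = $1 /*controller='users',action='show'*/
    Values are URL-decoded per the SQLCommenter convention."""
    m = re.search(r"/\*(.*?)\*/\s*$", sql)
    if not m:
        return {}
    tags = {}
    for part in m.group(1).split(","):
        key, _, raw = part.partition("=")
        tags[key.strip()] = unquote(raw.strip().strip("'"))
    return tags
```

Tags parsed this way from the individual-query stream were already surfaced pre-release; this release feeds the same pairs into the aggregate pipeline alongside the extension's automatic application_name / username / remote_address tags.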

Caveats

  • No production numbers: no disclosure of what percentage of query patterns hit either cardinality threshold in PlanetScale's fleet, no collapse-rate telemetry, no p99 latency impact of the cardinality-reduction layer on the emission path, no storage savings vs a no-collapse baseline.
  • Thresholds stated as chosen values, not derived: 20 / 1h / 50 are stated as "currently" values without derivation. No guidance on tuning for different customer profiles.
  • Collapse fairness unaddressed: when the per-interval limit triggers, which tag values are lost depends on per-interval greedy selection; one query pattern's request_id may be collapsed while another's isn't, and the post doesn't address whether that reads as a UX inconsistency.
  • Collapse recovery is coarse: the per-pattern key-cardinality tracker collapses for 1 hour; if the cardinality was transient (e.g. a load test ended), the customer can't see the specific values for an hour even though the underlying cardinality is now low. No disclosed mechanism for fast-recovery or manual reset.
  • Collapse surfacing is quantitative-only: "percentage of tag values where the value is unknown" — the UI shows the fraction but not the identity of the collapsed key, which could be inferred from context but isn't made explicit.
  • Workers vs DB host emission-cost trade-off implied: "would consume considerable resources on the database host where telemetry data is emitted" — hints that the cardinality-reduction layer runs inside the Postgres extension (in the DB host's process space), which has implications for the extension's memory footprint, but the mechanism isn't detailed.
  • Application-emitted tags only partially enumerated: the post lists controller, action, job, source_location illustratively — these are common, not exhaustive. The full set is application-author's choice.
  • No comparison to alternative observability products: Datadog DBM, pganalyze, Percona PMM, pg_stat_statements — each has its own tagging story. No positioning.
