PATTERN Cited by 1 source

Dynamic cardinality reduction by tag collapse

Pattern: an observability aggregation pipeline that emits one message per unique tag combination runs two dynamic cardinality reducers that observe in-flight data and collapse specific tag values to a placeholder (*) when their cardinality exceeds a threshold. The two reducers cover complementary failure modes: one key with too many values ("request_id has 10,000 values"), and many individually low-cardinality keys combining multiplicatively ("6 keys × 10 values each ⇒ 10⁶ combinations"). The collapse preserves aggregate totals, loses per-collapsed-value attribution, and records collapse status on every emitted message so downstream UIs can surface it.

Problem

Per-unique-combination aggregation (see concepts/aggregate-tag-attribution) is the only shape that attributes aggregate statistics to individual tag values — but it collapses into per-execution messaging under two cardinality pathologies:

  1. One tag key has too many values: a request_id tag ensures every execution is unique; no aggregation possible.
  2. Tag combinations explode: low-cardinality keys compose multiplicatively; 6 keys × 10 values each produces 10⁶ potential combinations per pattern per interval.

Static solutions fail in one direction or the other: a hand-maintained list of tags to drop is brittle, and capping total message count with random sampling loses exactly the attribution the aggregation exists to provide.

Solution shape

Two observers running concurrently over the same data:

Observer 1 — per-key cardinality on a pattern (see concepts/per-pattern-tag-cardinality)

  • Track the distinct-value count for each tag key per query pattern over a rolling window (hours).
  • When the count exceeds a threshold (e.g. 20), collapse that key to * on that pattern for a sticky window (e.g. 1 hour).
  • Per-pattern scoping lets high-cardinality-but-pattern-correlated tags (source_location) pass through unscathed.
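A minimal sketch of the sticky per-key reducer, assuming an in-memory tracker per process; the class name, method names, and clock handling are illustrative, not from the source:

```python
import time
from collections import defaultdict

class PerKeyCardinalityReducer:
    """Sketch of Observer 1: track distinct values per (pattern, key);
    once a key exceeds the threshold on a pattern, collapse it to '*'
    for a sticky window. Defaults mirror the stated params (20 values,
    1 hour); everything else here is an assumption."""

    def __init__(self, threshold=20, sticky_seconds=3600, clock=time.monotonic):
        self.threshold = threshold
        self.sticky_seconds = sticky_seconds
        self.clock = clock
        self.values = defaultdict(set)   # (pattern, key) -> distinct values seen
        self.collapsed_until = {}        # (pattern, key) -> sticky-window expiry

    def observe(self, pattern, tags):
        """Record one execution's tags; return the tags with any
        over-cardinality key collapsed to '*'."""
        now = self.clock()
        out = {}
        for key, value in tags.items():
            pk = (pattern, key)
            expiry = self.collapsed_until.get(pk)
            if expiry is not None and now < expiry:
                out[key] = "*"           # still inside the sticky window
                continue
            seen = self.values[pk]
            seen.add(value)
            if len(seen) > self.threshold:
                # Threshold exceeded: collapse this key on this pattern
                # for the sticky window and reset tracking (this is the
                # coarse recovery the caveats section criticises).
                self.collapsed_until[pk] = now + self.sticky_seconds
                seen.clear()
                out[key] = "*"
            else:
                out[key] = value
        return out
```

Because the state is keyed by (pattern, key), a request_id-style tag is collapsed only on the patterns where it actually explodes; other patterns keep full attribution.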

Observer 2 — per-interval combination limit on a pattern (see concepts/per-interval-tag-combination-limit)

  • Within each aggregation interval (e.g. 15s), track the set of aggregates keyed by full tag tuple per query pattern.
  • When the set size exceeds a threshold (e.g. 50), greedy-collapse the highest-cardinality tag, merge the resulting identical aggregates, and repeat until under the threshold.
  • Per-interval scope — each window makes fresh collapse decisions; recovers fast when cardinality drops.
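The greedy per-interval collapse can be sketched as follows; the function name and the tuple-keyed aggregate representation are assumptions for illustration:

```python
from collections import defaultdict

def collapse_interval(aggregates, limit=50):
    """Sketch of Observer 2: within one interval, repeatedly collapse
    the tag key with the most distinct values to '*' and merge
    aggregates whose tuples become identical, until the number of
    distinct tuples is within the limit. `aggregates` maps a tag tuple
    (sorted (key, value) pairs) to a dict of additive stats."""
    collapsed_keys = []
    while len(aggregates) > limit:
        # Count distinct (non-collapsed) values per key across surviving tuples.
        distinct = defaultdict(set)
        for tup in aggregates:
            for key, value in tup:
                if value != "*":
                    distinct[key].add(value)
        if not distinct:
            break  # everything already collapsed; cannot reduce further
        victim = max(distinct, key=lambda k: len(distinct[k]))
        collapsed_keys.append(victim)
        merged = {}
        for tup, stats in aggregates.items():
            new_tup = tuple((k, "*" if k == victim else v) for k, v in tup)
            if new_tup in merged:
                # Merge: additive totals accumulate, so no execution is lost.
                for field, amount in stats.items():
                    merged[new_tup][field] = merged[new_tup].get(field, 0) + amount
            else:
                merged[new_tup] = dict(stats)
        aggregates = merged
    return aggregates, collapsed_keys
```

Because the function runs fresh on each interval's aggregates, a transient blowup collapses only the intervals it touches; the next interval starts clean.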

Collapse semantics (shared)

  • Collapse marker: replace all instances of the specific value with *. Aggregates that now share an identical tag tuple merge into one message, accumulating the previously-distinct statistics.
  • Total preservation: aggregate totals (count, runtime, rows-read) are unchanged — every execution is still accounted for.
  • Loss: attribution under the collapsed tag becomes impossible — "this key was collapsed; we kept the totals, lost the breakdown."
  • Observability: emit a flag on the message indicating which key was collapsed; the UI can display "percentage of tag values that are unknown for this key." Collapse is lossy but not silent.
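The shared semantics above (totals accumulate on merge; collapse status rides on every message) fit in a few lines; all field and function names here are hypothetical:

```python
def merge_stats(a, b):
    """Accumulate the statistics of two aggregates whose tag tuples
    became identical after a collapse. Totals are preserved because
    counts and sums simply add; only the per-value breakdown is lost."""
    return {field: a.get(field, 0) + b.get(field, 0)
            for field in set(a) | set(b)}

def emit_message(tag_tuple, stats, collapsed_keys):
    """Every emitted message records which keys were collapsed, so a
    downstream UI can show the fraction of unknown values per key:
    lossy, but not silent."""
    return {
        "tags": dict(tag_tuple),
        "stats": stats,                       # totals untouched by collapse
        "collapsed_keys": sorted(collapsed_keys),
    }
```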

Two reducers, not one

A single threshold does not suffice. The per-key reducer catches request_id but misses the combinatorial case where every key is fine individually. The per-combination reducer catches combinatorial explosion but over-collapses when one key is the obvious culprit (it has to collapse some key to get under threshold, possibly a low-value one). Running both lets each target its own failure mode with the tightest appropriate bound.

Generalisation

The pattern generalises beyond query-tagging to any observability aggregation pipeline keyed by a multi-dimensional label tuple:

  • Prometheus-style metrics: label cardinality is the canonical operational failure mode; dynamic per-metric label collapse would prevent cardinality explosions that currently require manual label hygiene.
  • OpenTelemetry attribute reduction: spans and metrics both have attribute tuples; the same two-reducer structure applies.
  • Structured log aggregation: if the pipeline rolls up log events by a set of structured fields, the same explosion modes apply.

The distinctive contribution is having two triggers with different time scales: the sticky per-key trigger catches persistent problems with coarse recovery; the per-interval trigger catches transient blowups with fine recovery. Observability pipelines that use only one tend to either over-collapse (if sticky) or under-react (if per-interval only).

Canonical production instance

PlanetScale Insights' aggregate-stream tagging (Source: sources/2026-04-21-planetscale-enhanced-tagging-in-postgres-query-insights):

  • Observer 1 params: threshold 20, window 1 hour, per-query-pattern scope.
  • Observer 2 params: threshold 50, interval 15s, greedy highest-cardinality-first collapse.
  • Output: aggregate messages flow Postgres extension → Kafka aggregate topic → ClickHouse; collapse status flags surface in the Insights UI as "percentage of tag values where the value is unknown."

Caveats

  • Fairness across patterns: when Observer 2 collapses a tag on a hot pattern but not on a cold one, the UX exhibits per-pattern differences that may confuse users. No disclosed mitigation.
  • Recovery asymmetry: the sticky per-key window means a transient cardinality spike (load test) causes a full hour of collapse even after the spike ends. No disclosed fast-recovery mechanism.
  • Threshold choice is empirical: 20 and 50 are stated without derivation. Customers with different workload shapes or retention budgets might want different thresholds; no tuning interface disclosed.
  • Collapse observability is quantitative only: UI shows percentage of unknown values, not which key was collapsed at what timestamp.
