
Metric aggregation as cardinality shield

Metric aggregation is the transformation of raw high-cardinality metrics into lower-cardinality aggregated series by dropping or collapsing expensive labels during ingestion. When placed as a pipeline tier in front of a TSDB, it acts as a cardinality shield: the TSDB's active-series count is bounded by the distinct aggregation keys, not by the upstream label space.

The shape

Raw input:

metric_name{pod_id="p-123", tenant_id="t-7", region="us-east"}
metric_name{pod_id="p-124", tenant_id="t-7", region="us-east"}
metric_name{pod_id="p-125", tenant_id="t-9", region="us-east"}

Aggregation rule: drop pod_id, group by (tenant_id, region):

metric_name{tenant_id="t-7", region="us-east"} = sum/avg/...
metric_name{tenant_id="t-9", region="us-east"} = sum/avg/...

TSDB sees 2 series instead of 3 — and, crucially, the growth rate is bounded by the count of (tenant, region) pairs, not the count of pod launches.
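The rule above can be sketched in a few lines. This is an illustrative Python sketch, not any particular aggregator's implementation; the sample values are made up to match the label examples:

```python
from collections import defaultdict

# Hypothetical raw samples: (labels, value), mirroring the example above.
raw = [
    ({"pod_id": "p-123", "tenant_id": "t-7", "region": "us-east"}, 4.0),
    ({"pod_id": "p-124", "tenant_id": "t-7", "region": "us-east"}, 6.0),
    ({"pod_id": "p-125", "tenant_id": "t-9", "region": "us-east"}, 5.0),
]

GROUP_BY = ("tenant_id", "region")  # aggregation key; pod_id is dropped

def aggregate(samples, group_by=GROUP_BY):
    """Collapse samples onto the aggregation key by summing values."""
    out = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels[k] for k in group_by)
        out[key] += value
    return dict(out)

# Output cardinality is bounded by distinct (tenant_id, region) pairs:
# 2 output series from 3 input series, no matter how many pods launch.
agg = aggregate(raw)
```

Launching more pods adds entries to `raw` but, as long as they carry existing (tenant, region) pairs, never adds keys to `agg`.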

What it buys

  • Bounded TSDB cardinality — the aggregation output series count grows with aggregation-key combinations, which are typically stable business dimensions (service, region, tenant) rather than churning infrastructure identifiers (pod, node, VM).
  • Surge absorption — an incident that multiplies raw label churn doesn't push through to the TSDB. Canonical data point from Databricks: a 2-5× metric surge during an infra incident produced only a 20% surge on Pantheon, because Telegraf absorbed the rest. See patterns/aggregation-shield-for-tsdb-cardinality.
  • Real-time ingestion preserved — aggregation runs in-pipeline, not in the query path, so pre-computed aggregate series are served at TSDB latency without query-time aggregation overhead.

What it costs

Aggregation drops the exact dimensions engineers need during incidents — "which pod crashed?", "which tenant is causing swap pressure?", "which node is a noisy neighbour?". To recover those dimensions for debugging, hyperscale deployments complement the TSDB + aggregation shield with a raw-data tier stored elsewhere. See patterns/dual-tier-observability-tsdb-plus-lakehouse.

Correctness complications

The aggregation tier is stateful — it holds running counters, percentile reservoirs, histogram buckets. Failure modes the design must handle:

  • Input pod restarts / label changes — counters must continue to increase monotonically rather than dip when an input series disappears.
  • Aggregator redeployments — in-memory state is lost; either it must be rebuilt (from Kafka / replay) or sticky routing must keep the same series at the same aggregator across restarts.
  • Load imbalance across aggregators — the auto-sharder must rebalance without breaking the first two invariants.
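The first invariant — an aggregated counter that never dips when an input series disappears or resets — can be sketched with a per-input ledger plus a retained offset. This is an illustrative sketch; the class and method names are hypothetical, not from any specific aggregator:

```python
class MonotonicSum:
    """Aggregate per-input cumulative counters into one monotone output.

    Covers two failure modes from the list above:
      - an input series disappearing (its last contribution is retained),
      - a counter reset (value drops after a pod restart).
    """
    def __init__(self):
        self.last = {}      # input series -> last cumulative value seen
        self.offset = 0.0   # contributions folded in from resets/departures

    def observe(self, series_id, value):
        prev = self.last.get(series_id, 0.0)
        if value < prev:
            # Counter reset: fold the old total into the offset so the
            # aggregate keeps increasing instead of dipping.
            self.offset += prev
        self.last[series_id] = value

    def retire(self, series_id):
        # Input series gone (pod deleted): keep its contribution forever.
        self.offset += self.last.pop(series_id, 0.0)

    def total(self):
        return self.offset + sum(self.last.values())
```

A pod restart then looks like: `total()` holds steady when the old series is retired, and a post-restart sample that is lower than the previous value raises the offset rather than lowering the aggregate.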

Databricks pairs Telegraf with Dicer's sticky routing to preserve in-memory state across redeployments, rather than incurring the cost and latency of Kafka-based replay.
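Sticky routing is commonly built on consistent hashing, so the same series key keeps landing on the same aggregator as the fleet scales. The sketch below is a generic consistent-hash router, not Dicer itself (which is internal to Databricks); all names are illustrative:

```python
import bisect
import hashlib

class StickyRouter:
    """Consistent-hash routing of series keys to aggregator instances."""
    def __init__(self, aggregators, vnodes=64):
        # Place each aggregator at many virtual points on a hash ring
        # so load spreads evenly across instances.
        self.ring = sorted(
            (self._hash(f"{a}#{v}"), a)
            for a in aggregators for v in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def route(self, series_key):
        # A given series key always maps to the same ring position, so
        # its in-memory aggregation state stays on one aggregator.
        i = bisect.bisect(self.points, self._hash(series_key)) % len(self.ring)
        return self.ring[i][1]
```

With virtual nodes, removing one aggregator reassigns only the keys it owned; every other series keeps its aggregator, and hence its in-memory state.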
