
Metric aggregation as cardinality shield

Metric aggregation is the transformation of raw high-cardinality metrics into lower-cardinality aggregated series by dropping or collapsing expensive labels during ingestion. When placed as a pipeline tier in front of a TSDB, it acts as a cardinality shield: the TSDB's active-series count is bounded by the distinct aggregation keys, not by the upstream label space.

The shape

Raw input:

metric_name{pod_id="p-123", tenant_id="t-7", region="us-east"}
metric_name{pod_id="p-124", tenant_id="t-7", region="us-east"}
metric_name{pod_id="p-125", tenant_id="t-9", region="us-east"}

Aggregation rule: drop pod_id, group by (tenant_id, region):

metric_name{tenant_id="t-7", region="us-east"} = sum/avg/...
metric_name{tenant_id="t-9", region="us-east"} = sum/avg/...

TSDB sees 2 series instead of 3 — and, crucially, the growth rate is bounded by the count of (tenant, region) pairs, not the count of pod launches.
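The rule above can be sketched in a few lines. This is an illustrative Python sketch, not any particular aggregator's implementation; the sample values are made up to match the label examples:

```python
from collections import defaultdict

# Hypothetical raw samples: (labels, value), mirroring the example above.
raw = [
    ({"pod_id": "p-123", "tenant_id": "t-7", "region": "us-east"}, 4.0),
    ({"pod_id": "p-124", "tenant_id": "t-7", "region": "us-east"}, 6.0),
    ({"pod_id": "p-125", "tenant_id": "t-9", "region": "us-east"}, 5.0),
]

GROUP_BY = ("tenant_id", "region")  # aggregation key; pod_id is dropped

def aggregate(samples, group_by=GROUP_BY):
    """Collapse samples onto the aggregation key by summing values."""
    out = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels[k] for k in group_by)
        out[key] += value
    return dict(out)

# Output cardinality is bounded by distinct (tenant_id, region) pairs:
# 2 output series from 3 input series, no matter how many pods launch.
agg = aggregate(raw)
```

Launching more pods adds entries to `raw` but, as long as they carry existing (tenant, region) pairs, never adds keys to `agg`.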

What it buys

  • Bounded TSDB cardinality — the aggregation output series count grows with aggregation-key combinations, which are typically stable business dimensions (service, region, tenant) rather than churning infrastructure identifiers (pod, node, VM).
  • Surge absorption — an incident that multiplies raw label churn doesn't push through to the TSDB. Canonical data point from Databricks: a 2-5× metric surge during an infra incident produced only a 20% surge on Pantheon, because Telegraf absorbed the rest. See patterns/aggregation-shield-for-tsdb-cardinality.
  • Real-time ingestion preserved — aggregation runs in-pipeline, not in the query path, so pre-computed aggregate series are served at TSDB latency without query-time aggregation overhead.

What it costs

Aggregation drops the exact dimensions engineers need during incidents — "which pod crashed?", "which tenant is causing swap pressure?", "which node is a noisy neighbour?". To recover those dimensions for debugging, hyperscale deployments complement the TSDB + aggregation shield with a raw-data tier stored elsewhere. See patterns/dual-tier-observability-tsdb-plus-lakehouse.

Correctness complications

The aggregation tier is stateful — it holds running counters, percentile reservoirs, histogram buckets. Failure modes the design must handle:

  • Input pod restarts / label changes — counters must continue to increase monotonically rather than dip when an input series disappears.
  • Aggregator redeployments — in-memory state is lost; either it must be rebuilt (from Kafka / replay) or sticky routing must keep the same series at the same aggregator across restarts.
  • Load imbalance across aggregators — the auto-sharder must rebalance without breaking the first two invariants.
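The first invariant — an aggregated counter that never dips when an input series disappears or resets — can be sketched with a per-input ledger plus a retained offset. This is an illustrative sketch; the class and method names are hypothetical, not from any specific aggregator:

```python
class MonotonicSum:
    """Aggregate per-input cumulative counters into one monotone output.

    Covers two failure modes from the list above:
      - an input series disappearing (its last contribution is retained),
      - a counter reset (value drops after a pod restart).
    """
    def __init__(self):
        self.last = {}      # input series -> last cumulative value seen
        self.offset = 0.0   # contributions folded in from resets/departures

    def observe(self, series_id, value):
        prev = self.last.get(series_id, 0.0)
        if value < prev:
            # Counter reset: fold the old total into the offset so the
            # aggregate keeps increasing instead of dipping.
            self.offset += prev
        self.last[series_id] = value

    def retire(self, series_id):
        # Input series gone (pod deleted): keep its contribution forever.
        self.offset += self.last.pop(series_id, 0.0)

    def total(self):
        return self.offset + sum(self.last.values())
```

A pod restart then looks like: `total()` holds steady when the old series is retired, and a post-restart sample that is lower than the previous value raises the offset rather than lowering the aggregate.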

Databricks pairs Telegraf with Dicer's sticky routing to preserve in-memory state across redeployments, rather than incurring the cost and latency of Kafka-based replay.
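Sticky routing is commonly built on consistent hashing, so the same series key keeps landing on the same aggregator as the fleet scales. The sketch below is a generic consistent-hash router, not Dicer itself (which is internal to Databricks); all names are illustrative:

```python
import bisect
import hashlib

class StickyRouter:
    """Consistent-hash routing of series keys to aggregator instances."""
    def __init__(self, aggregators, vnodes=64):
        # Place each aggregator at many virtual points on a hash ring
        # so load spreads evenly across instances.
        self.ring = sorted(
            (self._hash(f"{a}#{v}"), a)
            for a in aggregators for v in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def route(self, series_key):
        # A given series key always maps to the same ring position, so
        # its in-memory aggregation state stays on one aggregator.
        i = bisect.bisect(self.points, self._hash(series_key)) % len(self.ring)
        return self.ring[i][1]
```

With virtual nodes, removing one aggregator reassigns only the keys it owned; every other series keeps its aggregator, and hence its in-memory state.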
