CONCEPT Cited by 1 source
Metric cardinality¶
Metric cardinality is the number of unique combinations of
label values a metric has — e.g., cpu_usage{pod="...",
tenant="..."} has one series per distinct (pod, tenant)
pair. Cardinality is the primary scaling factor for a TSDB:
memory cost, query cost, storage cost, and block-compaction
cost all scale with the count of active series, not with the
raw sample rate.
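A minimal sketch (hypothetical data, not from the source) of what "unique combinations of label values" means in practice: cardinality counts distinct series, not samples.

```python
# Hypothetical samples: (metric name, label set). Repeated reports of the
# same label set do not add cardinality; only a new combination does.
samples = [
    ("cpu_usage", {"pod": "pod-a", "tenant": "t1"}),
    ("cpu_usage", {"pod": "pod-a", "tenant": "t1"}),  # same series, no new cardinality
    ("cpu_usage", {"pod": "pod-b", "tenant": "t1"}),
    ("cpu_usage", {"pod": "pod-c", "tenant": "t2"}),
]

# One time series per distinct (metric_name, label set) pair.
series = {(name, frozenset(labels.items())) for name, labels in samples}
print(len(series))  # 3 -- the metric's cardinality, regardless of sample count
```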
Why it dominates TSDB scaling¶
A TSDB keeps an in-memory index of every (metric_name, label_set) pair it has ever seen; queries go through that index to resolve series before fetching sample data. A 10× increase in pod count with a pod_id label means 10× more series, 10× more index memory, 10× more query fan-out, and 10× more blocks to compact. Sample rate adds cost linearly; cardinality is the multiplier.
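A back-of-the-envelope sketch of the multiplication (hypothetical numbers): the index is keyed by the full label set, so series count is roughly the product of the independent label-value counts, and a 10× jump in one label's value count multiplies everything keyed by series.

```python
def series_count(metric_names: int, pod_ids: int, other_label_values: int) -> int:
    # Series count is roughly the product of value counts for labels that
    # vary independently; index memory, query fan-out, and compaction work
    # all scale with this number, not with samples per second.
    return metric_names * pod_ids * other_label_values

before = series_count(metric_names=100, pod_ids=1_000, other_label_values=5)
after = series_count(metric_names=100, pod_ids=10_000, other_label_values=5)  # 10x pods
print(before, after, after // before)  # 500000 5000000 10
```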
Serverless amplifies the problem: short-lived workloads (serverless functions, ephemeral VMs, Kubernetes pods under autoscalers) mean each label value has a vanishingly short lifetime. Cardinality keeps growing even when the instantaneous fleet size is stable, because label values churn: each new pod ID is a new series that the TSDB must index before it disappears. See concepts/serverless-workload-churn-cardinality.
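A small churn simulation (hypothetical fleet size and replacement rate) showing why a stable fleet can still drive unbounded cardinality growth:

```python
import uuid

FLEET_SIZE = 100   # instantaneous pod count never exceeds this
HOURS = 24         # every pod is replaced each hour (hypothetical churn rate)

ever_indexed = set()
for _ in range(HOURS):
    # Autoscaler replaces the whole fleet: all-new pod IDs, same fleet size.
    live_pods = {f"pod-{uuid.uuid4()}" for _ in range(FLEET_SIZE)}
    ever_indexed |= live_pods  # the TSDB must index every ID it has ever seen

print(len(ever_indexed))  # 2400 series indexed for a fleet that never exceeded 100
```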
Why engineers still add high-cardinality labels¶
Metric owners add labels like node_id, pod_id, tenant_id,
request_id because those dimensions are exactly what makes
a metric useful during incidents — aggregated metrics tell you
"region CPU is elevated" but can't tell you "which tenant is
causing swap pressure, which node crashed, which shard is
isolated." High cardinality is a feature, not a bug; the
scaling problem is real.
Responses¶
Three families of responses:
- Aggregation shield — drop expensive labels before they reach the TSDB, trading incident-debugging detail for bounded TSDB scaling (sketched below). See concepts/metric-aggregation-as-cardinality-shield.
- Limit / reject at ingestion — enforce per-tenant or per-metric cardinality caps at the router / ingestion tier. Simple but rejects data.
- Store raw data elsewhere — keep raw high-cardinality data in a lakehouse / OLAP store that scales horizontally with object storage, and let the TSDB serve only aggregated views. See patterns/dual-tier-observability-tsdb-plus-lakehouse.
Most hyperscale observability platforms end up combining all three, with tight interface-stability discipline to make the split invisible to users (see concepts/unified-metric-semantics).
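A toy relay illustrating the first two response families together (hypothetical label names and cap, not the source's Telegraf or router configuration): drop the per-pod label and pre-aggregate, and reject new series for a tenant once a cap is hit.

```python
from collections import defaultdict

DROP_LABELS = {"pod", "pod_id"}   # aggregation shield: strip before the TSDB
MAX_SERIES_PER_TENANT = 10_000    # ingestion-time cardinality cap

def shield(samples):
    aggregated = defaultdict(float)
    tenant_series = defaultdict(set)
    for name, labels, value in samples:
        kept = frozenset((k, v) for k, v in labels.items() if k not in DROP_LABELS)
        key = (name, kept)
        tenant = labels.get("tenant", "unknown")
        cap_hit = len(tenant_series[tenant]) >= MAX_SERIES_PER_TENANT
        if key not in tenant_series[tenant] and cap_hit:
            continue  # reject the new series: the cap trades data loss for bounded cardinality
        tenant_series[tenant].add(key)
        aggregated[key] += value  # sum away the pod dimension
    return aggregated

print(shield([
    ("cpu_usage", {"pod": "pod-a", "tenant": "t1"}, 0.4),
    ("cpu_usage", {"pod": "pod-b", "tenant": "t1"}, 0.6),
]))  # one aggregated series for tenant t1 instead of one per pod
```

In the third family, the raw per-pod detail would live only in the lakehouse/OLAP tier, with the TSDB serving the aggregated view.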
Seen in¶
- sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — canonical production framing. "Cardinality is the primary scaling factor for a TSDB, and growth in the cardinality of existing metrics increases costs and scaling pressure on Pantheon." At Databricks: serverless compute launches "tens of millions of VMs daily," which together with the >5B active-timeseries fleet motivated a three-pronged response (Pantheon tiered-storage TSDB, Telegraf aggregation shield, Hydra raw-lakehouse tier).
Related¶
- systems/prometheus — the canonical label-based TSDB
- systems/thanos — scales Prometheus but still bounded by cardinality
- systems/pantheon — Databricks' scaled Thanos fork
- systems/hydra — lakehouse-native alternative
- concepts/tsdb-scaling-bottleneck
- concepts/serverless-workload-churn-cardinality
- concepts/metric-aggregation-as-cardinality-shield
- concepts/observability
- patterns/aggregation-shield-for-tsdb-cardinality