Skip to content

CONCEPT Cited by 1 source

Metric cardinality

Metric cardinality is the number of unique combinations of label values a metric has — e.g., cpu_usage{pod="...", tenant="..."} has one series per distinct (pod, tenant) pair. Cardinality is the primary scaling factor for a TSDB: memory cost, query cost, storage cost, and block-compaction cost all scale with the count of active series, not with the raw sample rate.

Why it dominates TSDB scaling

A TSDB indexes every (metric_name, label_set) pair it has ever seen in memory; queries touch the index to resolve series before fetching sample data. A 10× increase in pod count with a pod_id label means 10× more series, 10× more index memory, 10× more query fan-out, and 10× more blocks to compact. Sample rate is linear; cardinality is the knob that multiplies.

Serverless amplifies the problem: short-lived workloads (serverless functions, ephemeral VMs, Kubernetes pods on autoscalers) mean each label value has a vanishingly short lifetime. The cardinality keeps growing even if the instantaneous fleet size is stable, because label values churn — each new pod ID is a new series that the TSDB must index before it disappears. See concepts/serverless-workload-churn-cardinality.

Why engineers still add high-cardinality labels

Metric owners add labels like node_id, pod_id, tenant_id, request_id because those dimensions are exactly what makes a metric useful during incidents — aggregated metrics tell you "region CPU is elevated" but can't tell you "which tenant is causing swap pressure, which node crashed, which shard is isolated." High cardinality is a feature, not a bug; the scaling problem is real.

Responses

Three families of responses:

Most hyperscale observability platforms end up combining all three, with tight interface-stability discipline to make the split invisible to users (see concepts/unified-metric-semantics).

Seen in

  • sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — canonical production framing. "Cardinality is the primary scaling factor for a TSDB, and growth in the cardinality of existing metrics increases costs and scaling pressure on Pantheon." At Databricks: serverless compute launches "tens of millions of VMs daily," which together with the >5B active-timeseries fleet motivated a three-pronged response (Pantheon tiered-storage TSDB, Telegraf aggregation shield, Hydra raw-lakehouse tier).
Last updated · 451 distilled / 1,324 read