CONCEPT Cited by 1 source

Serverless workload churn cardinality

Serverless and ephemeral-workload platforms multiply metric cardinality faster than steady-state fleet growth would suggest. The mechanism: label values (pod IDs, VM IDs, tenant IDs) have vanishingly short lifetimes. Every time a new VM launches, its identifier becomes a new label value, and therefore a new unique series in the TSDB index, even if the total fleet size stays constant.

Why steady-state rates don't tell the whole story

If a fleet of 10,000 VMs persists for months, cardinality from the vm_id label stays at ~10,000. If the same 10,000 VMs turn over every minute, the TSDB's active-series count can balloon into the millions per hour, because the index retains references to recently departed identifiers until retention expires.
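The arithmetic above can be sketched as a steady-state turnover approximation. The function and its numbers are illustrative, not taken from any vendor's figures:

```python
def active_series(fleet_size: int, lifetime_minutes: float, window_minutes: float) -> int:
    """Rough count of unique label values seen in a retention window.

    With `fleet_size` instances each living `lifetime_minutes`, roughly
    fleet_size * (window_minutes / lifetime_minutes) distinct identifiers
    appear inside the window; it never drops below the instantaneous fleet size.
    """
    turnover = max(window_minutes / lifetime_minutes, 1.0)
    return int(fleet_size * turnover)

# Long-lived fleet: 10,000 VMs alive for ~90 days, 2-hour retention window.
print(active_series(10_000, 60 * 24 * 90, 120))  # -> 10000

# Same fleet size, 1-minute VM lifetime, same 2-hour window.
print(active_series(10_000, 1, 120))  # -> 1200000
```

Churn rate, not fleet size, dominates the second case: the fleet never exceeds 10,000 VMs, yet the index must track 1.2 million vm_id values per window.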

Concretely at Databricks: "our serverless compute platform launches tens of millions of VMs daily." The churn rate is the first-order cardinality driver, not the instantaneous fleet size.

Design responses

  • Memory-retention tiers matched to workload lifespan — run a shorter memory-retention window on the Thanos Receive group that ingests ephemeral-workload metrics. Databricks runs a 30-minute retention for serverless workloads vs 2 hours for long-lived services. See patterns/thanos-receive-groups-with-memory-retention-tiers.
  • Aggregation shield — drop the churn-prone labels (pod ID, VM ID) during ingestion, keep only stable dimensions (region, service, tenant). Bounds TSDB cardinality to the number of distinct aggregation keys. See patterns/aggregation-shield-for-tsdb-cardinality.
  • Raw-data tier elsewhere — keep full-cardinality raw data in a horizontally-scalable lakehouse (see systems/hydra) for incident debugging, not in the TSDB.
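A minimal sketch of the aggregation-shield idea, using hypothetical sample dicts and label names; Databricks' actual Telegraf + Dicer pipeline is not described at this level in the source:

```python
from collections import defaultdict

# Stable dimensions to keep; churn-prone labels (vm_id, pod_id) are
# simply absent from this tuple and therefore dropped at ingestion.
STABLE_LABELS = ("region", "service", "tenant")

def shield(samples: list[dict]) -> dict[tuple, float]:
    """Sum samples by stable labels only.

    Downstream TSDB cardinality is bounded by the number of distinct
    (region, service, tenant) keys, independent of VM turnover.
    """
    aggregated: dict[tuple, float] = defaultdict(float)
    for s in samples:
        key = tuple(s["labels"].get(label, "") for label in STABLE_LABELS)
        aggregated[key] += s["value"]
    return dict(aggregated)

samples = [
    {"labels": {"region": "us-west", "service": "sql", "tenant": "a", "vm_id": "vm-1"}, "value": 2.0},
    {"labels": {"region": "us-west", "service": "sql", "tenant": "a", "vm_id": "vm-2"}, "value": 3.0},
]
print(shield(samples))  # -> {('us-west', 'sql', 'a'): 5.0}
```

Two VMs collapse into one series; a million VMs behind the same (region, service, tenant) key still produce one series, which is the bound the bullet above describes.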

Seen in

  • sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — canonical production disclosure. "As more workloads move over to serverless, the infra we're monitoring becomes higher-churn, and the lifetime of these identifier labels keeps getting shorter." Databricks' response: 30-minute Receive-group retention for ephemeral workloads (vs 2h for persistent services), a Telegraf + Dicer aggregation shield, and Hydra for raw-data access outside the TSDB.