Serverless workload churn cardinality
Serverless and ephemeral-workload platforms multiply metric cardinality faster than steady-state fleet growth would suggest. The mechanism: label values such as pod IDs, VM IDs, and tenant IDs have vanishingly short lifetimes. Every time a new VM launches, its identifier becomes a new label value, and therefore a new unique series in the TSDB index, even if the total fleet size stays constant.
Why steady-state rates don't tell the whole story
If a fleet of 10,000 VMs persists for months, cardinality from the vm_id label stays around 10,000. If the same 10,000 VMs turn over every minute, the platform mints roughly 600,000 distinct vm_id values per hour; multiplied by the number of series each VM exposes, the TSDB's active-series count balloons into the millions, because the index retains references to recently-departed identifiers until retention expires.
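A back-of-the-envelope sketch of the effect, using the hypothetical numbers from the paragraph above (fleet size, churn interval, and series-per-VM are illustrative assumptions, not Databricks figures):

```python
# Back-of-the-envelope: churn, not fleet size, drives TSDB cardinality.
# All numbers are illustrative assumptions, not production figures.

FLEET_SIZE = 10_000          # VMs alive at any instant
CHURN_INTERVAL_MIN = 1       # each VM is replaced after ~1 minute
SERIES_PER_VM = 50           # assumed series carrying a vm_id label per VM
WINDOW_HOURS = 1             # how long the index still references departed IDs

def steady_state_series() -> int:
    """Cardinality if the same 10,000 VMs persist: bounded by fleet size."""
    return FLEET_SIZE * SERIES_PER_VM

def churned_series(window_hours: float) -> int:
    """Cardinality when VMs turn over: every launch mints a new vm_id,
    and the index keeps referencing it until retention expires."""
    launches = FLEET_SIZE * (window_hours * 60 / CHURN_INTERVAL_MIN)
    return int(launches) * SERIES_PER_VM

print(f"steady state : {steady_state_series():>12,} series")
print(f"1h of churn  : {churned_series(WINDOW_HOURS):>12,} series")
# steady state :      500,000 series
# 1h of churn  :   30,000,000 series
```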
Concretely at Databricks: "our serverless compute platform launches tens of millions of VMs daily." The churn rate is the first-order cardinality driver, not the instantaneous fleet size.
Design responses
- Memory-retention tiers matched to workload lifespan — run a shorter memory-retention window on the Thanos Receive group that ingests ephemeral-workload metrics. Databricks runs a 30-minute retention for serverless workloads vs 2 hours for long-lived services. See patterns/thanos-receive-groups-with-memory-retention-tiers.
- Aggregation shield — drop the churn-prone labels (pod ID, VM ID) during ingestion, keep only stable dimensions (region, service, tenant). Bounds TSDB cardinality to the number of distinct aggregation keys; a minimal sketch follows this list. See patterns/aggregation-shield-for-tsdb-cardinality.
- Raw-data tier elsewhere — keep full-cardinality raw data in a horizontally-scalable lakehouse (see systems/hydra) for incident debugging, not in the TSDB.
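A minimal sketch of the aggregation-shield idea from the list above: strip churn-prone labels before samples reach the TSDB and sum into the surviving stable key. The label names, the sample representation, and the sum aggregation are illustrative assumptions; Databricks' actual shield is built from Telegraf and Dicer.

```python
from collections import defaultdict

# Hypothetical label sets; only the stable dimensions survive ingestion.
CHURN_LABELS = {"vm_id", "pod_id", "instance"}
STABLE_LABELS = ("region", "service", "tenant")

def aggregation_key(labels: dict) -> tuple:
    """Project a raw label set onto the stable dimensions only."""
    return tuple(labels.get(name, "") for name in STABLE_LABELS)

def shield(samples):
    """Collapse per-VM samples into per-(region, service, tenant) sums.
    TSDB cardinality is now bounded by the number of distinct keys,
    regardless of how many VMs launched during the interval."""
    aggregated = defaultdict(float)
    for labels, value in samples:
        aggregated[aggregation_key(labels)] += value
    return aggregated

raw = [
    ({"region": "us-west", "service": "sql", "tenant": "a", "vm_id": "vm-1"}, 3.0),
    ({"region": "us-west", "service": "sql", "tenant": "a", "vm_id": "vm-2"}, 5.0),
    ({"region": "us-west", "service": "sql", "tenant": "b", "vm_id": "vm-9"}, 1.0),
]
for key, value in shield(raw).items():
    print(key, value)
# ('us-west', 'sql', 'a') 8.0
# ('us-west', 'sql', 'b') 1.0
```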
Seen in
- sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — canonical production disclosure. "As more workloads move over to serverless, the infra we're monitoring becomes higher-churn, and the lifetime of these identifier labels keeps getting shorter." Databricks' response: 30-minute Receive-group retention for ephemeral workloads (vs 2h for persistent services), a Telegraf + Dicer aggregation shield, and Hydra for raw-data access outside the TSDB.