

TSDB scaling bottleneck

Scaling a time-series database is an infrequent, disruptive operation for most companies: you provision for peak, over-provision, and only scale up when the growth curve forces you to. At hyperscale the inverse holds: growth is fast enough that TSDB scaling becomes a frequent operation, often daily, and the #1 reliability problem in the monitoring stack.

Why scaling a TSDB is hard

A TSDB holds three kinds of state (see the sketch after this list):

  • An in-memory index — every active (metric, label_set) combination, read on every query.
  • On-disk blocks — the last ~24h of samples, written out in time-sorted blocks that need to be periodically compacted.
  • Object-storage blocks (if using tiered storage like Thanos) — long-term retention.
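
A minimal sketch of those three tiers as data structures; the type and field names are illustrative, not Prometheus's actual internals:

```go
package tsdb

import "time"

// Series identifies one active (metric, label_set) combination.
type Series struct {
	Metric string
	Labels map[string]string
}

// Sample is a single timestamped value for a series.
type Sample struct {
	Timestamp time.Time
	Value     float64
}

// BlockRef points at an immutable, time-bounded block of samples.
type BlockRef struct {
	MinTime, MaxTime time.Time
	Path             string // local path or object-storage key
}

// Node is what one TSDB node has to hold and keep consistent.
type Node struct {
	// Hot tier: in-memory index over every active series, read on every query.
	Index map[uint64]Series // series hash -> series

	// Warm tier: the last ~24h of samples, written out as time-sorted
	// blocks and periodically compacted.
	Head       map[uint64][]Sample
	DiskBlocks []BlockRef

	// Cold tier (with tiered storage, e.g. Thanos): long-retention blocks
	// already uploaded to S3 / GCS / Azure Blob.
	ObjectStoreBlocks []BlockRef
}
```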

Scaling up means adding replicas / shards / nodes. The in-memory index must be redistributed; partial blocks must be handed off or re-ingested; the hash ring / assignment must converge; the querier fleet must learn about the new nodes. All while continuing to ingest samples at full rate. Any pause or dropped-sample window is an observability outage.
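
What "the hash ring must converge" costs can be made concrete with a toy consistent-hash ring. This is a generic sketch, not any particular TSDB's ring implementation; the node names, vnode count, and series count are made up:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a toy consistent-hash ring: virtual-node points sorted on a circle,
// and a series is owned by the first point clockwise of its hash.
type ring struct {
	points []uint32
	owner  map[uint32]string
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(nodes []string, vnodes int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		for v := 0; v < vnodes; v++ {
			p := hash32(fmt.Sprintf("%s-%d", n, v))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

func (r *ring) ownerOf(series string) string {
	h := hash32(series)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	series := make([]string, 100000)
	for i := range series {
		series[i] = fmt.Sprintf(`http_requests_total{pod="pod-%d"}`, i)
	}

	before := newRing([]string{"tsdb-0", "tsdb-1", "tsdb-2", "tsdb-3"}, 64)
	after := newRing([]string{"tsdb-0", "tsdb-1", "tsdb-2", "tsdb-3", "tsdb-4"}, 64)

	// Every moved series means an index-entry handoff and a partial-block
	// handoff or re-ingest, all while samples keep arriving at full rate.
	moved := 0
	for _, s := range series {
		if before.ownerOf(s) != after.ownerOf(s) {
			moved++
		}
	}
	fmt.Printf("series that change owner when one node joins: %d / %d (~%.0f%%)\n",
		moved, len(series), 100*float64(moved)/float64(len(series)))
}
```

Under consistent hashing roughly 1/N of the series change owner when a node joins; with naive modulo assignment almost all of them would. Either way, each moved series is state that has to be handed off without pausing ingest.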

Why decoupling compute from storage helps

The single biggest architectural move: tiered storage with object storage as the cold tier. Historical data lives in S3 / GCS / Azure Blob, which has effectively unbounded horizontal capacity. The TSDB cluster itself only needs to hold hot (memory) + warm (on-disk 24h) data, which scales with active series, not historical series. Scaling up adds nodes without needing to rebalance historical data across them.
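
A back-of-the-envelope view of why this helps; the per-series and per-sample constants below are illustrative assumptions, not numbers from the source:

```go
package main

import "fmt"

func main() {
	// Illustrative assumptions (not figures from the source):
	const (
		activeSeries       = 200e6 // concurrently active (metric, label_set) combos
		bytesPerSeriesRAM  = 4e3   // index entry + head chunks per active series
		bytesPerSampleDisk = 1.5   // compressed bytes per sample in a block
		scrapeIntervalSec  = 15.0
		warmWindowHours    = 24.0
		retentionDays      = 365.0
	)

	// Hot and warm tiers depend only on *active* series and the ~24h window.
	hotBytes := activeSeries * bytesPerSeriesRAM
	warmBytes := activeSeries * (warmWindowHours * 3600 / scrapeIntervalSec) * bytesPerSampleDisk

	// The cold tier grows with retention, i.e. with historical samples.
	coldBytes := activeSeries * (retentionDays * 24 * 3600 / scrapeIntervalSec) * bytesPerSampleDisk

	fmt.Printf("hot  (RAM, scales with active series):        %6.1f TB\n", hotBytes/1e12)
	fmt.Printf("warm (local disk, ~24h of blocks):            %6.1f TB\n", warmBytes/1e12)
	fmt.Printf("cold (object storage, scales with retention): %6.1f TB\n", coldBytes/1e12)
}
```

Only the first two numbers bound the size of the TSDB cluster itself; the third lands in object storage, which is why scale-ups don't have to move historical data.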

See systems/thanos for the canonical implementation of this pattern. Prometheus itself is bounded by single-instance capacity — real hyperscale deployments all layer one of Thanos / Cortex / Mimir / VictoriaMetrics on top, specifically to get the object-storage cold tier.

Why aggregation still matters

Even with tiered storage, the in-memory + on-disk tiers scale with active cardinality, and that's dominated by label values, not sample rate. A serverless workload launching tens of millions of VMs per day creates unbounded label churn.

The response: a pre-storage aggregation tier that drops expensive labels before samples reach the TSDB. See patterns/aggregation-shield-for-tsdb-cardinality.
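
A hypothetical sketch of such a tier; the dropped label names and the sum-aggregation policy are assumptions for illustration, not the actual Databricks design. Incoming per-VM samples are re-keyed without their high-cardinality labels and combined before anything reaches the TSDB:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type sample struct {
	metric string
	labels map[string]string
	value  float64
}

// Labels to strip before storage (hypothetical choices).
var dropLabels = map[string]bool{"instance_id": true, "pod_uid": true}

// aggregationKey rebuilds the series identity from the surviving labels only.
func aggregationKey(s sample) string {
	keys := make([]string, 0, len(s.labels))
	for k := range s.labels {
		if !dropLabels[k] {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	parts := []string{s.metric}
	for _, k := range keys {
		parts = append(parts, k+"="+s.labels[k])
	}
	return strings.Join(parts, ",")
}

func main() {
	// Three VMs emit the same metric; only the instance_id differs.
	in := []sample{
		{"vm_cpu_seconds_total", map[string]string{"region": "us-west", "instance_id": "vm-001"}, 12},
		{"vm_cpu_seconds_total", map[string]string{"region": "us-west", "instance_id": "vm-002"}, 7},
		{"vm_cpu_seconds_total", map[string]string{"region": "us-west", "instance_id": "vm-003"}, 3},
	}

	out := map[string]float64{}
	for _, s := range in {
		out[aggregationKey(s)] += s.value
	}

	// One series reaches the TSDB instead of one per VM.
	for k, v := range out {
		fmt.Printf("%s -> %g\n", k, v)
	}
}
```

The cardinality the TSDB sees is now bounded by the surviving label values (here, region), not by how many VMs launched that day.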

Why purpose-built control-plane automation matters

Generic Kubernetes automation (HPA, StatefulSet rolling updates) is insufficient for a hyperscale stateful TSDB. The control loop must preserve quorum invariants during rollouts, gate node removals on readiness, and remediate data-specific failure modes (WAL corruption, pod overload). At Databricks' scale these automations fire dozens of times per week; manual operation is not an option. See patterns/purpose-built-control-plane-for-stateful-tsdb.
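
A heavily simplified sketch of the invariants such a control loop enforces; it is illustrative only, not Databricks' actual automation or a Kubernetes operator API:

```go
package main

import "fmt"

type node struct {
	name       string
	ready      bool
	inQuorum   bool
	walCorrupt bool
}

type cluster struct {
	nodes        []node
	quorumNeeded int
}

func (c *cluster) readyInQuorum() int {
	n := 0
	for _, nd := range c.nodes {
		if nd.ready && nd.inQuorum {
			n++
		}
	}
	return n
}

// reconcile applies at most one disruptive action per pass, and only when the
// quorum invariant would survive it.
func (c *cluster) reconcile() {
	// 1. Remediate data-specific failures first (e.g. WAL corruption).
	for i, nd := range c.nodes {
		if nd.walCorrupt {
			fmt.Printf("remediating WAL on %s (re-sync from peers)\n", nd.name)
			c.nodes[i].walCorrupt = false
			return
		}
	}
	// 2. Gate node removals / restarts on fleet readiness: never let a rollout
	//    drop the cluster below quorum.
	if c.readyInQuorum() <= c.quorumNeeded {
		fmt.Println("holding rollout: removing a node now would break quorum")
		return
	}
	fmt.Println("safe to take one node out for rollout")
}

func main() {
	c := &cluster{
		nodes: []node{
			{"tsdb-0", true, true, false},
			{"tsdb-1", true, true, false},
			{"tsdb-2", false, false, true}, // overloaded pod with a corrupt WAL
		},
		quorumNeeded: 2,
	}
	// Real control planes run this periodically / on cluster events.
	for i := 0; i < 3; i++ {
		c.reconcile()
	}
}
```

The point is the ordering: remediation and readiness gating come before any disruptive step, which is what generic rolling-update machinery does not know how to do for this kind of stateful system.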

Seen in

  • sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring (canonical framing datum): "Databricks' old TSDBs had been built for an order of magnitude lower scale, and became a major bottleneck for us in recent years. In fact, the #1 reliability problem for the entire monitoring infrastructure was the difficulty of scaling up our TSDBs. This is an infrequent operation for many other companies, but something we needed to do almost daily given Databricks' exponential growth." Motivated the Pantheon + Telegraf + Hydra three-tier response.