Databricks — 10 trillion samples a day: Scaling beyond traditional monitoring infra¶
Summary¶
Databricks' monitoring infrastructure (previously built for an order-of-magnitude lower scale) has more than tripled in the last year to 5 billion active in-memory timeseries and >10 trillion samples ingested per day across ~70 cloud regions on three major clouds. The post describes three coupled architectural responses: (1) Pantheon — a Databricks fork of the CNCF Thanos TSDB project scaled to 160+ instances (largest: 300M in-memory timeseries, ~1,000 PromQL QPS) with two Receive groups on memory-retention tiers (2h for long-lived, 30m for ephemeral serverless workloads) and a purpose-built control plane (Rollout Operator / Hashring Controller / Autoscaling + Self-Healing Controller) replacing ad-hoc Kubernetes automation; (2) an aggregation pipeline built on Telegraf + the Dicer auto-sharder that shields Pantheon from long-term cardinality growth by dropping expensive labels during ingestion — sustaining >1 GB/s in the largest region and absorbing an infra-incident-driven 2-5× metric surge so Pantheon only saw a 20% bump; (3) Hydra — a lakehouse-native platform for raw high-cardinality troubleshooting data: Spark Structured Streaming + Databricks Auto Loader ingest 20 billion unaggregated active timeseries into Delta Lake with ~5 min end-to-end freshness and ~50× cheaper storage than Thanos, queried through a PromQL-to-SQL translation layer so Grafana and existing dashboards work against Delta tables unmodified. Migration savings: "millions of dollars in annual cloud costs while reducing monitoring infrastructure downtime by ~5× and eliminating many sources of manual toil."
Key takeaways¶
- TSDB scaling was the #1 reliability problem in the entire monitoring stack. "Databricks' old TSDBs had been built for an order of magnitude lower scale, and became a major bottleneck for us in recent years. In fact, the #1 reliability problem for the entire monitoring infrastructure was the difficulty of scaling up our TSDBs. This is an infrequent operation for many other companies, but something we needed to do almost daily given Databricks' exponential growth." (Source: sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring). Canonical instance of TSDB scaling as a first-order reliability concern at hyperscale.
- Thanos tiered storage as the foundational primitive for Pantheon. "A key element of Thanos is its tiered storage architecture. The most recent timeseries are kept in-memory, the last 24 hours' timeseries are kept on-disk, and all older data is kept on object storage. This means alerts and other real-time queries can meet strict performance requirements ... At the same time, leveraging object storage allows the system to essentially decouple compute from storage; a cluster can scale up without needing to rebalance all its historical data across database nodes." Canonical instance of three-tier hot/warm/cold storage applied to the TSDB altitude.
- Two Receive groups on distinct memory-retention tiers. "We deploy two Receive groups with distinct memory-retention policies: one optimized for long-lived timeseries from persistent services, keeping two hours of samples in memory, and another optimized for short-lived timeseries from Databricks' ephemeral workloads, keeping only 30 minutes' worth in memory. This split reflects the lifespan we observed for serverless workloads at Databricks, and significantly reduces memory footprint and cloud cost while preserving correctness." Canonical instance of patterns/thanos-receive-groups-with-memory-retention-tiers and canonical production datum for concepts/serverless-workload-churn-cardinality (serverless VM launch rate drives label-cardinality growth, short-lived VMs justify shorter memory retention).
- Three isolated StatefulSets per Receive group — preserving three-way replication with stronger isolation. "Each group is intentionally implemented as three isolated Kubernetes StatefulSets, corresponding to three replicas, instead of a single large hash ring. This design preserves three-way replication with quorum writes, while providing stronger operational and data isolation. This setup allows us to roll or restart an entire StatefulSet in parallel during releases or node rotations without violating quorum or impacting write availability, which materially simplifies day-to-day operations." Anti-pattern counterpoint: the naive "one large hash ring" would couple all three replicas into a single operational blast radius; splitting them into three StatefulSets lets the team roll one replica at a time without violating quorum.
- At-least-once uploads — only 2 of 3 StatefulSets upload blocks. "To further optimize cost while preserving correctness, only two of the three StatefulSets upload blocks to object storage. This reduces redundant upload traffic and cloud storage costs while maintaining data durability and consistency guarantees through replication and quorum semantics." Canonical instance of at-least-once uploads — the insight being that full triple-upload is over-durable given that two uploads already give at-least-once delivery, and Thanos compaction deduplicates identical blocks.
- A purpose-built control plane replacing generic Kubernetes automation. "At our global scale, manual operations, best-effort Kubernetes automation, or vanilla Thanos behaviors are insufficient. Every release, scale event, or host failure must be handled safely, automatically, and with minimal human intervention, while preserving quorum and data availability." Three named controllers: Rollout Operator (coordinates releases across three isolated Receive StatefulSets, ensuring at most one replica is unavailable at any time), Hashring Controller (manages which Receive endpoints are visible to the router — "Only healthy, fully ready pods are added to the hashring, and removals are staged during scale-down or maintenance" — decoupling traffic management from pod lifecycle), and an Autoscaling and Self-Healing Controller ("Scales clusters based on Pantheon-specific ingestion and resource pressure rather than generic Kubernetes signals. A built-in healer system continuously detects and remediates common failure modes — such as bad hosts, overloaded pods, or a corrupted WAL — allowing the system to self-recover without operator intervention. At our scale, these automations kick in dozens of times per week."). Canonical instance of patterns/purpose-built-control-plane-for-stateful-tsdb — generic K8s HPA/StatefulSet primitives are insufficient for stateful workloads at TSDB altitude because the control loop must preserve quorum invariants and remediate data-specific failure modes (WAL corruption). See the healer-loop sketch following this list.
- Cardinality is the primary TSDB scaling factor, and serverless growth made it worse. "A metric's cardinality is the number of unique combinations of its labels. If the number of pods you are monitoring increases 10x, so does the cardinality of any metric with a pod ID label. Cardinality is the primary scaling factor for a TSDB ... As more workloads move over to serverless, the infra we're monitoring becomes higher-churn, and the lifetime of these identifier labels keeps getting shorter." Links the structural cardinality-scaling curve to the business trend of serverless adoption ("our serverless compute platform launches tens of millions of VMs daily"). Canonical instance of concepts/metric-cardinality as the first-order scaling constraint + concepts/serverless-workload-churn-cardinality as the amplifier. See the worked cardinality example following this list.
- Aggregation shield: dropping expensive labels during ingestion to 'bend the curve'. "An automated aggregation strategy for metrics has allowed us to 'bend the curve' of cardinality growth, ensuring the monitoring infra doesn't need to scale faster than the rest of Databricks." Mechanism: Telegraf (InfluxData OSS) + Databricks' Dicer auto-sharder, scaled to >1 GB/s in the largest region across thousands of aggregation rules. Canonical instance of patterns/aggregation-shield-for-tsdb-cardinality — the aggregation tier absorbs cardinality growth so the TSDB doesn't have to. See the drop-label aggregation sketch following this list.
- Sticky routing over re-routing: the state-preservation mechanism for stateful aggregators. "These problems are often solved by using a messaging system like Kafka for partitioning assignments and maintaining previous data; this is costly at our scale and adds ingestion delay that impacts real-time usecases. The alternative approach is to store in-memory state in aggregators and reroute metrics between aggregators to honor assignment. However, this leads to data loss when an aggregator is redeployed; in an initial version of our aggregation infrastructure, this behavior made aggregated metrics almost unintelligible to our users. To make this work seamlessly, we instead developed our own aggregation system using Telegraf and Databricks' 'auto-sharder' service Dicer. This architecture uses intelligent sticky routing instead of rerouting metrics across aggregators, which addressed the redeployment failure modes." Canonical instance of patterns/sticky-routing-for-aggregator-state — the key design choice: keep the same metric series routed to the same aggregator across redeployments, trading Kafka's explicit durability against a cheaper, lower-latency in-memory model that relies on Dicer's eventually-consistent assignment for correctness. See the sticky-routing sketch following this list.
- Telegraf absorbed a 2-5× surge — Pantheon only saw 20%. "For example, a recent Databricks infrastructure incident resulted in a 2-5x surge in metrics load across various regions. Telegraf absorbed most of this load, and Pantheon only saw a 20% surge, allowing engineers across the company to run debugging and alerting queries without any impact." Canonical production validation datum for patterns/aggregation-shield-for-tsdb-cardinality — the aggregation tier is load-bearing during exactly the moments (incident-driven metric surges) when the TSDB is under reliability pressure.
- The aggregation cost: high-cardinality debugging dimensions are lost. "Our aggregation infrastructure allows us to shield Pantheon from exponential cardinality growth, but this comes at a cost — it removes the exact dimensions engineers need during incidents ... Aggregated metrics tell you: Region-level CPU usage is elevated, Service-level latency is spiking. But they don't tell you: Which tenant is causing swap pressure, Which node crashed, Which shard is isolated, Which workload is noisy." Canonical articulation of the aggregation-vs-cardinality tradeoff that motivates Hydra.
- Hydra: a lakehouse-native platform for raw troubleshooting data. "Using the best of Databricks' capabilities, we developed a new platform for raw troubleshooting data called Hydra, which has made high-cardinality debugging practical at massive scale. Hydra ingests 20 billion unaggregated, active timeseries from millions of nodes worldwide, while achieving 5 minutes end-to-end data freshness and 50x cheaper data storage than Thanos." Three enabling primitives disclosed: (1) Apache Spark Structured Streaming for "continuous ingestion jobs that incrementally process metric data as it arrives, writing it in Delta Lake" with "exactly-once semantics for reliable ingestion"; (2) Databricks Auto Loader as the high-throughput Structured Streaming source that "tracks and incrementally processes new files without requiring manual listing or state management" — scaling to "near-real-time arrival patterns" on millions of object storage files; (3) per-region partitioned ingestion deploying "independent streaming jobs across geographies [which] enables each pipeline to autoscale independently, minimizes cross-region latency, and reduces blast radius in case of failures." Canonical instance of concepts/lakehouse-native-observability and patterns/dual-tier-observability-tsdb-plus-lakehouse. See the PySpark ingestion sketch following this list.
- PromQL-over-SQL: Grafana integration without changing user workflows. "Hydra integrates directly with Grafana by enabling PromQL queries to run against data stored in Databricks. We built a PromQL-to-SQL conversion layer that translates PromQL expressions into SQL queries executed on Delta tables in the Lakehouse. This approach allows engineers to continue using familiar PromQL syntax and dashboards without modification. At the same time, the underlying queries are executed against large-scale Delta tables rather than an in-memory TSDB." Canonical instance of patterns/promql-to-sql-over-delta-tables + concepts/promql-to-sql-translation — the interface-stability property lets observability data be stored in a fundamentally different substrate (columnar object-storage lakehouse) while preserving the user-facing query language. See the translation sketch following this list.
- Unified metric semantics across TSDB and lakehouse paths. "A key design principle of Hydra is that engineers should not need to understand our ingestion architecture. Whether a metric is accessed through the TSDB-backed aggregated path, or the Lakehouse-backed raw metric path, the interface remains consistent. Metric names, label semantics, and metadata dimensions are unified across environments. Service teams emit metrics once using a standardized interface. The platform handles aggregation, raw preservation, ingestion, storage, and query routing." Canonical instance of concepts/unified-metric-semantics — the emit-once-route-many posture makes the two-substrate decision invisible to the user.
- Critical User Journeys (CUJs) as the design primitive for the observability interface. "Building Hydra was not just an infrastructure challenge; it was an interface design challenge. From the beginning, we designed Hydra around Critical User Journeys (CUJs) for our engineers rather than around storage layers or ingestion pipelines. Our goal was simple: engineers should be able to work with high-cardinality metrics using the same interfaces they already rely on." Canonical instance of CUJs applied to the observability-platform design altitude.
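The sketches below are illustrative only; the post discloses behaviors, not code, so every identifier, schema, and algorithm in them is an assumption unless quoted above. First, a minimal Python sketch of the self-healing control loop, with stubbed detectors and remediations for the three failure modes the post names (bad hosts, overloaded pods, corrupted WAL):

```python
from dataclasses import dataclass

@dataclass
class ReceivePod:
    name: str
    host_healthy: bool = True
    ingest_pressure: float = 0.0
    pressure_limit: float = 1.0
    wal_corrupted: bool = False

# Stub remediations: the post names these failure modes but not the fixes,
# so the actions below are illustrative placeholders.
def replace_host(pod):
    print(f"{pod.name}: draining and replacing bad host")

def shed_load(pod):
    print(f"{pod.name}: moving series off overloaded pod")

def repair_wal(pod):
    print(f"{pod.name}: truncating corrupted WAL and restarting")

def detect(pod):
    """Yield a remediation for each failure mode observed on the pod."""
    if not pod.host_healthy:
        yield replace_host
    if pod.ingest_pressure > pod.pressure_limit:
        yield shed_load
    if pod.wal_corrupted:
        yield repair_wal

def healer_pass(pods):
    """One reconciliation pass; in production this would run continuously."""
    for pod in pods:
        for remediate in detect(pod):
            remediate(pod)

healer_pass([ReceivePod("receive-0", wal_corrupted=True),
             ReceivePod("receive-1", ingest_pressure=2.0)])
```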
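Second, a worked example of metric cardinality as the count of unique label-value combinations, showing why a 10× increase in pods multiplies the series count of any metric carrying a pod-ID label (the label sets are invented):

```python
from itertools import product

# Cardinality = number of unique label-value combinations of a metric.
labels = {
    "region":  ["us-east-1", "eu-west-1"],
    "service": ["api", "worker"],
}
base = len(list(product(*labels.values())))       # 2 regions * 2 services = 4

# Add a pod_id label across 10 pods: cardinality multiplies by 10,
# which is why "if pods increase 10x, so does the cardinality".
labels["pod_id"] = [f"pod-{i}" for i in range(10)]
scaled = len(list(product(*labels.values())))     # 4 * 10 = 40

print(base, scaled)  # 4 40
```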
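Third, a sketch of the aggregation-shield mechanism: drop expensive labels at ingestion and combine the surviving series. The actual Telegraf rule format and aggregation functions are not disclosed; this assumes a simple drop-and-sum rule:

```python
from collections import defaultdict

def aggregate(samples, drop_labels):
    """Sum samples after removing expensive labels, so the TSDB stores one
    series per surviving label combination instead of one per pod."""
    out = defaultdict(float)
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items()
                            if k not in drop_labels))
        out[kept] += value
    return dict(out)

# 10,000 per-pod input series collapse to a single stored series
# once the pod_id label is dropped.
samples = [({"region": "us-east-1", "service": "api",
             "pod_id": f"pod-{i}"}, 1.0) for i in range(10_000)]
print(aggregate(samples, drop_labels={"pod_id"}))
# {(('region', 'us-east-1'), ('service', 'api')): 10000.0}
```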
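Fourth, a sticky-routing sketch. Dicer's assignment algorithm is not disclosed; rendezvous hashing is used here as a stand-in with the same sticky property, namely that a series keeps its aggregator across pool changes unless its own aggregator departs:

```python
import hashlib

def route(series_key: str, aggregators: list[str]) -> str:
    """Rendezvous (highest-random-weight) hashing: each series is pinned to
    one aggregator; when the pool changes, only series whose winning node
    departed (or whose new winner joined) are reassigned."""
    def score(agg: str) -> int:
        return int(hashlib.sha256(f"{agg}|{series_key}".encode()).hexdigest(), 16)
    return max(aggregators, key=score)

pool = ["agg-0", "agg-1", "agg-2"]
key = 'cpu_usage{region="us-east-1",service="api",pod_id="pod-42"}'
print(route(key, pool))              # stable across repeated calls
print(route(key, pool + ["agg-3"]))  # most series keep their assignment
```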
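Fifth, a PySpark sketch of a Hydra-style ingestion job: Auto Loader (the cloudFiles source) incrementally discovers new metric files on object storage, and a checkpointed Delta sink provides the exactly-once semantics the post cites. All paths, formats, and table names are invented; this is not Databricks' actual pipeline:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hydra-style-ingest").getOrCreate()

# Auto Loader tracks and incrementally processes new files without manual
# listing; schemaLocation persists the inferred schema across restarts.
raw = (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.schemaLocation", "/tmp/hydra/schema")
            .load("s3://metrics-landing/us-east-1/"))

# Checkpointed Delta sink gives exactly-once ingestion; running one such
# job per region lets each pipeline autoscale independently and bounds the
# blast radius of a failure.
(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/hydra/checkpoints/us-east-1")
    .trigger(processingTime="1 minute")
    .toTable("observability.raw_metrics_us_east_1"))
```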
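Finally, a toy PromQL-to-SQL translation covering only a bare instant selector. The production conversion layer must handle full PromQL (rate(), range vectors, histogram quantiles, and so on), and the (name, labels, ts, value) Delta schema here is assumed:

```python
import re

def promql_selector_to_sql(expr: str, table: str) -> str:
    """Translate a PromQL instant selector like metric{k="v",...} into SQL
    over a table with columns (name STRING, labels MAP, ts, value)."""
    m = re.fullmatch(r'(\w+)\{([^}]*)\}', expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr}")
    conds = [f"name = '{m.group(1)}'"]
    for k, v in re.findall(r'(\w+)="([^"]*)"', m.group(2)):
        conds.append(f"labels['{k}'] = '{v}'")
    return (f"SELECT ts, labels, value FROM {table}\n"
            f"WHERE {' AND '.join(conds)}")

print(promql_selector_to_sql(
    'container_cpu_usage{region="us-east-1",pod_id="pod-42"}',
    "observability.raw_metrics"))
```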
Architectural numbers¶
- 5 billion active in-memory timeseries across Pantheon fleet (up from ~1.6B a year ago — "more than tripled").
- >10 trillion samples ingested per day.
- ~70 cloud regions across 3 major clouds (AWS / Azure / GCP).
- 160+ Pantheon instances (Thanos fork) across all regions.
- 300 million in-memory timeseries in the largest Pantheon instance.
- ~1,000 PromQL QPS on the largest instance.
- 3-node deployments at the small end (wide deployment variety).
- 2h memory retention for long-lived Receive group, 30m for ephemeral serverless Receive group.
- 3 isolated StatefulSets per Receive group — preserves 3-way replication with stronger operational isolation.
- 2 of 3 StatefulSets upload blocks to object storage (at-least-once uploads).
- >1 GB/s aggregation throughput in largest region (Telegraf + Dicer).
- Thousands of aggregation rules across the pipeline.
- 2-5× metric surge from a Databricks infra incident absorbed by Telegraf; Pantheon only saw 20% surge.
- 20 billion unaggregated active timeseries in Hydra.
- ~5 minutes end-to-end data freshness in Hydra.
- ~50× cheaper storage than Thanos for the unaggregated tier.
- Dozens of self-healing automation firings per week in the Pantheon control plane.
- Serverless compute platform launches tens of millions of VMs daily — the upstream cardinality-growth driver.
- Migration outcomes: "millions of dollars in annual cloud costs" saved, "~5× reduction in monitoring infrastructure downtime."
Systems and concepts extracted¶
- systems/pantheon — new canonical wiki page. Databricks' fork of the open-source Thanos TSDB, scaled to 160+ instances across ~70 regions; two Receive groups on distinct memory-retention tiers; three isolated StatefulSets per group; purpose-built 3-controller control plane; at-least-once block uploads; upstream Thanos contributions.
- systems/hydra — new canonical wiki page. Databricks' lakehouse-native observability platform for raw high-cardinality troubleshooting data; Spark Structured Streaming + Auto Loader ingest; per-region autoscaling independent streaming jobs; PromQL-to-SQL over Delta tables; ~5min freshness + 50× cheaper storage vs Thanos.
- systems/thanos — new canonical wiki page. The CNCF TSDB project Pantheon forks; tiered memory/disk/object-storage architecture; Receive / Querier / Store / Compactor components; multitenancy model with router-layer tenant attribution.
- systems/telegraf — new canonical wiki page. InfluxData's OSS agent, deployed at Databricks as the metric-aggregation tier that shields Pantheon from cardinality growth; scaled with systems/dicer for intelligent sticky routing.
- systems/dicer — extended with a new canonical production use case: aggregator sharding for metric ingestion, with sticky routing as the state-preservation mechanism.
- systems/delta-lake — extended with a new canonical production use case: raw-metric storage for observability, exactly-once Structured Streaming ingestion, columnar scan for PromQL-translated queries.
- systems/apache-spark — extended with a new canonical production use case: Structured Streaming as observability ingestion substrate.
- systems/grafana — extended with a new canonical production use case: PromQL queries over a SQL-backed lakehouse via a translation layer.
- systems/prometheus — extended with a new canonical cross-reference: Pantheon as a Thanos-based TSDB designed to surpass single-Prometheus scaling limits.
- systems/databricks-auto-loader — new canonical wiki page. Databricks' auto-file-listing Structured Streaming source; canonical use case for observability ingestion from millions of object-storage files.
- concepts/metric-cardinality — new. The primary TSDB scaling factor: unique-combinations-of-label-values determines memory + compute cost more than raw sample rate.
- concepts/tsdb-scaling-bottleneck — new. Scaling a TSDB (not an infrequent one-off in typical deployments) is the #1 reliability problem at Databricks scale, where growth demands daily scale-ups.
- concepts/serverless-workload-churn-cardinality — new. Serverless workloads (tens of millions of VMs daily) drive cardinality-growth amplification because pod/node-ID labels have vanishingly short lifetimes.
- concepts/tiered-storage-hot-warm-cold — new canonical home. Three-tier memory / disk / object-storage architecture that decouples compute from storage in stateful databases.
- concepts/at-least-once-uploads-for-cost-reduction — new. Running fewer-than-N uploader replicas for N-way replicated data, relying on deduplication / quorum semantics for correctness.
- concepts/sticky-routing — new. Routing mechanism where the same logical entity lands on the same stateful node across routing-map updates, trading explicit durability for lower-latency in-memory state.
- concepts/metric-aggregation-as-cardinality-shield — new. Pre-storage aggregation drops expensive labels, decoupling TSDB growth rate from underlying infrastructure growth rate.
- concepts/lakehouse-native-observability — new. Observability data stored as governed tables in a columnar object-storage lakehouse, joinable with enterprise datasets under the same access controls as business data.
- concepts/promql-to-sql-translation — new. Translation layer that preserves PromQL as user-facing interface while executing queries against SQL-queryable columnar tables.
- concepts/unified-metric-semantics — new. Emit-once-route-many posture: metric names / label semantics are consistent across aggregated and raw storage paths, making ingestion topology invisible to the user.
- concepts/critical-user-journey — new. User-workflow-centric design primitive for observability platforms; interface stability (Grafana + PromQL unchanged) is the load-bearing constraint.
- patterns/thanos-receive-groups-with-memory-retention-tiers — new pattern. Two Thanos Receive groups with different memory-retention windows matched to workload lifespans; short window for ephemeral workloads cuts memory cost.
- patterns/purpose-built-control-plane-for-stateful-tsdb — new pattern. Dedicated 3-controller control plane (Rollout Operator / Hashring Controller / Autoscaling + Self-Healing Controller) replacing generic K8s automation; automations fire dozens of times per week at Databricks scale.
- patterns/aggregation-shield-for-tsdb-cardinality — new pattern. Place an aggregation tier in front of the TSDB to absorb cardinality growth and incident-driven surges; proved out by Pantheon seeing only 20% of a 2-5× aggregate surge.
- patterns/sticky-routing-for-aggregator-state — new pattern. Use an auto-sharder (Dicer) to keep routing stable across redeployments so in-memory aggregator state survives; alternative to Kafka-backed partitioning which adds cost and latency.
- patterns/dual-tier-observability-tsdb-plus-lakehouse — new pattern. Run a TSDB for real-time/aggregated queries and a lakehouse for raw high-cardinality queries simultaneously, unified at the metric-semantics layer so users don't distinguish them.
- patterns/promql-to-sql-over-delta-tables — new pattern. Translate PromQL into SQL against a columnar lakehouse store, preserving the user-facing query language while fundamentally changing the storage substrate.
Caveats¶
- No disclosure of Pantheon's upstream contribution volume or specific edge-case PRs. The post claims "we often uncover Thanos edge cases and performance optimizations and contribute these back to the open-source community" without enumerating specific contributions or linking to upstream PRs.
- No disclosure of the aggregation-rule authoring workflow. "Thousands of aggregation rules" are named, but nothing is said about who authors them (central platform team? service teams self-serve?), how conflicts are resolved, or how rule drift is prevented.
- No accuracy / correctness-loss metrics for aggregation. Aggregation rules drop labels, which changes the semantics of the metric. The post acknowledges this conceptually ("it removes the exact dimensions engineers need during incidents") and motivates Hydra to recover them — but does not disclose what fraction of label combinations are preserved vs. collapsed, or whether any specific cardinality budget is targeted per service.
- No PromQL-to-SQL fidelity / correctness disclosure. The translation layer is claimed to be transparent, but incompatibilities (PromQL range-vector semantics, rate() behavior on partial data, histogram quantile arithmetic) are not discussed. What happens when a PromQL construct has no clean SQL equivalent is not stated.
- No Hydra query-latency disclosure. Freshness (~5 min) is named but p50/p99 query latency for PromQL-translated queries over Delta tables is not; the post acknowledges "improving the performance of Hydra so it achieves similar data freshness to Pantheon" as future work, suggesting current query latency is materially worse than Pantheon's.
- No blast-radius disclosure for the aggregation tier itself. Telegraf + Dicer is named as the cardinality shield, but its own failure modes (Telegraf pod failure behavior, Dicer assignment-update latency, how the system degrades if aggregation lags) are not discussed.
- No disclosure of Pantheon + Hydra cost breakdown at the per-region or per-cluster level. "Millions of dollars in annual cloud costs" saved is aggregate; per-region economics (where is object-storage cheaper? where are compute savings largest?) are not broken out.
- No disclosure of control-plane-failure semantics. What happens if the Rollout Operator crashes mid-release? What if the Hashring Controller misroutes traffic? What's the blast radius of a self-healing-controller bug? All named as safety-critical components without disclosed failure-mode analysis.
- Single-author post, no incident retrospective. The "2-5× surge absorbed" datum is referenced but no specific incident is walked through.
- No disclosure of Hydra's compaction/retention strategy. 20B unaggregated timeseries stored in Delta Lake implies a retention bound, but the exact policy (days? weeks?) and the Delta compaction cadence are not stated.
Source¶
- Original: https://www.databricks.com/blog/10-trillion-samples-day-scaling-beyond-traditional-monitoring-infra-databricks
- Raw markdown: raw/databricks/2026-05-05-10-trillion-samples-a-day-scaling-beyond-traditional-monitor-090c7454.md
Related¶
- companies/databricks
- systems/pantheon
- systems/hydra
- systems/thanos
- systems/telegraf
- systems/dicer
- systems/databricks
- systems/delta-lake
- systems/apache-spark
- systems/grafana
- systems/prometheus
- systems/databricks-auto-loader
- concepts/metric-cardinality
- concepts/tsdb-scaling-bottleneck
- concepts/serverless-workload-churn-cardinality
- concepts/metric-aggregation-as-cardinality-shield
- concepts/tiered-storage-hot-warm-cold
- concepts/at-least-once-uploads-for-cost-reduction
- concepts/sticky-routing
- concepts/lakehouse-native-observability
- concepts/promql-to-sql-translation
- concepts/unified-metric-semantics
- concepts/critical-user-journey
- patterns/thanos-receive-groups-with-memory-retention-tiers
- patterns/purpose-built-control-plane-for-stateful-tsdb
- patterns/aggregation-shield-for-tsdb-cardinality
- patterns/sticky-routing-for-aggregator-state
- patterns/dual-tier-observability-tsdb-plus-lakehouse
- patterns/promql-to-sql-over-delta-tables