Pantheon (Databricks Thanos fork)

Pantheon is Databricks' internal fork of the CNCF systems/thanos project, deployed as the primary TSDB underpinning its global monitoring infrastructure. Scale is the distinguishing feature: 160+ instances across ~70 cloud regions on the 3 major clouds (AWS / Azure / GCP), with ~5 billion active in-memory timeseries and >10 trillion samples ingested per day. The largest single instance holds ~300 million timeseries and sustains ~1,000 PromQL QPS; the smallest deployments are 3-node clusters.

Pantheon is the stateful-database tier of Databricks' monitoring stack. It sits behind Telegraf + Dicer (the cardinality shield; see patterns/aggregation-shield-for-tsdb-cardinality) and feeds Grafana dashboards and alerting rules. For troubleshooting over raw, unaggregated, high-cardinality data it is complemented by systems/hydra (lakehouse-native).

Why a Thanos fork

Databricks' old TSDBs, built for an order of magnitude lower scale, became "the #1 reliability problem for the entire monitoring infrastructure": at Databricks' growth rate, scaling up a TSDB was a daily event rather than a rare one. See concepts/tsdb-scaling-bottleneck. The team chose Thanos for its tiered storage primitive (memory / disk / object storage), which decouples compute from storage: a cluster can scale up without rebalancing historical data across nodes.
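
A minimal sketch of the tiered-read idea behind that choice, with hypothetical Tier and TieredStore types (this is not the actual Thanos API): recent samples come from memory, older ones from local disk blocks, and everything beyond local retention from object storage, so adding or replacing a compute node never requires moving historical data.

    package tiered

    import "time"

    type Sample struct {
        TimestampMs int64
        Value       float64
    }

    // Tier is any store that can answer a range query for one series.
    type Tier interface {
        Query(series string, from, to time.Time) ([]Sample, error)
    }

    // TieredStore layers the three retention tiers; each tier is
    // independent, which is what decouples compute from storage.
    type TieredStore struct {
        Memory      Tier // head block, e.g. the last 2 hours
        LocalDisk   Tier // recently persisted blocks
        ObjectStore Tier // unbounded historical retention
    }

    // Query fans out to every tier and merges the results. Scaling the
    // compute layer only changes where new writes land; blocks already
    // in the object store stay put.
    func (s *TieredStore) Query(series string, from, to time.Time) ([]Sample, error) {
        var out []Sample
        for _, tier := range []Tier{s.Memory, s.LocalDisk, s.ObjectStore} {
            samples, err := tier.Query(series, from, to)
            if err != nil {
                return nil, err
            }
            out = append(out, samples...)
        }
        return out, nil
    }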

Migration outcomes: "millions of dollars in annual cloud costs" saved, "monitoring infrastructure downtime reduced by ~5×," and many sources of manual toil eliminated.

Storage architecture

Two Receive groups on distinct memory-retention tiers (see patterns/thanos-receive-groups-with-memory-retention-tiers; a configuration sketch follows the list):

  • Long-lived group — 2 hours of samples in memory, tuned for metrics from persistent services.
  • Ephemeral group — 30 minutes of samples in memory, tuned for short-lived serverless workloads whose labels churn rapidly (canonical instance of concepts/serverless-workload-churn-cardinality). Matching retention to workload lifespan materially reduces memory footprint and cloud cost while preserving correctness.
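
A minimal sketch of the two-tier retention idea, with all names hypothetical (the post does not describe Pantheon's configuration surface): each group keeps a different window of samples in memory, and series are routed by workload class so short-lived serverless series stop occupying memory soon after they go stale.

    package retention

    import "time"

    type ReceiveGroup struct {
        Name            string
        MemoryRetention time.Duration
    }

    var (
        longLived = ReceiveGroup{Name: "long-lived", MemoryRetention: 2 * time.Hour}
        ephemeral = ReceiveGroup{Name: "ephemeral", MemoryRetention: 30 * time.Minute}
    )

    // groupFor picks the Receive group for a series. The real routing is
    // rule-based on metric names and labels; a single workload_class
    // label is an assumption made here for brevity.
    func groupFor(labels map[string]string) ReceiveGroup {
        if labels["workload_class"] == "serverless" {
            return ephemeral
        }
        return longLived
    }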

Three isolated Kubernetes StatefulSets per Receive group (instead of one large hash ring): this preserves three-way replication with quorum writes but gives the team stronger operational and data isolation. Rolling or restarting an entire StatefulSet in parallel is safe because any given series loses at most one of its three replicas; the other two preserve quorum.
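
A sketch of the quorum write this layout preserves (illustrative, not Thanos source): each series is appended to one replica in each of the three StatefulSets, and the write succeeds once any two ack, which is why disrupting a whole StatefulSet never blocks ingestion.

    package quorum

    import "errors"

    // Replica is one pod holding a copy of a series, one per StatefulSet.
    type Replica interface {
        Append(series string, ts int64, value float64) error
    }

    // writeQuorum appends to all three replicas and succeeds when at
    // least two ack, so one unavailable StatefulSet is tolerated.
    func writeQuorum(replicas [3]Replica, series string, ts int64, v float64) error {
        acks := 0
        for _, r := range replicas {
            if err := r.Append(series, ts, v); err == nil {
                acks++
            }
        }
        if acks >= 2 {
            return nil
        }
        return errors.New("quorum not reached: fewer than 2 of 3 replicas acked")
    }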

At-least-once uploads: only 2 of 3 StatefulSets upload blocks to object storage. This cuts redundant upload traffic and storage costs while preserving durability + consistency via replication and quorum (compaction deduplicates identical blocks). See concepts/at-least-once-uploads-for-cost-reduction.
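
A sketch of the 2-of-3 upload policy (the selection mechanism is an assumption; the post states only the outcome): with three replicas of every block but only two shippers, upload traffic drops by a third while every block still reaches the bucket even if one uploader fails, and compaction deduplicates the remaining overlap.

    package upload

    // shouldShip reports whether the StatefulSet with the given ordinal
    // (0, 1, or 2) runs the shipper that uploads TSDB blocks to object
    // storage. Two uploaders instead of three cuts redundant upload
    // traffic by a third while keeping at-least-once delivery.
    func shouldShip(statefulSetOrdinal int) bool {
        return statefulSetOrdinal < 2
    }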

Multitenancy via router-layer tenant attribution: rule-based inference assigns a tenant from the metric name plus selected labels, so samples within the same write batch can route to different tenants (and therefore different Receive groups) without requiring upstream clients to send tenant headers.
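
A sketch of the rule shape, assuming prefix-plus-label matching (the post says only that tenants are inferred from the metric name and selected labels): each sample is matched against an ordered rule list, so one batch fans out to several tenants with no tenant header from the client.

    package router

    import "strings"

    // Rule maps metrics to a tenant; both match fields are assumptions
    // about what "metric name + selected labels" could look like.
    type Rule struct {
        MetricPrefix string            // e.g. "serverless_"
        MatchLabels  map[string]string // every listed label must match
        Tenant       string
    }

    // tenantFor returns the first matching rule's tenant, else a default.
    func tenantFor(metric string, labels map[string]string, rules []Rule, defaultTenant string) string {
        for _, r := range rules {
            if !strings.HasPrefix(metric, r.MetricPrefix) {
                continue
            }
            matched := true
            for k, v := range r.MatchLabels {
                if labels[k] != v {
                    matched = false
                    break
                }
            }
            if matched {
                return r.Tenant
            }
        }
        return defaultTenant
    }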

Pantheon control plane

At Pantheon's scale, manual operations, best-effort Kubernetes automation, and vanilla Thanos behaviours are insufficient: every release, scale event, or host failure must be handled safely, automatically, and with minimal human intervention, while preserving quorum and data availability. Three dedicated controllers handle this:

  • Rollout Operator — coordinates releases and scaling across three isolated Receive StatefulSets, guaranteeing at most one replica is unavailable at any time. Enables parallel StatefulSet updates without violating quorum.
  • Hashring Controller — manages which Receive endpoints are visible to the router. Only healthy, fully-ready pods are added to the hashring; removals are staged during scale-down or maintenance (readiness gating sketched after this list). Decouples traffic management from pod lifecycle, preventing accidental quorum violations or partial routing during dynamic cluster changes.
  • Autoscaling + Self-Healing Controller — scales clusters based on Pantheon-specific ingestion + resource pressure (not generic K8s CPU/memory signals). A built-in healer detects and remediates common failure modes — bad hosts, overloaded pods, corrupted WAL — allowing the system to self-recover without operator intervention. At Databricks scale, "these automations kick in dozens of times per week."
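
A minimal sketch of that readiness gating, with assumed Pod fields (the post describes behaviour, not code): the controller recomputes hashring membership from pod state, so the router never sends writes to a pod that is starting, unhealthy, or staged for removal.

    package hashring

    // Pod is the controller's view of one Receive pod; the fields are
    // assumptions standing in for Kubernetes health and lifecycle state.
    type Pod struct {
        Endpoint string
        Healthy  bool
        Ready    bool
        Draining bool // staged for removal during scale-down/maintenance
    }

    // visibleEndpoints computes the hashring membership published to the
    // router: only healthy, fully-ready, non-draining pods are routable.
    func visibleEndpoints(pods []Pod) []string {
        var endpoints []string
        for _, p := range pods {
            if p.Healthy && p.Ready && !p.Draining {
                endpoints = append(endpoints, p.Endpoint)
            }
        }
        return endpoints
    }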

This is the canonical instance of patterns/purpose-built-control-plane-for-stateful-tsdb: generic Kubernetes primitives (HPA, StatefulSet rolling updates) cannot express quorum-preserving rollouts or data-specific failure remediations.
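
As one concrete example, a sketch of the invariant the Rollout Operator enforces (logic inferred from "at most one replica is unavailable at any time"; all names hypothetical): a rollout step may disrupt a StatefulSet only while the other two are fully ready, so every series keeps two live replicas and quorum never breaks.

    package rollout

    // StatefulSetStatus is the operator's view of one of the three
    // isolated Receive StatefulSets.
    type StatefulSetStatus struct {
        Name        string
        ReadyPods   int
        DesiredPods int
    }

    // admitUpdate decides whether the next rollout step may disrupt the
    // target StatefulSet. Invariant: at most one of the three may have
    // not-ready pods, so every series retains two of its three replicas.
    func admitUpdate(target string, sets []StatefulSetStatus) bool {
        for _, s := range sets {
            if s.Name == target {
                continue // the target is the one we intend to disrupt
            }
            if s.ReadyPods < s.DesiredPods {
                return false // another StatefulSet is already degraded
            }
        }
        return true
    }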

Operating numbers (2026-05)

  • 5 billion active in-memory timeseries (fleet total).
  • 10 trillion samples ingested per day.
  • 160+ Pantheon instances.
  • ~70 cloud regions across 3 major clouds.
  • 300M timeseries on the largest instance.
  • ~1,000 PromQL QPS on the largest instance.
  • 3-node deployments at the small end.
  • 2h / 30m memory retention on the two Receive groups.
  • 3 isolated StatefulSets per Receive group.
  • 2 of 3 StatefulSets upload to object storage.
  • Dozens of self-healing automations per week.
  • 20% surge on Pantheon from a 2-5× surge in raw input; Telegraf absorbed the rest (canonical validation datum for patterns/aggregation-shield-for-tsdb-cardinality).
  • Growth rate: "more than tripled" in the last year (from ~1.6B to 5B active timeseries).

Upstream contributions

"Because of the breadth, scale, and variety of our deployments, we often uncover Thanos edge cases and performance optimizations and contribute these back to the open-source community." Specific PRs / contributions not enumerated in the post.
