Thanos

Thanos is a CNCF-incubated open-source metrics project that extends systems/prometheus with long-term storage in object storage, global query federation, and horizontal scalability, addressing the limits a single Prometheus server hits at tens of millions of active series.

Thanos is the upstream project that Databricks forks internally as systems/pantheon.

Core shape

  • Receive — ingests remote-write samples from Prometheus servers (or directly from instrumented applications), keeps recent data in memory, and flushes older blocks to disk and eventually to object storage. Deployed as Receive groups (Kubernetes StatefulSets), with hash-ring-based partitioning across group members for load distribution.
  • Querier — federates PromQL queries across Receive nodes, Store gateways, and local Prometheus replicas. Deduplicates overlapping samples from replicated writes.
  • Store — serves historical blocks out of object storage, letting Querier answer queries over data older than local disk retention.
  • Compactor — downsamples historical blocks (to 5m and 1h resolutions) and compacts overlapping blocks in object storage, keeping long-range queries fast.
  • Ruler — evaluates recording / alerting rules against the Thanos data model.
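The hash-ring partitioning in the Receive bullet can be sketched as below. This is a simplified hashmod-style placement, not Thanos' actual hashring implementation (which supports multiple algorithms, including ketama); the endpoint names and replica handling are illustrative.

```python
import hashlib

def pick_receivers(tenant: str, series_labels: dict, endpoints: list, replicas: int = 1) -> list:
    """Map a (tenant, series) pair onto `replicas` members of a Receive group.

    Simplified sketch: hash the tenant plus the sorted label set, then take
    consecutive endpoints starting at hash % N, so replicated writes land on
    distinct group members.
    """
    key = tenant + "".join(f"{k}={v}" for k, v in sorted(series_labels.items()))
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    start = h % len(endpoints)
    return [endpoints[(start + i) % len(endpoints)] for i in range(replicas)]

endpoints = ["receive-0:10901", "receive-1:10901", "receive-2:10901"]  # hypothetical names
owners = pick_receivers("tenant-a", {"__name__": "up", "job": "api"}, endpoints, replicas=2)
```

Because placement depends only on the hash of the series identity, every router fans a given series out to the same group members without coordination.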

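The Querier's deduplication of replicated writes can be illustrated with a naive merge. This is a sketch only: Thanos' real deduplication iterator uses a penalty algorithm that prefers staying on one replica across gaps, rather than a per-timestamp union.

```python
def dedup(replica_streams):
    """Merge samples from replicas of the same series.

    Naive sketch: union all (timestamp, value) pairs across replicas and
    keep the first value seen per timestamp, so gaps in one replica are
    filled by the other.
    """
    merged = {}
    for stream in replica_streams:
        for ts, val in stream:
            merged.setdefault(ts, val)
    return sorted(merged.items())

r0 = [(1000, 1.0), (2000, 1.0), (3000, 1.0)]
r1 = [(1000, 1.0), (2000, 1.0), (4000, 1.0)]  # replica 1 missed t=3000, has t=4000
merged = dedup([r0, r1])  # one gap-free series from two partial replicas
```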
Tiered storage — the foundational scaling primitive

Thanos' key architectural move over a single Prometheus is its three-tier storage:

  • Memory — most recent samples (hours), served at Prometheus-comparable latency.
  • On-disk — last 24h of blocks, on Receive nodes.
  • Object storage — all older data (S3 / GCS / Azure Blob), served via Store. Decouples compute from storage: a cluster can scale compute up without needing to rebalance historical data.
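The tiering above implies a simple routing rule: a query's time range determines which tiers must be consulted. A minimal sketch, with window sizes as illustrative defaults rather than Thanos constants:

```python
import time

HOURS = 3600

def tiers_for_range(start, end, now, memory_window=2 * HOURS, disk_window=24 * HOURS):
    """Return which storage tiers a query over [start, end] must touch.

    Window sizes are hypothetical; real retention is configurable.
    """
    tiers = []
    if end > now - memory_window:
        tiers.append("memory")          # recent head samples on Receive
    if end > now - disk_window and start < now - memory_window:
        tiers.append("disk")            # local TSDB blocks on Receive
    if start < now - disk_window:
        tiers.append("object-storage")  # historical blocks via Store
    return tiers

now = time.time()
recent = tiers_for_range(now - 600, now, now)            # last 10 minutes
longspan = tiers_for_range(now - 72 * HOURS, now, now)   # last 3 days
```

The point of the decoupling is visible here: only the "object-storage" arm grows with history, and it holds no per-node state that would need rebalancing when compute scales.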

See concepts/tiered-storage-hot-warm-cold for the generalised pattern.

Multitenancy

Thanos supports multitenancy via tenant attribution at the router: write requests carry a tenant header, and the router fans out to the matching Receive group. Each tenant's series are logically isolated even when groups are shared at the cluster level.
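The routing step can be sketched as a header lookup followed by a hashring selection. THANOS-TENANT is Thanos' default tenant header name; the tenant-to-hashring mapping and endpoint names here are hypothetical.

```python
def route_write(headers, hashrings, default_tenant="default-tenant"):
    """Pick the Receive group (hashring) for a remote-write request.

    Sketch: read the tenant header (falling back to a default tenant),
    then select an exact-match hashring or the catch-all "*" ring.
    """
    tenant = headers.get("THANOS-TENANT", default_tenant)
    group = hashrings.get(tenant, hashrings["*"])
    return tenant, group

hashrings = {
    "team-infra": ["infra-receive-0:10901", "infra-receive-1:10901"],   # hypothetical
    "*": ["shared-receive-0:10901", "shared-receive-1:10901"],          # catch-all ring
}
tenant, group = route_write({"THANOS-TENANT": "team-infra"}, hashrings)
```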

Edge cases / scaling realities

Thanos scales from small three-node deployments to fleets of hundreds of instances — but at the high end, operators report needing to:

  • Replace the default ("one large hash ring") Receive topology with multiple isolated StatefulSets for operational isolation.
  • Add memory-retention tiering beyond the default single retention window, to keep ephemeral-workload metrics from dominating memory cost.
  • Layer custom control-plane automation on top of vanilla Kubernetes primitives — generic HPA / StatefulSet rolling updates are insufficient for quorum-preserving rollouts. See patterns/purpose-built-control-plane-for-stateful-tsdb.
  • Layer a pre-storage aggregation tier to absorb cardinality growth — see patterns/aggregation-shield-for-tsdb-cardinality.
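The last adaptation, a pre-storage aggregation tier, can be sketched as dropping ephemeral labels and summing before remote write. The label names and aggregation rule are illustrative, not a documented Thanos feature:

```python
from collections import defaultdict

def aggregate(samples, drop_labels=("pod", "instance")):
    """Pre-storage aggregation: collapse ephemeral labels before write.

    `samples` is a list of (labels_dict, value) pairs. Dropping per-pod
    labels and summing turns N per-pod series into one aggregate series,
    absorbing the cardinality that churning workloads would otherwise
    push into the TSDB.
    """
    out = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        out[key] += value
    return dict(out)

samples = [
    ({"__name__": "http_requests_total", "job": "api", "pod": "api-1"}, 3.0),
    ({"__name__": "http_requests_total", "job": "api", "pod": "api-2"}, 5.0),
]
agg = aggregate(samples)  # two per-pod series collapse into one
```

The trade-off is loss of per-pod drill-down for the aggregated metrics, exchanged for bounded series growth under pod churn.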

Databricks' Pantheon (systems/pantheon) is a canonical documented instance of all four of these adaptations layered on top of an upstream Thanos fork.
