PATTERN Cited by 1 source
Purpose-built control plane for stateful TSDB¶
Replace generic Kubernetes automation (HPA, StatefulSet rolling update controller, pod lifecycle hooks) with a dedicated control plane that understands the quorum invariants, data placement, and failure modes specific to a stateful TSDB.
At hyperscale, the generic K8s primitives are insufficient: they make no guarantees about quorum preservation during rolling updates, they can't stage node removals gracefully against a hash ring, and they don't remediate data-specific failure modes (WAL corruption, TSDB memory pressure, index bloat). Operating manually is not viable when remediation events fire dozens of times per week.
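The quorum arithmetic behind that constraint is worth making explicit. A minimal sketch, assuming Thanos Receive's usual write-quorum formula and a replication factor of 3 (the factor is an assumption; it isn't stated in this pattern):

```go
package main

import "fmt"

func main() {
	// Assumed replication factor; not stated in this pattern.
	rf := 3
	// Thanos Receive derives write quorum as floor(rf/2)+1.
	quorum := rf/2 + 1     // 2: a write must land on 2 of 3 replicas
	maxDown := rf - quorum // 1: with 2 replicas unreachable, quorum writes block
	fmt.Println(quorum, maxDown) // 2 1
}
```

Generic K8s rolling updates reason per StatefulSet, so nothing stops two replicas of the same series from being down at once; the controllers below exist to hold that `maxDown` budget globally.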
Components¶
At Pantheon's scale, the control plane splits into three dedicated controllers:
Rollout Operator¶
Coordinates releases and scaling across multiple isolated Receive StatefulSets (see patterns/thanos-receive-groups-with-memory-retention-tiers for why there are three per group). Guarantees that at most one replica is unavailable at any time, so quorum writes never block.
Enables parallel StatefulSet updates: the three StatefulSets roll out as a single coordinated, interleaved sequence (at most one replica down across the cluster at any step) rather than as three independent rollouts, which dramatically cuts release time without violating quorum.
Replaces K8s' default StatefulSet rolling update, which doesn't know about the quorum invariant and treats each StatefulSet independently.
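A minimal sketch of that coordination, assuming the three-StatefulSet layout above. The helpers (notReadyCount, recreatePod) are hypothetical stand-ins for client-go calls against the K8s API; this is the shape of the loop, not Databricks' actual operator:

```go
package main

import "fmt"

// receiveSet stands in for one of the three Receive StatefulSets; oldPods
// are the pods still running the previous revision.
type receiveSet struct {
	name    string
	oldPods []string
}

// notReadyCount would query readiness across the whole group via the K8s
// API; stubbed here. The invariant is global: at most one pod in the GROUP
// may be unavailable, not one per StatefulSet.
func notReadyCount(group []*receiveSet) int { return 0 }

// recreatePod deletes a pod so the StatefulSet controller brings it back on
// the new revision; stubbed here.
func recreatePod(set *receiveSet, pod string) {
	fmt.Printf("rolling %s/%s\n", set.name, pod)
}

// rollout interleaves the three StatefulSets into one coordinated sequence:
// take one pod from each set in turn, and never touch the next pod while
// any pod in the group is still unavailable.
func rollout(group []*receiveSet) {
	for progressed := true; progressed; {
		progressed = false
		for _, set := range group {
			if len(set.oldPods) == 0 {
				continue
			}
			for notReadyCount(group) > 0 {
				// block until the previously rolled pod rejoins quorum
			}
			recreatePod(set, set.oldPods[0])
			set.oldPods = set.oldPods[1:]
			progressed = true
		}
	}
}

func main() {
	rollout([]*receiveSet{
		{name: "receive-a", oldPods: []string{"receive-a-0", "receive-a-1"}},
		{name: "receive-b", oldPods: []string{"receive-b-0", "receive-b-1"}},
		{name: "receive-c", oldPods: []string{"receive-c-0", "receive-c-1"}},
	})
}
```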
Hashring Controller¶
Manages which Receive endpoints are visible to the router (i.e., included in the hash ring that distributes incoming writes). Two guarantees:
- Only healthy, fully ready pods are added — pod readiness as seen by K8s is necessary but not sufficient; the Hashring Controller applies additional TSDB-specific readiness checks (WAL replay complete, memory index warm) before inclusion.
- Removals are staged — during scale-down or maintenance, a pod is first drained (writes stop going to it, reads are completed) before K8s terminates it. Avoids sample loss from in-flight writes hitting a terminated pod.
Replaces the K8s Service / Endpoints controller's default behaviour, which treats readiness as binary and doesn't understand drain semantics.
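A minimal sketch of both guarantees, with hypothetical field names (walReplayed, indexWarm, draining) standing in for the TSDB-specific probes and drain state; a real controller would publish the member list to the router's hashring configuration:

```go
package main

import "fmt"

type pod struct {
	name        string
	k8sReady    bool // kubelet readiness: necessary but not sufficient
	walReplayed bool // WAL replay finished
	indexWarm   bool // in-memory index populated
	draining    bool // staged for removal: no new writes routed here
}

// hashringMembers returns the endpoints the router may send writes to.
func hashringMembers(pods []pod) []string {
	var members []string
	for _, p := range pods {
		// Inclusion requires K8s readiness AND the TSDB-specific checks,
		// and the pod must not be mid-drain.
		if p.k8sReady && p.walReplayed && p.indexWarm && !p.draining {
			members = append(members, p.name)
		}
	}
	return members
}

// drain stages a pod out: it leaves the ring first, and only after in-flight
// writes and reads settle is K8s allowed to terminate it (e.g. by removing
// a finalizer).
func drain(p *pod) {
	p.draining = true
	// ...wait for in-flight traffic to complete, then release for termination.
}

func main() {
	pods := []pod{
		{name: "receive-0", k8sReady: true, walReplayed: true, indexWarm: true},
		{name: "receive-1", k8sReady: true, walReplayed: false}, // still replaying WAL
	}
	fmt.Println(hashringMembers(pods)) // [receive-0]
}
```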
Autoscaling + Self-Healing Controller¶
Scales clusters based on TSDB-specific pressure signals (ingestion rate, active-series count, memory pressure on the TSDB process, query queue depth) rather than generic CPU / memory utilization. Replaces K8s HPA's metric-server-based autoscaling, which can't express "ingestion is saturating but CPU is low because we're waiting on I/O."
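A minimal sketch of sizing on those pressure signals; every threshold and per-pod capacity figure here is an illustrative assumption, not a value from the source:

```go
package main

import "fmt"

// pressure carries the TSDB-specific signals, none of which is CPU.
type pressure struct {
	samplesPerSec   float64 // ingestion rate
	activeSeries    float64
	memUtilization  float64 // TSDB process memory, not node memory
	queryQueueDepth int
}

// desiredReplicas sizes by whichever data-plane signal demands the most
// capacity, so an I/O-bound, CPU-idle cluster still scales out.
func desiredReplicas(current int, p pressure, perPodSeries, perPodSamples float64) int {
	n := max(current,
		int(p.activeSeries/perPodSeries)+1,
		int(p.samplesPerSec/perPodSamples)+1)
	if p.memUtilization > 0.8 || p.queryQueueDepth > 100 {
		n++ // relieve pressure that per-pod capacity math alone won't catch
	}
	return n
}

func main() {
	p := pressure{samplesPerSec: 5e6, activeSeries: 90e6, memUtilization: 0.85}
	// 11: series count and memory pressure dominate, even with CPU idle.
	fmt.Println(desiredReplicas(8, p, 10e6, 1e6))
}
```

The design point is the `max` over several saturation estimates: an HPA keyed to one metrics-server signal cannot express "scale because active series are near the per-pod ceiling while CPU is low."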
The self-healing component continuously detects and remediates common failure modes:
- Bad hosts (disk failures, kernel panics, degraded network) — the controller drains and replaces the affected pod before the host's degradation hits ingestion availability.
- Overloaded pods — the controller triggers a targeted scale-up or load rebalance rather than waiting for the pod to fail.
- Corrupted WAL — detected via TSDB-specific liveness probes, remediated by terminating the pod, restoring from replicas, and replaying through the hashring.
At Databricks scale, these remediations fire dozens of times per week — the control plane does the work a human on-call would otherwise have to.
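A minimal sketch of the detect-and-remediate dispatch, with failure classes mirroring the list above; the remediation bodies are placeholders, not Databricks' actual procedures:

```go
package main

import "fmt"

type failure int

const (
	badHost       failure = iota // disk failure, kernel panic, degraded NIC
	overloadedPod                // pressure signals saturating
	corruptedWAL                 // caught by TSDB-specific liveness probes
)

// remediate maps each detected failure class to an automated response, the
// work an on-call human would otherwise perform dozens of times per week.
func remediate(f failure, pod string) {
	switch f {
	case badHost:
		fmt.Println("drain", pod, "and reschedule onto a healthy host")
	case overloadedPod:
		fmt.Println("targeted scale-up / rebalance load away from", pod)
	case corruptedWAL:
		fmt.Println("terminate", pod, "; restore from replicas via hashring replay")
	}
}

func main() {
	// In the real control plane this is driven by probes and host-health
	// feeds rather than a hardcoded call.
	remediate(corruptedWAL, "receive-a-3")
}
```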
When to reach for this pattern¶
- Stateful database with quorum-based writes where rolling updates must preserve quorum.
- Operating at a scale where manual operations don't fit in business hours — remediation events in the tens per week or more.
- Data-specific failure modes that generic K8s can't detect or remediate.
- A router / proxy tier whose endpoint list needs to be managed distinctly from pod readiness.
When generic K8s is enough¶
- Small or mid-scale deployments (single-digit pods, remediation events at monthly or quarterly cadence).
- Stateless workloads, or stateful workloads without quorum invariants.
- Workloads whose failure modes are detectable by standard liveness / readiness probes.
Seen in¶
- sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — canonical instance. "At our global scale, manual operations, best-effort Kubernetes automation, or vanilla Thanos behaviors are insufficient. Every release, scale event, or host failure must be handled safely, automatically, and with minimal human intervention, while preserving quorum and data availability." Pantheon's three-controller design canonicalised, plus the datum "these automations kick in dozens of times per week."