CONCEPT Cited by 1 source
Primary-standby failover
Primary-standby failover is the operational move of promoting a previously passive standby cluster (or node, or region) to the primary role when the current primary becomes unavailable. The granularity of the failover (node, cluster, or region) determines its risk profile and latency.
Definition
Two deployments hold a full copy of the data:
- Primary — receives live traffic, is the authoritative source of writes.
- Standby — kept in sync via replication (e.g. WAL replication — see concepts/wal-replication). Does not serve live writes; may serve read-only offline workflows, backups, or bulk scans.
On primary failure, an operator (or automation) promotes the standby to primary. This requires:
- Stopping the old primary (fencing) to prevent split-brain writes.
- Draining any in-flight replication from the old primary to the new one.
- Updating client configuration / DNS / service discovery so traffic goes to the new primary.
- Optionally, later rebuilding a new standby.
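The promotion steps above can be sketched as an ordered procedure. This is a minimal illustration under stated assumptions, not Pinterest's tooling: the `Cluster` class, its fields, the `failover` function, and the in-memory `discovery` dict (standing in for DNS or service discovery) are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for a cluster, not a real client API."""
    name: str
    accepting_writes: bool = False
    applied: list = field(default_factory=list)   # WAL entries applied locally
    outbound: list = field(default_factory=list)  # WAL entries not yet shipped to the peer


def failover(old_primary: Cluster, standby: Cluster, discovery: dict) -> Cluster:
    """Promote `standby`, following the fence -> drain -> repoint order."""
    # 1. Fence first, so no new writes land on the old primary mid-flip.
    old_primary.accepting_writes = False

    # 2. Drain whatever the old primary had not shipped yet (assumes it is
    #    still reachable; if not, these entries are the data-loss window).
    while old_primary.outbound:
        standby.applied.append(old_primary.outbound.pop(0))

    # 3. Promote, then repoint clients via the (hypothetical) discovery map.
    standby.accepting_writes = True
    discovery["primary"] = standby.name
    return standby


a = Cluster("cluster-a", accepting_writes=True, applied=["w1", "w2"], outbound=["w2"])
b = Cluster("cluster-b", applied=["w1"])
discovery = {"primary": "cluster-a"}
new_primary = failover(a, b, discovery)
print(new_primary.name, b.applied, discovery)
```

The ordering is the point of the sketch: fencing comes before promotion so that at no moment do both sides accept writes.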
Cluster-level vs per-node failover
Primary-standby failover happens at different granularities:
- Per-node failover (inside a single cluster) — e.g. MySQL primary dies, a read-replica is promoted. Typically automated, measured in seconds to minutes.
- Cluster-level failover (between two whole clusters) — the Pinterest HBase shape: all of cluster A fails, cluster B becomes the new primary for all traffic (Source: sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest). Usually human-driven, measured in minutes to hours; the blast radius of a wrong flip is the whole fleet.
- Region-level failover — cross-region DR. Even coarser; often involves DNS, rehydrated caches, traffic-engineering changes.
The coarser the unit, the lower the frequency but the larger the consequences of a flip. Cluster-level failover at Pinterest was the escape valve for whole-cluster HBase incidents but was not a routine operation.
Why it works
- Clean blast-radius separation. The standby is an independent failure domain — bad releases, configuration pushes, or JVM hangs on the primary don't propagate.
- Offline workloads can ride the standby. Backups and resource-intensive scans run on the standby without disturbing the primary's p99 latency.
- WAL replication keeps RPO bounded. Standby is at most a few seconds behind primary at steady state.
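The bounded-RPO point can be made concrete as a lag check an operator might run before promoting. This is a minimal sketch, assuming each side exposes the timestamp of the last WAL entry it wrote (primary) or applied (standby); the function names and the 5-second threshold are illustrative assumptions, not values from the source.

```python
def replication_lag_seconds(primary_last_write_ts: float,
                            standby_last_applied_ts: float) -> float:
    """How far the standby trails the primary, clamped at zero."""
    return max(0.0, primary_last_write_ts - standby_last_applied_ts)


def within_rpo(lag_seconds: float, rpo_seconds: float = 5.0) -> bool:
    """True if promoting now would lose at most `rpo_seconds` of writes."""
    return lag_seconds <= rpo_seconds


lag = replication_lag_seconds(primary_last_write_ts=100.0,
                              standby_last_applied_ts=97.5)
print(lag, within_rpo(lag))
```

If the lag exceeds the target, promoting immediately trades a larger data-loss window for faster recovery.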
Tradeoffs
- 2× cost. Two full copies of the infrastructure. Combined with three-way intra-cluster replication on each side, Pinterest's HBase shape reached 6 replicas per record (2 clusters × 3 replicas); see concepts/replica-cost-tradeoff.
- Failover is the rare-path code. Precisely because it is rarely exercised, it tends to rot unless actively drilled (see concepts/always-be-failing-over).
- Split-brain risk if fencing is imperfect. Dual-primary writes during a failed flip can silently corrupt state.
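One common way to limit the split-brain risk is a fencing token: every promotion hands the new primary a strictly larger token, and the storage layer rejects writes carrying an older one. The sketch below is a generic illustration of that technique, not something the source says Pinterest used; `FencedStore` and `StaleTokenError` are hypothetical names.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class FencedStore:
    def __init__(self):
        self.highest_token = 0  # largest fencing token observed so far
        self.data = {}

    def write(self, token: int, key: str, value: str) -> None:
        # A demoted primary still holds its old, smaller token, so its
        # late writes are rejected instead of overwriting newer state.
        if token < self.highest_token:
            raise StaleTokenError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


store = FencedStore()
store.write(1, "row", "from-old-primary")      # before the flip
store.write(2, "row", "from-new-primary")      # after promotion
try:
    store.write(1, "row", "late-write")        # old primary wakes up
except StaleTokenError as err:
    print("rejected:", err)
```

The check only helps if every write path goes through it; a writer that bypasses the token comparison can still corrupt state.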
Seen in
- sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest: canonical wiki instance of cluster-level primary-standby failover. Pinterest's HBase production deployments ran as primary-standby pairs inter-replicated by WAL; "Upon failure of the primary cluster, a cluster-level failover is performed to switch the primary and standby clusters."
Related
- concepts/wal-replication — the replication substrate that keeps the standby current.
- concepts/availability-vs-data-loss-tradeoff — the RPO/RTO dial failover is sitting on top of.
- concepts/always-be-failing-over — the discipline that keeps failover working when it's needed.
- patterns/primary-standby-wal-replication — the deployment shape this failover operates on.
- systems/hbase — the canonical substrate at Pinterest.