CONCEPT Cited by 1 source

Primary-standby failover¶

Primary-standby failover is the operational move of promoting a previously-passive standby cluster (or node, or region) to the primary role when the current primary becomes unavailable. The granularity of the failover — node / cluster / region — determines its risk profile and latency.

Definition¶

Two deployments hold a full copy of the data:

Primary — receives live traffic, is the authoritative source of writes.
Standby — kept in sync via replication (e.g. WAL replication — see concepts/wal-replication). Does not serve live writes; may serve read-only offline workflows, backups, or bulk scans.

On primary failure, ops (or automation) promotes the standby to primary. This requires:

Stopping the old primary (fencing) to prevent split-brain writes.
Draining any in-flight replication from old → new.
Updating client configuration / DNS / service discovery so traffic goes to the new primary.
Optionally, later rebuilding a new standby.

Cluster-level vs per-node failover¶

Primary-standby failover happens at different granularities:

Per-node failover (inside a single cluster) — e.g. MySQL primary dies, a read-replica is promoted. Typically automated, measured in seconds to minutes.
Cluster-level failover (between two whole clusters) — the Pinterest HBase shape: all of cluster A fails, cluster B becomes the new primary for all traffic (Source: sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest). Usually human-driven, measured in minutes to hours; the blast radius of a wrong flip is the whole fleet.
Region-level failover — cross-region DR. Even coarser; often involves DNS, rehydrated caches, traffic-engineering changes.

The coarser the unit, the lower the frequency but the larger the consequences of a flip. Cluster-level failover at Pinterest was the escape valve for whole-cluster HBase incidents but was not a routine operation.

Why it works¶

Clean blast-radius separation. The standby is an independent failure domain — bad releases, configuration pushes, or JVM hangs on the primary don't propagate.
Offline workloads can ride the standby. Backups and resource- intensive scans run on standby without disturbing primary p99.
WAL replication keeps RPO bounded. Standby is at most a few seconds behind primary at steady state.

Tradeoffs¶

2× cost. Two full copies of the infra. Combined with three-way intra-cluster replication on each side, Pinterest's HBase shape hit 6 replicas per record — see concepts/replica-cost-tradeoff.
Failover is the rare-path code. Exactly because it's rarely exercised, it tends to rot — unless actively drilled (see concepts/always-be-failing-over).
Split-brain risk if fencing is imperfect. Dual-primary writes during a failed flip can silently corrupt state.

Seen in¶

sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest — canonical wiki instance of cluster-level primary-standby failover. Pinterest's HBase production deployments ran as primary
standby pairs inter-replicated by WAL; "Upon failure of the primary cluster, a cluster-level failover is performed to switch the primary and standby clusters."

concepts/wal-replication — the replication substrate that keeps the standby current.
concepts/availability-vs-data-loss-tradeoff — the RPO/RTO dial failover is sitting on top of.
concepts/always-be-failing-over — the discipline that keeps failover working when it's needed.
patterns/primary-standby-wal-replication — the deployment shape this failover operates on.
systems/hbase — the canonical substrate at Pinterest.