Skip to content

CONCEPT Cited by 1 source

Primary-standby failover

Primary-standby failover is the operational move of promoting a previously-passive standby cluster (or node, or region) to the primary role when the current primary becomes unavailable. The granularity of the failover — node / cluster / region — determines its risk profile and latency.

Definition

Two deployments hold a full copy of the data:

  • Primary — receives live traffic, is the authoritative source of writes.
  • Standby — kept in sync via replication (e.g. WAL replication — see concepts/wal-replication). Does not serve live writes; may serve read-only offline workflows, backups, or bulk scans.

On primary failure, ops (or automation) promotes the standby to primary. This requires:

  1. Stopping the old primary (fencing) to prevent split-brain writes.
  2. Draining any in-flight replication from old → new.
  3. Updating client configuration / DNS / service discovery so traffic goes to the new primary.
  4. Optionally, later rebuilding a new standby.

Cluster-level vs per-node failover

Primary-standby failover happens at different granularities:

  • Per-node failover (inside a single cluster) — e.g. MySQL primary dies, a read-replica is promoted. Typically automated, measured in seconds to minutes.
  • Cluster-level failover (between two whole clusters) — the Pinterest HBase shape: all of cluster A fails, cluster B becomes the new primary for all traffic (Source: sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest). Usually human-driven, measured in minutes to hours; the blast radius of a wrong flip is the whole fleet.
  • Region-level failover — cross-region DR. Even coarser; often involves DNS, rehydrated caches, traffic-engineering changes.

The coarser the unit, the lower the frequency but the larger the consequences of a flip. Cluster-level failover at Pinterest was the escape valve for whole-cluster HBase incidents but was not a routine operation.

Why it works

  • Clean blast-radius separation. The standby is an independent failure domain — bad releases, configuration pushes, or JVM hangs on the primary don't propagate.
  • Offline workloads can ride the standby. Backups and resource- intensive scans run on standby without disturbing primary p99.
  • WAL replication keeps RPO bounded. Standby is at most a few seconds behind primary at steady state.

Tradeoffs

  • 2× cost. Two full copies of the infra. Combined with three-way intra-cluster replication on each side, Pinterest's HBase shape hit 6 replicas per record — see concepts/replica-cost-tradeoff.
  • Failover is the rare-path code. Exactly because it's rarely exercised, it tends to rot — unless actively drilled (see concepts/always-be-failing-over).
  • Split-brain risk if fencing is imperfect. Dual-primary writes during a failed flip can silently corrupt state.

Seen in

  • sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest — canonical wiki instance of cluster-level primary-standby failover. Pinterest's HBase production deployments ran as primary
  • standby pairs inter-replicated by WAL; "Upon failure of the primary cluster, a cluster-level failover is performed to switch the primary and standby clusters."
Last updated · 550 distilled / 1,221 read