

Holistic failure detection via replicas

Definition

Holistic failure detection (in the Orchestrator sense) is a primary-outage detection scheme that triangulates the failure observer's own reachability to the primary with the existing replicas' reachability to the primary. A failover is declared only when the observer and all replicas agree the primary is unreachable. Replicas are a free, authoritative second opinion because they are already connected over the MySQL protocol, continuously pulling the primary's binary log.

Shlomi Noach's framing:

"orchestrator asks: Am I failing to communicate with the primary? And, are all replicas failing to communicate with the primary? If, for example, orchestrator is unable to reach the primary, but can reach the replicas, and the replicas are all happy and confident that they can read from the primary, then orchestrator concludes there's no failure scenario." (Source: sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings)

Why it works

A single-endpoint health check (a SELECT 1, a status-variable read, a TCP probe on :3306) fails the same way under three very different conditions:

  1. Primary genuinely crashed.
  2. Transient packet drop between observer and primary.
  3. Network partition isolating the observer (but not the replicas) from the primary.

Conditions 2 and 3 are false positives that a naive detector would turn into an unnecessary failover. Retry-and-wait loops mitigate condition 2 but not condition 3, and they add per-check latency if the primary really is down.
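For contrast, a retry-and-wait detector might look like the following Go sketch (illustrative names, not orchestrator code):

```go
package detect

import (
	"database/sql"
	"time"
)

// naivePrimaryIsDead is the retry-and-wait pattern. If the primary really
// is down, it burns retries × interval before answering; if only the
// observer is partitioned away (condition 3), every retry fails and it
// still returns a false positive.
func naivePrimaryIsDead(db *sql.DB, retries int, interval time.Duration) bool {
	for i := 0; i < retries; i++ {
		if err := db.Ping(); err == nil { // stands in for SELECT 1 / a TCP probe
			return false
		}
		time.Sleep(interval) // absorbs transient drops (condition 2) only
	}
	return true
}
```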

Holistic detection sidesteps the false-positive problem by asking a different population. The replicas, sitting on persistent TCP connections and pulling the binlog, hold continuous liveness evidence the observer lacks. If orchestrator can't reach the primary but the replicas can, the problem lies in the orchestrator-to-primary path, not the primary itself.
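Concretely, a replica's view is already materialised in MySQL's replication state: SHOW REPLICA STATUS (SHOW SLAVE STATUS before 8.0.22) reports whether the IO thread currently holds a live connection to the primary. A sketch of polling one replica for that evidence; the function name and error handling are mine:

```go
package detect

import (
	"database/sql"
	"fmt"
)

// replicaSeesPrimary asks one replica whether its IO thread currently has
// a live connection to the primary and is pulling binlogs.
// Replica_IO_Running is "Yes" exactly when it does (the column is named
// Slave_IO_Running before MySQL 8.0.22).
func replicaSeesPrimary(db *sql.DB) (bool, error) {
	rows, err := db.Query("SHOW REPLICA STATUS")
	if err != nil {
		return false, err
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		return false, err
	}
	if !rows.Next() {
		return false, fmt.Errorf("server is not configured as a replica")
	}
	// SHOW REPLICA STATUS has many columns; scan them all, keep the one we need.
	vals := make([]sql.RawBytes, len(cols))
	ptrs := make([]any, len(cols))
	for i := range vals {
		ptrs[i] = &vals[i]
	}
	if err := rows.Scan(ptrs...); err != nil {
		return false, err
	}
	for i, c := range cols {
		if c == "Replica_IO_Running" {
			return string(vals[i]) == "Yes", nil
		}
	}
	return false, fmt.Errorf("Replica_IO_Running not found in output")
}
```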

The observation is single-shot, not retried

A load-bearing design property:

"orchestrator doesn't do check intervals and a number of tests. It needs a single observation to act. Behind the scenes, orchestrator relies on the replicas themselves to run retries in intervals; that's how MySQL replication works anyhow, and orchestrator utilizes that." (Source: sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings)

Because replicas already retry internally (MySQL replication's own reconnect machinery), the retry burden doesn't belong in orchestrator. Orchestrator fires on a single agreement between its own observation and the replicas' current state — avoiding the "how many retries × what interval" tuning problem entirely.

Contrast with multi-endpoint quorum

A traditional multi-endpoint quorum approach deploys probes in multiple AZs and requires a majority to agree the primary is down. This works but imposes operational cost:

  • Probes must be placed carefully across AZs to produce sensible quorum results; misplaced probes produce wrong decisions.
  • Each probe is still a single-endpoint health check with its own false-positive profile.
  • Adding probes is an infrastructure decision; adding replicas is a capacity decision the team is already making.

Holistic detection reuses the replicas that already exist for read-scale-out. Orchestrator doesn't deploy a probe mesh — it queries the population that's already there.
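For comparison, the conventional scheme reduces to majority counting over probe votes. A sketch with illustrative names:

```go
package detect

// quorumPrimaryIsDead is the conventional multi-probe scheme: N
// independently placed probes each run their own single-endpoint check,
// and a majority of failures declares the primary down. Every vote still
// carries the false-positive profile of its own probe path.
func quorumPrimaryIsDead(probeFailed []bool) bool {
	failures := 0
	for _, f := range probeFailed {
		if f {
			failures++
		}
	}
	return failures > len(probeFailed)/2
}
```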

Graceful degradation for partial failures

The "and all replicas" quorum handles compound failures too:

"Possibly some of the replicas themselves are unreachable: maybe a network partitioning or some power failure took both primary and a few of the replicas. orchestrator can still reach a conclusion by the state of all available replicas." (Source: sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings)

Orchestrator does not require all originally-known replicas to respond — it reasons from the state of the reachable subset. A partition that takes the primary plus some replicas still yields a consistent verdict from the remaining replicas.
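Extending the earlier predicate to this compound-failure case takes one guard clause: an unreachable replica's vote is simply absent, and the verdict is computed over whatever subset responded. Illustrative names again:

```go
package detect

// ReplicaObservation adds reachability: a replica the observer cannot
// poll contributes nothing to the verdict.
type ReplicaObservation struct {
	Reachable              bool // could the observer poll this replica at all?
	ReplicatingFromPrimary bool // if reachable: is its IO thread reading from the primary?
}

// primaryIsDeadDegraded reaches a verdict from whatever subset of
// replicas is still reachable.
func primaryIsDeadDegraded(observerCanReachPrimary bool, replicas []ReplicaObservation) bool {
	if observerCanReachPrimary {
		return false
	}
	for _, r := range replicas {
		if !r.Reachable {
			continue // possibly lost with the primary; its vote is absent, not negative
		}
		if r.ReplicatingFromPrimary {
			return false
		}
	}
	return true
}
```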

Granularity benefit

Holistic detection also exposes severity, not just up/down (see the sketch after this list):

  • Orchestrator sees the primary as down, but replicas can still read from it → suspicious: run an emergency probe, don't fail over yet.
  • Orchestrator sees the primary as up, but replicas report they can't read from it → still a failure signal worth investigating.
  • Chained-topology case: an intermediate replica's failure is detected by the same mechanism, since its own replicas report it as down. The model generalises to multi-tier replication trees without extra logic.
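Mapping those observation pairs onto verdicts, as a sketch. The verdict names here are illustrative, though orchestrator's real analysis codes (DeadMaster, UnreachableMaster, and others) draw similar distinctions:

```go
package detect

// Verdict is an illustrative severity classification, not orchestrator's
// actual analysis-code enumeration.
type Verdict int

const (
	PrimaryHealthy     Verdict = iota
	PrimaryDead                // observer and all replicas agree: recover
	PrimarySuspect             // observer can't see it, replicas can: emergency probe, no failover
	ReplicationSuspect         // observer sees it, replicas can't: investigate
)

func classify(observerCanReachPrimary, anyReplicaSeesPrimary bool) Verdict {
	switch {
	case observerCanReachPrimary && anyReplicaSeesPrimary:
		return PrimaryHealthy
	case !observerCanReachPrimary && !anyReplicaSeesPrimary:
		return PrimaryDead
	case !observerCanReachPrimary:
		return PrimarySuspect
	default:
		return ReplicationSuspect
	}
}
```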

Dependency on orchestrator's own HA

Holistic detection delegates part of the "who decides?" question to the replicas, but orchestrator itself still needs to be reachable to coordinate the failover. The post notes, deferring the details, that orchestrator runs in a highly available setup across AZs with quorum-based leadership, mitigating the case where orchestrator itself is network-isolated.

Seen in

  • sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings — canonical wiki introduction. Noach canonicalises holistic detection as Orchestrator's "different take on triangulation" vs conventional multi-probe quorum; the single-observation-per-agent design as a consequence of delegating retries to MySQL's own replication machinery; the graceful-degradation story for compound failures; the severity-granularity benefit; and the generalisation to intermediate-replica failure detection in chained topologies.