Multi-endpoint quorum health check

Pattern

Deploy health-check probes at multiple endpoints (typically spread across availability zones), and act on a failure signal only when a majority (a quorum) of endpoints agree the target is unreachable. This reduces the false-positive rate a single probe suffers when the network path between probe and target is itself flaky.

"Some monitoring solutions run health checks from multiple endpoints, and require a quorum, an agreement of the majority of check endpoints that there is indeed a problem. This kind of setup must be used with care; the placement of the endpoints in different availability zones is critical to achieve sensible quorum results. Once that's done, though, the triangulation is powerful and useful." (Source: sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings)

Why quorum

A single health-check endpoint has the same failure surface as any other single network path:

  • Transient packet drop → false positive.
  • Local network issue → false positive.
  • Endpoint host itself partitioned from target → false positive.

A majority-of-N agreement suppresses these single-path failures: a real target-side outage is visible to all N endpoints at once, whereas ⌊N/2⌋ + 1 endpoints all suffering uncorrelated network issues simultaneously is far less likely.
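To put numbers on that claim, here is a back-of-envelope calculation assuming each probe false-fires independently with probability p (an idealization that only holds when probes sit in separate failure domains):

```python
# P(quorum false positive) = P(at least floor(n/2)+1 of n independent
# probes falsely report failure), a binomial tail.
from math import comb

def quorum_false_positive(n: int, p: float) -> float:
    q = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(q, n + 1))

p = 0.01  # assumed 1% chance a single probe's path is flaky during a check
print(f"single probe : {p:.2e}")                            # 1.00e-02
print(f"2-of-3 quorum: {quorum_false_positive(3, p):.2e}")  # ~2.98e-04
print(f"3-of-5 quorum: {quorum_false_positive(5, p):.2e}")  # ~9.85e-06
```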

The AZ-placement trap

The post flags a load-bearing caveat:

"The placement of the endpoints in different availability zones is critical to achieve sensible quorum results."

If all probes sit in a single AZ, a quorum of probes failing carries no more information than a single probe failing: one AZ-level or shared-path problem takes out every probe at once and satisfies the quorum regardless of the target's actual state. Quorum only buys independence if the probes are in independent failure domains.

Common failure: probe mesh deployed in one AZ for operational convenience, target in another. When the probe AZ has an outage, all probes fail, quorum fires, unnecessary failover triggers.
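One cheap guard is a deploy-time check that the probe mesh actually spans distinct failure domains. A sketch, with illustrative function and parameter names:

```python
# Refuse to deploy a probe mesh whose members do not span enough
# distinct availability zones for quorum to mean anything.
def validate_probe_placement(probe_azs: list[str], min_distinct_azs: int = 3) -> None:
    distinct = set(probe_azs)
    if len(distinct) < min_distinct_azs:
        raise ValueError(
            f"probe mesh spans only {sorted(distinct)}; "
            f"need >= {min_distinct_azs} distinct AZs for quorum independence"
        )

validate_probe_placement(["us-east-1a", "us-east-1b", "us-east-1c"])  # ok
# validate_probe_placement(["us-east-1a"] * 3)  # raises: probes are correlated
```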

Comparison with holistic detection (Orchestrator's approach)

Multi-endpoint quorum requires deploying a probe mesh. Orchestrator's holistic detection reuses the replicas that already exist for other reasons — the cluster already has replicas pulling binlog, so the extra observation points come for free. Noach presents holistic detection as a different take on triangulation:

"Orchestrator uses a different take on triangulation. It recognizes that there are more players in the field: the replicas."

Trade-offs:

| Dimension | Multi-endpoint quorum | Holistic detection (via replicas) |
| --- | --- | --- |
| Probe infrastructure | Separate probe fleet | Reuses existing replicas |
| Placement concern | Cross-AZ placement critical | Replicas already distributed per HA design |
| Failure independence | Must be engineered | Inherited from replica placement |
| Retry requirement | Per-probe retry logic | Delegated to MySQL replication's built-in retries |
| Generalises to chained replication | No (single target) | Yes (each node's replicas observe it) |

For MySQL-specific deployments, holistic detection dominates. For targets without an equivalent "already-connected observer" population (e.g. a standalone service), multi-endpoint quorum remains the go-to pattern.
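For contrast, the holistic decision rule can be paraphrased in a few lines. This is a sketch of the idea described in the quoted post, not Orchestrator's actual code; the type and field names are illustrative:

```python
# Holistic detection, paraphrased: the detector's own failed probe is
# necessary but not sufficient. The replicas, already connected to the
# master for replication, must corroborate the failure.
from dataclasses import dataclass

@dataclass
class ReplicaView:
    host: str
    replication_broken: bool  # replica lost its connection to the master

def master_is_dead(my_probe_failed: bool, replicas: list[ReplicaView]) -> bool:
    """Declare failure only when every observer agrees: the detector's
    probe failed AND all replicas report broken replication."""
    return (
        my_probe_failed
        and len(replicas) > 0
        and all(r.replication_broken for r in replicas)
    )
```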

When to use

  • Stateless services with no persistent-connection consumer population to observe liveness.
  • Multi-region DNS / CDN / load-balancer health checks where the probe mesh is already part of the routing layer.
  • Any service where the target-connectedness signal you care about is "can a fresh connection reach it" and not "is an existing connection still alive".

When not to use

  • MySQL clusters with replicas — prefer holistic detection.
  • Databases with stable connection-pooling layers (app-side) that already provide rich liveness signal.
  • Systems where AZ-independence for probes is hard to guarantee (single-datacenter deployments) — quorum gives no benefit over a single probe.
