Replication restart as liveness probe¶
Pattern¶
When holistic detection shows that orchestrator cannot reach the primary, that the replicas can still reach it, but that replica lag is growing, do not trust the replicas' reachability signal. Instead, emergently restart replication on all replicas, forcing them to close and reopen their TCP connections to the primary. If the primary is locked or has exhausted its connection limit, the reconnection attempts fail and the system transitions into a normal primary-outage detection state.
Shlomi Noach's framing:
"orchestrator cannot reach the primary, replicas can all reach the primary, but lag on replicas is ever increasing. This may be a limbo scenario caused by either a locked primary, or a 'too many connections' situation. The replicas are likely to be some of the oldest connections to the primary. New connections cannot reach the primary and to the app it seems down, but replicas are still connected. orchestrator can analyze that and emergently kick a replication restart on all replicas. This closes and reopens the TCP connections between replicas and primary. On locked primary or on 'too many connections' scenarios, replicas are expected to fail reconnecting, leading to a normal detection of a primary outage." (Source: sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings)
Why naive probes fail in this scenario¶
The replicas' connections to the primary are typically the oldest connections in the MySQL connection table — established at replica-setup time and kept alive via heartbeats and the ongoing binlog fetch. When the primary is either:
- Locked (a long-running transaction is holding a critical lock, so new queries block indefinitely), or
- At its `max_connections` ceiling (all connection slots used; new clients get `ER_CON_COUNT_ERROR`),
the existing replica connections keep working because they're already accepted. But new connections from application servers or from orchestrator itself fail — the app sees an unresponsive primary; orchestrator sees TCP refused or timed out.
Holistic detection, taken at face value, would say "primary is up — replicas see it" and suppress the failover. That's the wrong answer, because the primary is effectively useless to everything except the already-connected replicas. The growing replica lag is the tell: replicas can read binlog position updates, but can't keep up because their own apply threads may be blocked behind the same underlying problem that's blocking new connections (or simply because the primary is throttled).
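To make the asymmetry concrete, here is a minimal Go sketch of what a fresh connection (the app's-eye view) runs into in each failure mode. This is not orchestrator code: the DSN, the 3-second timeout, and the mapping of MySQL error 1040 to `ER_CON_COUNT_ERROR` via the `go-sql-driver/mysql` error type are illustrative assumptions.

```go
package probe

import (
	"context"
	"database/sql"
	"errors"
	"time"

	"github.com/go-sql-driver/mysql"
)

// freshConnectionProbe opens a brand-new connection the way an application
// server would, so it hits the same wall that new clients hit.
func freshConnectionProbe(dsn string) error {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return err
	}
	defer db.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	err = db.PingContext(ctx)
	var myErr *mysql.MySQLError
	switch {
	case err == nil:
		return nil // a fresh connection works: the primary is genuinely up
	case errors.As(err, &myErr) && myErr.Number == 1040:
		return err // ER_CON_COUNT_ERROR: the max_connections ceiling is hit
	default:
		return err // a timeout here is consistent with a locked primary
	}
}
```

Either failing outcome is invisible to the already-connected replicas, which is exactly the blind spot holistic detection inherits.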
The probe mechanism¶
- Orchestrator issues `STOP SLAVE` then `START SLAVE` (or the MySQL 8.0 equivalent `STOP REPLICA` / `START REPLICA`) on each replica. This closes the existing replication connection and opens a new one.
- If the primary is healthy: all replicas reconnect successfully and resume pulling binlogs. Orchestrator sees replication resume and lag stop growing; the system was healthy, so no failover.
- If the primary is locked or at its connection limit: reconnections fail, the replicas now correctly report that they cannot reach the primary, holistic detection now has consensus, and normal failover triggers.
The probe deliberately collapses the asymmetry in a controlled way: orchestrator destroys the stale "primary is up" signal to see whether a fresh observation reproduces it.
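A minimal sketch of the per-replica probe step, assuming MySQL 8.0.22+ statement syntax and the `performance_schema.replication_connection_status` table; `restartAndCheck` and the 2-second settle time are hypothetical, not orchestrator's actual implementation:

```go
package probe

import (
	"context"
	"database/sql"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// restartAndCheck closes and reopens one replica's connection to the primary
// by restarting the replication IO thread, then reports whether it reconnected.
func restartAndCheck(ctx context.Context, replicaDSN string) (bool, error) {
	db, err := sql.Open("mysql", replicaDSN)
	if err != nil {
		return false, err
	}
	defer db.Close()

	// MySQL 8.0.22+ syntax; older servers use STOP SLAVE / START SLAVE.
	if _, err := db.ExecContext(ctx, "STOP REPLICA IO_THREAD"); err != nil {
		return false, err
	}
	if _, err := db.ExecContext(ctx, "START REPLICA IO_THREAD"); err != nil {
		return false, err
	}

	// Give the IO thread a moment to attempt a fresh TCP connection, then
	// read its state from performance_schema rather than parsing SHOW output.
	time.Sleep(2 * time.Second)
	var ioState string
	err = db.QueryRowContext(ctx,
		"SELECT SERVICE_STATE FROM performance_schema.replication_connection_status LIMIT 1",
	).Scan(&ioState)
	if err != nil {
		return false, err
	}
	return ioState == "ON", nil
}
```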
Why restart replicas rather than restart one replica¶
The scenario description says "all replicas". Restarting only one would:
- Leave the other replicas on stale connections, so holistic detection continues to see "replicas say primary is up".
- Risk the single restarted replica reconnecting successfully by coincidence (e.g. one connection slot freed up briefly), producing a false negative.
Restarting all replicas closes the asymmetry comprehensively: either they all reconnect (primary was fine), or none of them do (primary is genuinely unreachable from fresh connections).
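Continuing the hypothetical sketch above (same package, same `restartAndCheck`), an aggregation step fans the restart out to every replica and escalates only when none of them achieves a fresh connection:

```go
// probeAll restarts replication on every replica concurrently and reports
// whether the primary is unreachable from fresh connections.
func probeAll(ctx context.Context, replicaDSNs []string) bool {
	results := make(chan bool, len(replicaDSNs))
	for _, dsn := range replicaDSNs {
		go func(dsn string) {
			ok, err := restartAndCheck(ctx, dsn)
			results <- err == nil && ok
		}(dsn)
	}
	anyReconnected := false
	for range replicaDSNs {
		if <-results {
			anyReconnected = true
		}
	}
	// One successful reconnect proves fresh connections can reach the
	// primary, so the "primary is up" signal is genuine: do not fail over.
	return !anyReconnected
}
```

Requiring unanimous failure is the point of restarting all replicas: a single successful reconnect is proof that fresh connections can reach the primary, which is exactly the evidence the stale connections could not supply.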
Cost¶
Replication restart is not free:
- Each replica briefly stops fetching the binlog, a blip of a few seconds in replication freshness.
- If the primary is healthy, the probe costs nothing in outcome but adds connection churn and a brief replication pause.
- If the primary has a transient connection issue that resolves between the probe and the reconnect, the probe forces an unnecessary reconnection (which will succeed, so no failover).
The cost is judged acceptable because the alternative — waiting for normal probe cadence to catch up — extends total application-observable outage time by seconds to minutes, and the scenario (locked primary) is a real production failure mode.
Relationship to the holistic-detection signal¶
This pattern is the third scenario in Noach's emergency-probe taxonomy — see concepts/emergency-failure-probe. It is triggered when:
- Orchestrator's own probe fails.
- Replicas' probe succeeds.
- Replica lag is climbing (not stable).
The lag signal is load-bearing — without it, orchestrator would assume the asymmetry is purely orchestrator-side (which is the scenario 1 case, handled by accelerated re-probing of replicas).
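As a sketch, the trigger reduces to a three-signal conjunction. The type and function below are hypothetical, not orchestrator's internals:

```go
package probe

// ReplicaObservation is a hypothetical summary of one polling round.
type ReplicaObservation struct {
	ReachesPrimary bool    // IO thread reports a live connection to the primary
	LagSeconds     float64 // replication lag now
	PrevLagSeconds float64 // replication lag at the previous round
}

// shouldEmergencyRestart encodes the third scenario: orchestrator's own probe
// failed, every replica still claims a live connection, and lag is climbing.
func shouldEmergencyRestart(primaryProbeOK bool, replicas []ReplicaObservation) bool {
	if primaryProbeOK {
		return false // no asymmetry: the primary answered orchestrator directly
	}
	for _, r := range replicas {
		if !r.ReachesPrimary {
			return false // replicas agree the primary is down: normal detection applies
		}
		if r.LagSeconds <= r.PrevLagSeconds {
			return false // lag not climbing: likely an orchestrator-side network issue
		}
	}
	return len(replicas) > 0
}
```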
When to use¶
- MySQL clusters with replicas readable by orchestrator over MySQL protocol (Vitess, vanilla Orchestrator, Group Replication setups).
- Environments where `max_connections` exhaustion is a real failure mode (high-concurrency apps behind connection poolers; see concepts/max-connections-ceiling).
- Any topology where "primary locked on long transaction" is a recurring incident class.
When not to use¶
- If replicas cannot be restarted cheaply (e.g. semi-sync setups where the restart affects commit ack latency — the probe cost may exceed acceptable collateral impact).
- If you cannot distinguish replica lag growing due to locked primary from replica lag growing due to replica-side problems (a slow replica apply thread, replica CPU saturation). The lag-growing-but-connection-alive signal is only useful when you can attribute it to the primary.
Seen in¶
- sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings — canonical wiki instance. Noach lays out the locked-primary / too-many-connections scenario, the insight that replicas hold the oldest connections and are therefore the last to notice a new-connection-failure state, and the replication-restart manoeuvre as the emergency probe. The post positions this as one of three emergency-probe scenarios Orchestrator supports.
Related¶
- concepts/holistic-failure-detection-via-replicas — the steady-state detection model this probe unblocks
- concepts/emergency-failure-probe — the broader concept this is one instantiation of
- concepts/anti-flapping — the rate-limit layer that runs after this probe confirms a real failure
- concepts/binlog-replication — the substrate whose connection state is being tested
- concepts/max-connections-ceiling — one of the failure modes this probe surfaces
- systems/orchestrator
- systems/vtorc
- systems/mysql
- patterns/multi-endpoint-quorum-health-check — the alternative approach for detecting this scenario (deploy a probe that simulates a fresh app-side connection)