
Orchestrator failure detection and recovery: New Beginnings

Shlomi Noach's 2020 post (republished on the PlanetScale blog) describing how Orchestrator detects MySQL primary failures and how the Vitess-integrated fork transforms it from a topology observer into a goal-driven operator. The post is the canonical disclosure of two mechanisms that later became load-bearing in Vitess / VTOrc failure handling: (1) the holistic / triangulated detection model that cross-references orchestrator's own observations with what the replicas see, and (2) the cluster-aware, goal-oriented behaviour that converges a cluster to the state Vitess expects rather than just repairing what's visibly broken.

Summary

Orchestrator is an open-source MySQL replication-topology-management and HA tool; Vitess recently integrated a specialised Vitess-aware fork as a native component. In pure-MySQL async replication the critical failure scenario is a primary outage (crashed or network-isolated), and the naive approaches — single-endpoint health checks, retry-and-wait, multi-probe quorum — each trade false positives against failover latency.

Orchestrator takes a different approach: it triangulates its own primary reachability with the replicas' view of the primary (they are already connected over the MySQL protocol, pulling binlogs). A failover is declared only when orchestrator and all replicas agree the primary is down — a single observation per agent, with no retry intervals, because MySQL replication's own retry machinery is reused as the liveness signal. Emergency probes speed up the edge cases: orchestrator can't see the primary but replicas still think it's up; one replica reports a lost primary while others disagree; replicas can reach the primary but lag is growing (the locked-primary or too-many-connections case). For the last scenario the post describes a specific trick: emergently restart replication on all replicas, which closes and reopens the MySQL connections; if the primary is truly locked or at its connection limit, the replicas will fail to reconnect and normal failover logic takes over.

Pre-integration, orchestrator and Vitess collaborated via pre- and post-recovery hook scripts, which led to split or co-primary states when events were lost. Integration eliminates that seam: the Vitess-integrated orchestrator reads MySQL metadata directly from the Vitess topology server and becomes goal-driven — it converges the cluster to Vitess's declared intent rather than merely repairing what is visibly broken. The post closes with promotion policy: the choice of which replica to promote is now user-programmable via code-defined failover/recovery policies, extending the prior fixed-mode approach.
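
A minimal sketch of the triangulated detection predicate, under invented types (`ReplicaView` and `PrimaryObservation` are illustrative, not orchestrator's actual schema): failover is declared only when orchestrator's own probe failed and every replica also reports a broken connection to the primary. One observation per agent, no retry loop, because MySQL replication's retry machinery already provides the filtering.

```go
package detection

// ReplicaView is what a replica reports about its primary, as seen in
// orchestrator's last probe of that replica. Illustrative type only.
type ReplicaView struct {
	IOThreadRunning bool // replication IO thread still connected to the primary
}

// PrimaryObservation combines orchestrator's own probe with the replicas' views.
type PrimaryObservation struct {
	OrchestratorCanReach bool          // did orchestrator's direct probe succeed?
	Replicas             []ReplicaView // what each replica currently reports
}

// DeadPrimary implements the triangulated predicate: orchestrator cannot
// reach the primary AND all replicas agree it is gone. No retry intervals;
// MySQL replication's own retries act as the low-pass filter.
func DeadPrimary(obs PrimaryObservation) bool {
	if obs.OrchestratorCanReach {
		return false // orchestrator still sees it: not a failure
	}
	for _, r := range obs.Replicas {
		if r.IOThreadRunning {
			return false // a replica still connected: emergency probes, not failover
		}
	}
	return true // unanimous: declare primary failure
}
```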

Key takeaways

  1. Single-endpoint health check is unreliable (Source: sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings): probing :3306, running SELECT 1 FROM DUAL, or reading a status variable all fail the same way — a non-response could be a crashed primary, a network glitch, or a packet drop. Retry loops mitigate false positives but add per-check latency; Noach enumerates four questions operators face: "exactly when is enough tests? What is a reasonable check interval? What if the primary is really down? What if the problem is with the network between the primary and our testing endpoint?"
  2. Orchestrator's triangulation uses replicas as additional probe points. Replicas are already connected to the primary over the MySQL protocol and continuously pulling the binary log — their connection state is a free, authoritative liveness signal. Orchestrator asks two questions: "Am I failing to communicate with the primary? And are all replicas failing to communicate with the primary?" A failover fires only when both answers are yes. See concepts/holistic-failure-detection-via-replicas.
  3. Single observation per agent is sufficient — no retry intervals. "Orchestrator doesn't do check intervals and a number of tests. It needs a single observation to act. Behind the scenes, orchestrator relies on the replicas themselves to run retries in intervals; that's how MySQL replication works anyhow, and orchestrator utilizes that." The post reuses MySQL's own replication-retry behaviour as the low-pass filter, removing the need for orchestrator-level retry tuning.
  4. Emergency probes accelerate edge cases. Three scenarios and their responses (Source: sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings):
    • Orchestrator can't see the primary, replicas disagree → emergently re-probe the replicas sooner than the normal few-second cadence.
    • A first-tier replica reports lost primary, others disagree → emergently probe the primary directly; if it fails, fall back to the first scenario.
    • Orchestrator can't reach the primary, replicas still say it's up, but replica lag is climbing → this is the locked-primary or too-many-connections case, where the oldest connections (the replicas') still work but new connections fail. See takeaway 5.
  5. Replication-restart is the liveness probe for locked-primary. "Orchestrator can analyze that and emergently kick a replication restart on all replicas. This closes and reopens the TCP connections between replicas and primary. On locked primary or on 'too many connections' scenarios, replicas are expected to fail reconnecting, leading to a normal detection of a primary outage." The replicas are the oldest connections to the primary, which is why they alone still work — resetting them tests the current connection-establishment path. See patterns/replication-restart-as-liveness-probe and the sketch after this list.
  6. Pre-integration: hook-script collaboration caused split-brain. "For the past few years, orchestrator was an external entity to Vitess. The two would collaborate over a few API calls. orchestrator did not have any Vitess awareness, and much of the integration was done through pre- and post- recovery hooks, shell scripts and API calls. This led to known situations where Vitess and orchestrator would compete over a failover, or make some operations unknown to each other, causing confusion. Clusters would end up in split state, or in co-primary state. The loss of a single event could cause cluster corruption." Canonical case study for why hook-script integration at critical-path failure boundaries is structurally fragile — a single dropped event can split cluster state.
  7. Goal-oriented orchestrator: cluster awareness transforms the behavioural model. The integrated fork reads MySQL metadata directly from the Vitess topology server — it knows two servers belong to the same cluster because Vitess says so, not because they happen to be in a replication chain. This lets it address failure modes Orchestrator previously couldn't (Source: sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings):
    • Standalone replica that Vitess says should belong to a cluster → reconnect it (after GTID validation).
    • Writable replica → flip to read-only.
    • Read-only primary → flip to writable.
    • Circular / multi-primary topology → pick the Vitess-declared primary, demote others. "A multi-primary setup is considered to be a failure scenario."
    • Functional cluster with wrong server as primary (left over from a prematurely-terminated failover) → run a graceful takeover / planned-reparent to promote the correct server. This is the most load-bearing of the new behaviours.
  8. Operations either fail or converge — no partial states. "Orchestrator's operations will either fail or converge to the desired state." This is the failure-handling invariant of a goal-driven operator: a partial state (half-flipped topology, half-promoted replica) is not an accepted intermediate — the operator retries until convergence or aborts cleanly. See concepts/goal-oriented-orchestrator and the convergence sketch after this list.
  9. Promotion policy becomes code, not config. For unexpected failover, orchestrator picks a new primary subject to: binary-log enabled, version match with other replicas, server-specific metadata from Vitess, and policy constraints the user defines (stay within DC vs stay across DC, same AZ vs opposite AZ, promote only semi-sync replicas, how to reconfigure post-promotion). "The new integration allows the user to choose a failover and recovery policy, that is described in code. Orchestrator and Vitess already support three pre-configured modes, but will also allow the user to define any arbitrary (within a set of rules) policy they may choose." (More in a follow-up post, not in scope here; a minimal policy sketch follows this list.)
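
A sketch of the emergency replication-restart probe from takeaway 5, assuming direct SQL access to each replica; the helper names are invented here, and orchestrator's real implementation differs. Bouncing replication closes and reopens each replica's connection to the primary, so a locked or connection-saturated primary shows up as reconnection failures that the normal detection path then reads as a dead primary.

```go
package probe

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

// restartReplication bounces a replica's replication threads, closing and
// reopening its connection to the primary. MySQL of the post's era (pre
// 8.0.22) uses the STOP SLAVE / START SLAVE statements, as here.
func restartReplication(ctx context.Context, replica *sql.DB) error {
	if _, err := replica.ExecContext(ctx, "STOP SLAVE"); err != nil {
		return fmt.Errorf("stop: %w", err)
	}
	// START SLAVE returns immediately; the IO thread reconnects (or fails
	// to) asynchronously. A locked or maxed-out primary rejects the fresh
	// connection, and the next probe cycle sees a unanimous dead primary.
	if _, err := replica.ExecContext(ctx, "START SLAVE"); err != nil {
		return fmt.Errorf("start: %w", err)
	}
	return nil
}

// EmergentlyRestartAll fires the probe on every replica in the cluster.
func EmergentlyRestartAll(ctx context.Context, replicas []*sql.DB) {
	for _, r := range replicas {
		_ = restartReplication(ctx, r) // best-effort: errors feed detection, not abort
	}
}
```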
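
A sketch of the goal-driven repair mapping from takeaways 7 and 8, assuming a hypothetical `Intent` read from the Vitess topology server and an `Observed` per-server state from probing (both types invented for illustration). Each anomaly maps to a corrective action, and the caller re-runs the loop until the cluster matches intent or the operation fails outright; no partial state is an accepted stopping point. The case ordering here is illustrative.

```go
package converge

// Intent is what the Vitess topology server declares; Observed is what
// probing found. Both are illustrative, not Vitess's actual API.
type Intent struct{ PrimaryAlias string }

type Observed struct {
	Alias       string
	IsPrimary   bool // acting as primary (writable, no replication source)
	ReadOnly    bool
	Replicating bool // attached to the declared primary
}

type Action string

const (
	Reconnect        Action = "reconnect-replica"  // after GTID validation
	SetReadOnly      Action = "set-read-only"
	SetWritable      Action = "set-writable"
	GracefulTakeover Action = "graceful-takeover"  // planned reparent to the declared primary
	None             Action = "none"
)

// Repair maps one observed server against declared intent to the action
// that moves the cluster toward convergence.
func Repair(intent Intent, s Observed) Action {
	declaredPrimary := s.Alias == intent.PrimaryAlias
	switch {
	case declaredPrimary && s.ReadOnly:
		return SetWritable // read-only primary → flip to writable
	case !declaredPrimary && s.IsPrimary:
		return GracefulTakeover // wrong server acting as primary
	case !declaredPrimary && !s.ReadOnly:
		return SetReadOnly // writable replica → flip to read-only
	case !declaredPrimary && !s.Replicating:
		return Reconnect // standalone server that Vitess says belongs here
	default:
		return None // already converged
	}
}
```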
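
Finally, a sketch of a code-defined promotion policy from takeaway 9: the hard constraints the post lists (binlog enabled, version match) applied first, then a user-supplied predicate such as "semi-sync replicas in the failed primary's datacenter only". Field and function names are invented; the real policy surface is deferred to the follow-up post.

```go
package policy

// Candidate describes a replica eligible for promotion. Illustrative fields.
type Candidate struct {
	Alias         string
	BinlogEnabled bool
	Version       string
	SemiSync      bool
	Datacenter    string
}

// Policy is the user-defined part: given the failed primary's datacenter,
// decide whether a candidate is acceptable.
type Policy func(failedDC string, c Candidate) bool

// SameDCSemiSyncOnly is one example policy: promote only semi-sync
// replicas in the failed primary's datacenter.
func SameDCSemiSyncOnly(failedDC string, c Candidate) bool {
	return c.SemiSync && c.Datacenter == failedDC
}

// PickPrimary applies the hard constraints, then the user policy, and
// returns the first acceptable candidate (ordering/priority elided).
func PickPrimary(failedDC, clusterVersion string, candidates []Candidate, p Policy) (Candidate, bool) {
	for _, c := range candidates {
		if !c.BinlogEnabled || c.Version != clusterVersion {
			continue // hard constraints: binlog enabled, version match
		}
		if p(failedDC, c) {
			return c, true
		}
	}
	return Candidate{}, false // no acceptable candidate: fail rather than half-promote
}
```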

Systems extracted

  • Orchestrator — MySQL replication-topology-management and HA tool; the subject of the post
  • VTOrc / Vitess-integrated Orchestrator fork — the Vitess-native fork described here (post uses "the integrated orchestrator"; the fork is what VTOrc grew into)
  • Vitess — the integrating platform; post canonicalises the integration
  • MySQL — the substrate whose async replication is being managed
  • vttablet — the Vitess-side agent that binds MySQL identity (schema, shard, role) to topology; named in the post as "an agent of sorts"

Concepts extracted

  • concepts/holistic-failure-detection-via-replicas — failure is declared only when orchestrator and all replicas agree the primary is down (takeaway 2)
  • concepts/goal-oriented-orchestrator — operations either fail or converge to the Vitess-declared state (takeaway 8)
  • concepts/mysql-semi-sync-replication — semi-sync primary-outage dynamics, deferred by the post (see caveats)

Patterns extracted

  • patterns/replication-restart-as-liveness-probe — restart replication on all replicas so a locked or connection-saturated primary surfaces as reconnection failures (takeaway 5)

Operational numbers and behaviours

  • Probe cadence: normal per-server probe is "once in a few seconds"; emergency probes fire sooner when needed.
  • Retries per observation: zero at the orchestrator layer — a single failed probe is sufficient. MySQL replication's own retry behaviour provides the low-pass filter.
  • Quorum requirement: orchestrator-cannot-see-primary AND all-replicas-cannot-see-primary. No percentage threshold; all replicas must agree.
  • High-availability deployment: orchestrator itself runs in a highly available setup across AZs and requires quorum leadership before it can run failovers (noted in-post but scope-deferred).
  • Pre-integration failure mode frequency: "known situations" — unquantified, but post flags them as a structural correctness problem, not a tuning nit.

Caveats

  • Async-replication only: post explicitly scopes the discussion to MySQL async replication. Semi-sync (see concepts/mysql-semi-sync-replication) has different primary-outage dynamics that Noach defers.
  • 2020-era snapshot: the integration described here is recent in the post ("we have recently integrated") — the further evolution (VTOrc rebranding, etcd-backed lock, per-shard electors, durability plugin) happens in subsequent posts in the Consensus-at-scale series (see systems/vtorc).
  • Post promises "more on policy in a future post" which is not included — the code-defined failover policy shape is sketched, not fully canonicalised here.
  • Triangulation assumes replicas are reachable from orchestrator — if orchestrator is network-partitioned from both primary and replicas, triangulation degrades to a single-endpoint check, and the post implicitly defers to orchestrator's own HA quorum to avoid split-brain decisions.
  • Multi-primary is defined as a failure scenario — this is an opinionated stance aligned with Vitess's design but would be controversial in group-replication or circular-async deployments. The post notes MySQL supports multiple writable primaries and that the discouragement is a convention, not a protocol limit.

Source

  • sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings