Emergency failure probe¶
Definition¶
An emergency failure probe is an out-of-cadence health check fired when holistic detection yields an asymmetric signal — one observer reports the primary down while others disagree. Instead of waiting for the next scheduled probe cycle (normally "once in a few seconds"), the orchestrator immediately re-probes the disagreeing party to resolve the discrepancy and shorten total outage detection time.
Shlomi Noach's framing:
"If orchestrator can't see the primary, but can see the replicas, and they still think the primary is up, should this be the end of the story? Not quite. We may well have an actual primary outage, it's just that the replicas haven't realized it yet. If we wait long enough, they will eventually report the failure; but orchestrator wishes to reduce total outage time by resolving the situation as early as possible." (Source: sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings)
The three scenarios¶
Noach enumerates three asymmetric states and the emergency probe each triggers:
1. Orchestrator-blind, replicas-see¶
Orchestrator can't see the primary; replicas still confirm the primary is up.
"Emergently probe the replicas to check what they think. Normally each server is probed once in a few seconds, but orchestrator now chooses to probe sooner."
The emergency probe re-checks the replicas, not the primary: the asymmetry comes from the replicas' lag in noticing the failure, so accelerating their probe cycle is what unlocks the decision.
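A minimal sketch of the scenario-1 decision. All names (`probe`, `replica_sees_primary`) are illustrative callables standing in for orchestrator's health checks, not its actual API:

```python
def primary_down_confirmed(primary, replicas, probe, replica_sees_primary):
    """Scenario 1: orchestrator can't see the primary, replicas still can.

    Emergently re-probe the replicas (ahead of the normal cadence) and ask
    each whether it still sees the primary. Only when every reachable
    replica agrees the primary is gone do we declare a confirmed outage.
    """
    if probe(primary):
        return False  # orchestrator can see the primary after all
    # Out-of-cadence probe of the replica population, not the primary.
    views = [replica_sees_primary(r) for r in replicas if probe(r)]
    # All reachable replicas must agree the primary is down.
    return bool(views) and not any(views)
```

The key asymmetry in the sketch: the orchestrator's own view of the primary is necessary but never sufficient; confirmation comes only from the replicas' accelerated probes.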
2. Single-replica-blind, others-see¶
A first-tier replica reports it can't see the primary; other replicas and orchestrator see the primary fine.
"This is still very suspicious, so orchestrator runs an emergency probe on the primary. If that fails, then we're on to something, falling back to the first bullet."
Here the emergency probe is on the primary itself, because one replica's disagreement is weak evidence (it might just be that replica's network). If the primary also fails the emergency probe, orchestrator is now in scenario 1 and falls back.
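The scenario-2 branch can be sketched the same way; again the names are hypothetical, not orchestrator's real interface:

```python
def resolve_single_replica_blind(primary, probe):
    """Scenario 2: one replica can't see the primary; everyone else can.

    One replica's disagreement is weak evidence (possibly its own network),
    so the emergency probe targets the primary itself. A failed probe drops
    us into scenario 1 (orchestrator-blind); a successful one dismisses the
    alarm as replica-local noise.
    """
    if probe(primary):
        return "dismiss"  # primary is fine; suspect the replica's network
    return "fall_back_to_scenario_1"  # now orchestrator is blind too
```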
3. Locked-primary / too-many-connections¶
Orchestrator can't reach the primary, replicas can reach it, but replica lag is growing.
"This may be a limbo scenario caused by either a locked primary, or a 'too many connections' situation. The replicas are likely to be some of the oldest connections to the primary. New connections cannot reach the primary and to the app it seems down, but replicas are still connected."
The emergency probe here is unusual: orchestrator uses the replication-restart trick — kick replication on all replicas, forcing TCP reconnect. If the primary is locked or at its connection limit, the replicas fail to reconnect and the system falls back into scenario 1.
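A sketch of the replication-restart probe, under the assumption that `restart_replication` is a hypothetical callable that kicks a replica's replication connection (in MySQL terms, something along the lines of stopping and restarting the replication I/O thread) and reports whether the fresh TCP connection to the primary succeeded:

```python
def probe_via_replication_restart(replicas, restart_replication):
    """Scenario 3: orchestrator-blind, replicas connected, lag growing.

    Kick replication on every replica, forcing each to tear down its old
    TCP connection to the primary and dial a new one. A locked or
    connection-saturated primary refuses the new connections, at which
    point the replicas lose sight of the primary and the system falls
    back to the plain orchestrator-blind case (scenario 1).
    """
    reconnected = [restart_replication(r) for r in replicas]
    if not any(reconnected):
        return "fall_back_to_scenario_1"  # no replica could re-establish a connection
    return "primary_reachable"  # at least one fresh connection succeeded
```

The trick works because the replicas' existing connections are grandfathered in; only a forced reconnect reveals whether the primary would accept a connection today.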
Why "emergency" vs "just lower the interval"¶
The obvious alternative is to probe everything faster, all the time. Orchestrator doesn't do that — the per-server probe cadence is deliberately conservative ("once in a few seconds") to avoid loading the cluster with health-check traffic. Emergency probes are fired only when asymmetry is detected, keeping steady-state overhead low while preserving the ability to resolve disagreements fast.
The pattern is: low-frequency steady state + event-triggered burst on ambiguity. This is the common shape of the cost-versus-latency tradeoff in monitoring systems (scheduled sampling plus on-demand detail).
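The pattern can be sketched as a tiny scheduler. The interval, class, and method names here are illustrative, not orchestrator's configuration:

```python
import time

class ProbeScheduler:
    """Low-frequency steady state plus an event-triggered burst.

    Each host is probed on a conservative cadence; when an asymmetric
    signal arrives, the next probe for the implicated hosts is pulled
    forward to "now" instead of waiting out the remaining interval.
    """
    def __init__(self, interval_s=5.0, now=time.monotonic):
        self.interval_s = interval_s
        self.now = now
        self.next_due = {}  # host -> next scheduled probe time

    def schedule(self, host):
        # Steady state: next probe a full interval away.
        self.next_due[host] = self.now() + self.interval_s

    def emergency(self, hosts):
        # Burst: the implicated hosts become due immediately.
        for h in hosts:
            self.next_due[h] = self.now()

    def due(self):
        t = self.now()
        return [h for h, when in self.next_due.items() if when <= t]
```

Note that only the implicated hosts are accelerated; everything else keeps its conservative cadence, which is how steady-state overhead stays low.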
Interaction with anti-flapping¶
Emergency probes accelerate failover decisions; concepts/anti-flapping rate-limits them. The two serve opposite purposes but are compatible:
- Emergency probe shrinks the time between outage-onset and first detection.
- Anti-flapping blocks back-to-back failovers once one has been performed.
Together they give a system that detects real failures quickly but won't loop on persistent underlying problems (the classic anti-flapping concern — a timeout-causing request would re-trigger failover on every successor primary).
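The interplay can be reduced to a gate on the recovery side. The block period here is an arbitrary illustrative value, not orchestrator's default:

```python
import time

def failover_allowed(last_failover_at, block_period_s=3600.0, now=time.monotonic):
    """Anti-flapping gate: emergency probes may confirm a failure within
    seconds, but a second failover inside the block period is refused, so
    a persistent trigger (e.g. a timeout-causing request) can't topple
    every successor primary in turn. block_period_s is illustrative."""
    return last_failover_at is None or now() - last_failover_at >= block_period_s
```

Detection speed and recovery pacing are thus tuned independently: the emergency probe never waits on the gate, and the gate never slows detection.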
Seen in¶
- sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings — canonical wiki introduction of the emergency-probe mechanism. Noach canonicalises the three-scenario taxonomy (orchestrator-blind, single-replica-blind, locked-primary-with-growing-lag), the replication-restart response for scenario 3, and the "reduce total outage time" motivation that makes the complexity worth it.
Related¶
- concepts/holistic-failure-detection-via-replicas — the steady-state detection model emergency probes plug into
- concepts/anti-flapping — the opposing mechanism (rate-limit) that constrains failover pacing
- concepts/primary-standby-failover — the operational move that emergency probes speed up
- systems/orchestrator
- systems/vtorc
- systems/mysql
- patterns/replication-restart-as-liveness-probe — the specific emergency probe for scenario 3