PATTERN Cited by 1 source
Watchdog bounce on deadlock¶
Problem¶
A concurrency bug (deadlock, livelock, resource exhaustion) in
a long-running process can wedge the process indefinitely. In
an active-active anycast fleet, the same wedge replicates
across every node, causing a correlated fleet-wide outage. You
don't know if you can rely on std::sync::Mutex, your lock
library, your async runtime, your compiler, or your own code;
but you know that a wedged process is useless and a restarted
one is useful.
Pattern¶
Pair every long-running process with:
- A liveness probe that demands work from the process (not just kill -0). Common choices: an internal REPL / admin socket that requires executing a small piece of internal logic to respond; an HTTP health endpoint that reads the same data structures the main path does; a dedicated "heartbeat" thread that touches shared state.
- An external supervisor (a separate process, or a separate thread outside the locking web) that polls the probe. If the probe fails to respond within a timeout, the supervisor kills the process.
- A core-dump hook on the kill so the post-mortem data is captured automatically. Linux: ulimit -c unlimited plus /proc/sys/kernel/core_pattern when the kill uses a core-generating signal such as SIGABRT (SIGKILL itself never produces a core), or an explicit gcore <pid> by the supervisor before kill -9.
- A respawn mechanism (systemd, an orchestrator, or a supervising script) that starts a fresh process after the kill.
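One watchdog cycle over the four pieces above might look like the following. This is a minimal sketch under stated assumptions, not any particular supervisor's implementation: `supervise_once`, `dump_core_with_gcore`, and the injected `probe` and `respawn` callables are hypothetical names introduced for illustration, and the dump helper assumes gdb's `gcore` utility is installed and that `/tmp` is an acceptable place for dumps.

```python
import subprocess
from typing import Callable


def dump_core_with_gcore(pid: int) -> None:
    """Best-effort core capture. Assumes gdb's `gcore` is on PATH."""
    try:
        # `gcore -o PREFIX PID` writes the dump to PREFIX.PID
        subprocess.run(["gcore", "-o", "/tmp/core", str(pid)],
                       check=False, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        pass  # a failed dump must never block the bounce itself


def supervise_once(
    proc: subprocess.Popen,
    probe: Callable[[float], bool],      # hypothetical: True if the probe answered in time
    timeout_s: float,
    dump_core: Callable[[int], None],
    respawn: Callable[[], subprocess.Popen],
) -> subprocess.Popen:
    """Run one watchdog cycle and return the process to supervise next."""
    if proc.poll() is None and probe(timeout_s):
        return proc                      # alive and responsive: leave it alone
    if proc.poll() is None:
        dump_core(proc.pid)              # capture post-mortem state before the kill
        proc.kill()                      # SIGKILL on POSIX; a wedged process may ignore SIGTERM
    proc.wait()                          # reap the child so it does not linger as a zombie
    return respawn()                     # hand back a fresh process
```

A driver would call `supervise_once` in a loop with a sleep between cycles. Under systemd or an orchestrator the `respawn` step is usually delegated to the service manager rather than done by the watchdog itself.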
Why it works¶
- Converts a fleet-wide outage into a few-second hiccup per wedged node. If the bug doesn't recur on every run, healthy nodes absorb the load.
- Each bounce produces a core dump, so the bug's evidence accumulates over time, making root-causing eventually tractable.
- Doesn't require the supervisor to understand the bug — just to detect the "no progress" signature.
- Non-destructive: wedged processes can't do useful work anyway, so killing them costs only connection termination.
Caveats¶
- The probe must touch the process's main data structures (or a stand-in that blocks when the main path does). Otherwise a probe thread that stays responsive while the main thread is wedged reports the process as healthy (a false negative), and the watchdog never fires.
- The supervisor's timeout must account for legitimate slow paths (GC pauses, batch work, startup) — too tight and you bounce healthy processes; too loose and deadlocks persist longer than needed.
- A watchdog is a safety net, not a fix — production concurrency bugs still need to be root-caused. See concepts/descent-into-madness-debugging for the debugging phase a watchdog buys you time to live through.
- The pattern masks the distinction between deadlock and severe lock contention (see concepts/deadlock-vs-lock-contention); both appear as watchdog bounces. Use patterns/lock-timeout-for-contention-telemetry to discriminate.
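The first caveat can be illustrated in-process. The sketch below assumes a single lock guards the main path's state. `state_lock`, `routes`, `naive_probe`, and `state_touching_probe` are hypothetical names, and a real proxy would have many locks and a richer probe.

```python
import threading

state_lock = threading.Lock()        # stand-in for the lock the main path takes
routes = {"example": "10.0.0.1"}     # stand-in for the main data structure


def naive_probe() -> bool:
    """Answers without touching shared state, so a wedge goes unnoticed."""
    return True


def state_touching_probe(timeout_s: float) -> bool:
    """Report healthy only if the main path's lock can be taken in time."""
    if not state_lock.acquire(timeout=timeout_s):
        return False                 # lock unavailable: deadlock or severe contention
    try:
        _ = len(routes)              # read the same structure the main path reads
        return True
    finally:
        state_lock.release()
```

While something holds `state_lock` forever, `naive_probe()` keeps returning `True` and `state_touching_probe(0.05)` returns `False`. The second probe still cannot tell a true deadlock from a long hold, which is the ambiguity the last caveat describes.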
Canonical instance — Fly.io's fly-proxy watchdog¶
Installed after Fly.io's 2024 global Anycast deadlock caused by an if let over a read lock:
"In the short term: we made deadlocks nonlethal with a 'watchdog' system. fly-proxy has an internal control channel (it drives a REPL operators can run from our servers). During a deadlock (or dead-loop or exhaustion), that channel becomes nonresponsive. We watch for that and bounce the proxy when it happens. A deadlock is still bad! But it's a second-or-two-length arrhythmia, not asystole." (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)
The REPL channel is the probe (see concepts/watchdog-repl-channel) and the bounce is the recovery. The core-dump collection on bounce was load-bearing for the 2025 bug: Pavel Zakharov's core-dump inspection was how Fly eventually got the evidence that forced them off both the slow-reader and pure-deadlock theories and toward the bitwise double-free root cause.
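A supervisor-side check for a nonresponsive control channel could be sketched as below. The source does not describe fly-proxy's actual channel, so everything here is an assumption for illustration: the TCP transport, the `ping` request, and the `probe_control_channel` name are all hypothetical.

```python
import socket


def probe_control_channel(host: str, port: int, timeout_s: float) -> bool:
    """Return True only if the channel answers a ping in time.

    timeout_s applies to each socket operation (connect, send, recv),
    so the whole probe can take somewhat longer than timeout_s.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(b"ping\n")          # hypothetical request format
            return conn.recv(64) != b""      # an empty read means the peer closed
    except OSError:
        return False  # refused, timed out, or reset: treat as nonresponsive
```

The probe treats every failure mode the same way, because the supervisor only needs the "no progress" signal and not its cause.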
Seen in¶
- sources/2025-05-28-flyio-parking-lot-ffffffffffffffff — Canonical wiki instance. Both the short-term safety-net intent (make deadlocks nonlethal) and the long-term debugging dividend (core dumps became the evidence substrate) are documented. Post explicitly observes "We are at this moment very happy we did the watchdog thing."
Related¶
- systems/fly-proxy — The system with the watchdog installed.
- systems/parking-lot-rust — The library whose bug made the watchdog a load-bearing asset in 2025.
- concepts/watchdog-repl-channel — The liveness-probe technique this pattern uses.
- concepts/deadlock-vs-lock-contention — Why the bounce signal is ambiguous about root cause.
- patterns/lock-timeout-for-contention-telemetry — The complementary pattern that discriminates bounce causes.
- companies/flyio — Fly.io.