PATTERN Cited by 1 source

Watchdog bounce on deadlock

Problem

A concurrency bug (deadlock, livelock, resource exhaustion) in a long-running process can wedge the process indefinitely. In an active-active anycast fleet, the same wedge replicates across every node, causing a correlated fleet-wide outage. You don't know if you can rely on std::sync::Mutex, your lock library, your async runtime, your compiler, or your own code; but you know that a wedged process is useless and a restarted one is useful.

Pattern

Pair every long-running process with:

  1. A liveness probe that demands work from the process (not just kill -0). Common choices: an internal REPL / admin socket that requires executing a small piece of internal logic to respond; an HTTP health endpoint that reads the same data structures the main path does; a dedicated "heartbeat" thread that touches shared state.
  2. An external supervisor (separate process / separate thread outside the locking web) that polls the probe. If the probe stops responding within a timeout, the supervisor kills the process.
  3. A core-dump hook on the kill so the post-mortem data is captured automatically. Linux: ulimit -c unlimited + /proc/sys/kernel/core_pattern, or explicit gcore <pid> by the supervisor before kill -9.
  4. A respawn mechanism (systemd, an orchestrator, or a supervising script) that starts a fresh process after the kill.
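Steps 1 and 2 can be sketched in a few lines. This is a minimal, hypothetical illustration (the `AppState` struct, timings, and probe shape are all invented for the example): the probe "demands work" by taking the same lock the main path takes, so a wedged main path makes the probe time out, and the supervisor's reaction (dump, kill, respawn) is left as a comment.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the process's main shared state.
struct AppState {
    requests_served: u64,
}

/// Liveness probe: demand real work by acquiring the same lock the
/// main path uses. If the main path is wedged holding that lock,
/// the probe cannot complete within the deadline.
fn probe(state: &Arc<Mutex<AppState>>, deadline: Duration) -> bool {
    let start = Instant::now();
    while start.elapsed() < deadline {
        match state.try_lock() {
            Ok(guard) => {
                let _ = guard.requests_served; // touched the real data
                return true;
            }
            Err(_) => thread::sleep(Duration::from_millis(10)),
        }
    }
    false // no progress within the deadline: treat as wedged
}

fn main() {
    let state = Arc::new(Mutex::new(AppState { requests_served: 0 }));

    // Healthy process: the probe succeeds quickly.
    assert!(probe(&state, Duration::from_millis(200)));

    // Simulate a wedge: another thread takes the lock and never releases it.
    let wedger = Arc::clone(&state);
    thread::spawn(move || {
        let _guard = wedger.lock().unwrap();
        thread::sleep(Duration::from_secs(3600));
    });
    thread::sleep(Duration::from_millis(50)); // let the wedge take hold

    if !probe(&state, Duration::from_millis(200)) {
        // A real supervisor would run here: gcore <pid>, kill -9, respawn.
        println!("wedged: bounce");
    }
}
```

In production the supervisor runs in a separate process (outside the locking web), but the probe logic is the same: success means "executed real work," not merely "process exists."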

Why it works

  • Converts a fleet-wide outage into a few-second hiccup per wedged node. If the bug doesn't recur on every run, healthy nodes absorb the load.
  • Each bounce produces a core dump, so the bug's evidence accumulates over time, making root-causing eventually tractable.
  • Doesn't require the supervisor to understand the bug — just to detect the "no progress" signature.
  • Non-destructive: wedged processes can't do useful work anyway, so killing them costs only connection termination.

Caveats

  • The probe must exercise the process's main data structures (or a stand-in that blocks whenever the main path does). Otherwise the probe thread can stay responsive while the main path is wedged, producing false negatives.
  • The supervisor's timeout must account for legitimate slow paths (GC pauses, batch work, startup) — too tight and you bounce healthy processes; too loose and deadlocks persist longer than needed.
  • A watchdog is a safety net, not a fix — production concurrency bugs still need to be root-caused. See concepts/descent-into-madness-debugging for the debugging phase a watchdog buys you time to live through.
  • The pattern masks the distinction between deadlock and severe lock contention (see concepts/deadlock-vs-lock-contention); both appear as watchdog bounces. Use patterns/lock-timeout-for-contention-telemetry to discriminate.
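The last caveat can be made concrete with a bounded lock acquisition that records how long the lock took, a simplified sketch of the lock-timeout-for-contention-telemetry idea (the `timed_acquire` helper and its thresholds are invented for this example): contention shows up as a long-but-finite wait you can log, while a true deadlock never acquires and always hits the cap.

```rust
use std::sync::{Arc, Mutex, TryLockError};
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of a bounded lock acquisition, carrying wait-time telemetry.
enum LockVerdict {
    Acquired(Duration), // contention: waited this long, then got the lock
    TimedOut(Duration), // deadlock signature: never acquired within the cap
}

/// Retry the lock until it is acquired or the cap expires, measuring the wait.
fn timed_acquire(m: &Mutex<u64>, cap: Duration) -> LockVerdict {
    let start = Instant::now();
    loop {
        match m.try_lock() {
            Ok(_guard) => return LockVerdict::Acquired(start.elapsed()),
            Err(TryLockError::WouldBlock) => {
                if start.elapsed() >= cap {
                    return LockVerdict::TimedOut(start.elapsed());
                }
                thread::sleep(Duration::from_millis(5));
            }
            Err(TryLockError::Poisoned(_)) => panic!("lock poisoned"),
        }
    }
}

fn main() {
    let m = Arc::new(Mutex::new(0u64));

    // Contention: a thread holds the lock briefly, then releases it.
    let holder = Arc::clone(&m);
    thread::spawn(move || {
        let _g = holder.lock().unwrap();
        thread::sleep(Duration::from_millis(100));
    });
    thread::sleep(Duration::from_millis(10));

    match timed_acquire(&m, Duration::from_secs(1)) {
        LockVerdict::Acquired(wait) => println!("contention: waited {:?}", wait),
        LockVerdict::TimedOut(wait) => println!("deadlock-like: gave up after {:?}", wait),
    }
}
```

Logging the `Acquired` wait-time distribution alongside watchdog bounces lets an operator tell "locks are slow" apart from "locks never release" without needing the watchdog itself to know the difference.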

Canonical instance — Fly.io's fly-proxy watchdog

Installed after Fly.io's 2024 global Anycast deadlock caused by an if let over a read lock:

"In the short term: we made deadlocks nonlethal with a 'watchdog' system. fly-proxy has an internal control channel (it drives a REPL operators can run from our servers). During a deadlock (or dead-loop or exhaustion), that channel becomes nonresponsive. We watch for that and bounce the proxy when it happens. A deadlock is still bad! But it's a second-or-two-length arrhythmia, not asystole." (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)

The REPL channel is the probe (see concepts/watchdog-repl-channel); the bounce is the recovery; the core-dump collection on bounce was load-bearing for the 2025 bug — Pavel Zakharov's core-dump inspection was how Fly eventually got the evidence that forced them off both the slow-reader and pure-deadlock theories and toward the bitwise double-free root cause.

Seen in

  • sources/2025-05-28-flyio-parking-lot-ffffffffffffffff — Canonical wiki instance. Both the short-term safety-net intent (make deadlocks nonlethal) and the long-term debugging dividend (core dumps became the evidence substrate) are documented. Post explicitly observes "We are at this moment very happy we did the watchdog thing."