PATTERN Cited by 1 source

Watchdog bounce on deadlock

Problem

A concurrency bug (deadlock, livelock, resource exhaustion) in a long-running process can wedge the process indefinitely. In an active-active anycast fleet, the same wedge replicates across every node, causing a correlated fleet-wide outage. You don't know if you can rely on std::sync::Mutex, your lock library, your async runtime, your compiler, or your own code; but you know that a wedged process is useless and a restarted one is useful.

Pattern

Pair every long-running process with:

  1. A liveness probe that demands work from the process (not just kill -0). Common choices: an internal REPL / admin socket that requires executing a small piece of internal logic to respond; an HTTP health endpoint that reads the same data structures the main path does; a dedicated "heartbeat" thread that touches shared state.
  2. An external supervisor (separate process / separate thread outside the locking web) that polls the probe. If the probe stops responding within a timeout, the supervisor kills the process.
  3. A core-dump hook on the kill so the post-mortem data is captured automatically. Linux: ulimit -c unlimited + /proc/sys/kernel/core_pattern, or explicit gcore <pid> by the supervisor before kill -9.
  4. A respawn mechanism (systemd, an orchestrator, or a supervising script) that starts a fresh process after the kill.
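Steps 1 and 2 can be sketched in a few lines. This is a minimal, hypothetical illustration (the `AppState` struct, timings, and probe shape are all invented for the example): the probe "demands work" by taking the same lock the main path takes, so a wedged main path makes the probe time out, and the supervisor's reaction (dump, kill, respawn) is left as a comment.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the process's main shared state.
struct AppState {
    requests_served: u64,
}

/// Liveness probe: demand real work by acquiring the same lock the
/// main path uses. If the main path is wedged holding that lock,
/// the probe cannot complete within the deadline.
fn probe(state: &Arc<Mutex<AppState>>, deadline: Duration) -> bool {
    let start = Instant::now();
    while start.elapsed() < deadline {
        match state.try_lock() {
            Ok(guard) => {
                let _ = guard.requests_served; // touched the real data
                return true;
            }
            Err(_) => thread::sleep(Duration::from_millis(10)),
        }
    }
    false // no progress within the deadline: treat as wedged
}

fn main() {
    let state = Arc::new(Mutex::new(AppState { requests_served: 0 }));

    // Healthy process: the probe succeeds quickly.
    assert!(probe(&state, Duration::from_millis(200)));

    // Simulate a wedge: another thread takes the lock and never releases it.
    let wedger = Arc::clone(&state);
    thread::spawn(move || {
        let _guard = wedger.lock().unwrap();
        thread::sleep(Duration::from_secs(3600));
    });
    thread::sleep(Duration::from_millis(50)); // let the wedge take hold

    if !probe(&state, Duration::from_millis(200)) {
        // A real supervisor would run here: gcore <pid>, kill -9, respawn.
        println!("wedged: bounce");
    }
}
```

In production the supervisor runs in a separate process (outside the locking web), but the probe logic is the same: success means "executed real work," not merely "process exists."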

Why it works

  • Converts a fleet-wide outage into a few-second hiccup per wedged node. If the bug doesn't recur on every run, healthy nodes absorb the load.
  • Each bounce produces a core dump, so the bug's evidence accumulates over time, making root-causing eventually tractable.
  • Doesn't require the supervisor to understand the bug — just to detect the "no progress" signature.
  • Non-destructive: wedged processes can't do useful work anyway, so killing them costs only connection termination.

Caveats

  • The probe must exercise the process's main data structures (or a stand-in that blocks whenever the main path does). Otherwise the probe thread can stay responsive while the main path is wedged, producing false negatives.
  • The supervisor's timeout must account for legitimate slow paths (GC pauses, batch work, startup) — too tight and you bounce healthy processes; too loose and deadlocks persist longer than needed.
  • A watchdog is a safety net, not a fix — production concurrency bugs still need to be root-caused. See concepts/descent-into-madness-debugging for the debugging phase a watchdog buys you time to live through.
  • The pattern masks the distinction between deadlock and severe lock contention (see concepts/deadlock-vs-lock-contention); both appear as watchdog bounces. Use patterns/lock-timeout-for-contention-telemetry to discriminate.
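The last caveat can be made concrete with a bounded lock acquisition that records how long the lock took, a simplified sketch of the lock-timeout-for-contention-telemetry idea (the `timed_acquire` helper and its thresholds are invented for this example): contention shows up as a long-but-finite wait you can log, while a true deadlock never acquires and always hits the cap.

```rust
use std::sync::{Arc, Mutex, TryLockError};
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of a bounded lock acquisition, carrying wait-time telemetry.
enum LockVerdict {
    Acquired(Duration), // contention: waited this long, then got the lock
    TimedOut(Duration), // deadlock signature: never acquired within the cap
}

/// Retry the lock until it is acquired or the cap expires, measuring the wait.
fn timed_acquire(m: &Mutex<u64>, cap: Duration) -> LockVerdict {
    let start = Instant::now();
    loop {
        match m.try_lock() {
            Ok(_guard) => return LockVerdict::Acquired(start.elapsed()),
            Err(TryLockError::WouldBlock) => {
                if start.elapsed() >= cap {
                    return LockVerdict::TimedOut(start.elapsed());
                }
                thread::sleep(Duration::from_millis(5));
            }
            Err(TryLockError::Poisoned(_)) => panic!("lock poisoned"),
        }
    }
}

fn main() {
    let m = Arc::new(Mutex::new(0u64));

    // Contention: a thread holds the lock briefly, then releases it.
    let holder = Arc::clone(&m);
    thread::spawn(move || {
        let _g = holder.lock().unwrap();
        thread::sleep(Duration::from_millis(100));
    });
    thread::sleep(Duration::from_millis(10));

    match timed_acquire(&m, Duration::from_secs(1)) {
        LockVerdict::Acquired(wait) => println!("contention: waited {:?}", wait),
        LockVerdict::TimedOut(wait) => println!("deadlock-like: gave up after {:?}", wait),
    }
}
```

Logging the `Acquired` wait-time distribution alongside watchdog bounces lets an operator tell "locks are slow" apart from "locks never release" without needing the watchdog itself to know the difference.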

Canonical instance — Fly.io's fly-proxy watchdog

Installed after Fly.io's 2024 global Anycast deadlock caused by an if let over a read lock:

"In the short term: we made deadlocks nonlethal with a 'watchdog' system. fly-proxy has an internal control channel (it drives a REPL operators can run from our servers). During a deadlock (or dead-loop or exhaustion), that channel becomes nonresponsive. We watch for that and bounce the proxy when it happens. A deadlock is still bad! But it's a second-or-two-length arrhythmia, not asystole." (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)

The REPL channel is the probe (see concepts/watchdog-repl-channel); the bounce is the recovery; the core-dump collection on bounce was load-bearing for the 2025 bug — Pavel Zakharov's core-dump inspection was how Fly eventually got the evidence that forced them off both the slow-reader and pure-deadlock theories and toward the bitwise double-free root cause.

Seen in

  • sources/2025-05-28-flyio-parking-lot-ffffffffffffffff — Canonical wiki instance. Both the short-term safety-net intent (make deadlocks nonlethal) and the long-term debugging dividend (core dumps became the evidence substrate) are documented. Post explicitly observes "We are at this moment very happy we did the watchdog thing."