CONCEPT Cited by 1 source
Watchdog REPL-channel liveness probe¶
A REPL-channel liveness probe is the technique of instrumenting a long-running process with an in-process control REPL (read-eval-print loop accessible via local socket, admin CLI, or similar), and monitoring the response latency of that REPL externally. If the REPL stops responding, the process is presumed wedged — even if its external traffic still looks superficially healthy.
Why it beats process-state probes¶
pid existence, kill -0, reading /proc/<pid>/status: all of these can look healthy on a deadlocked process. The process is running, its threads are alive, and the event loop is still parked in epoll waiting to accept connections. But no request is making progress.
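The false positive is easy to reproduce. The sketch below is an illustration, not anything from the Fly.io post: it spawns a hypothetical child process that deadlocks itself on a non-reentrant lock, then runs the signal-0 existence check against it. The helper name `pid_alive` and the child script are assumptions made for this example.

```python
import os
import subprocess
import sys
import time

# Hypothetical child: acquires a non-reentrant lock twice, so it blocks forever.
CHILD = "import threading; l = threading.Lock(); l.acquire(); l.acquire()"


def pid_alive(pid: int) -> bool:
    """Equivalent of `kill -0 <pid>`: checks existence only, does no work in the target."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user


child = subprocess.Popen([sys.executable, "-c", CHILD])
time.sleep(0.5)  # give the child time to start and wedge itself

alive_while_deadlocked = pid_alive(child.pid)
still_running = child.poll() is None
print(alive_while_deadlocked, still_running)  # True True

child.kill()
child.wait()
```

Both checks report a healthy process even though the child can never make progress again.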
A REPL-channel probe is work-shaped: the probe requires the process to run a small piece of internal logic (parse a command, produce a response). If any component the REPL needs is deadlocked, the probe fails.
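A minimal sketch of the external half of such a probe, assuming the control REPL listens on a Unix domain socket and answers a trivial command. The socket path, the `ping` command, and the function name `probe_repl` are all hypothetical; the source does not describe fly-proxy's actual control protocol.

```python
import socket
import time
from typing import Optional


def probe_repl(path: str, timeout: float = 1.0,
               command: bytes = b"ping\n") -> Optional[float]:
    """Send one command to the REPL socket and wait for any reply.

    Returns the round-trip latency in seconds, or None if the REPL could not
    be reached or did not answer. The timeout applies to each socket operation
    (connect, send, recv), not to the exchange as a whole.
    """
    start = time.monotonic()
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.connect(path)
            s.sendall(command)
            reply = s.recv(4096)
    except OSError:  # covers timeouts, refused connections, missing socket file
        return None
    if not reply:  # peer closed the connection without answering
        return None
    return time.monotonic() - start
```

Returning the latency rather than a boolean lets the caller choose how tight the threshold should be, separately from the hard timeout.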
Coverage depends on the REPL's design: if the REPL runs on a dedicated thread that doesn't touch the process's main data structures, it will stay responsive even when the main path is wedged, which defeats the purpose. The REPL in Fly.io's fly-proxy is specifically designed to touch the proxy's state, which is why it becomes nonresponsive during a Catalog deadlock.
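The design dependence can be shown with a toy in-process model. Everything here is hypothetical (the `ToyProcess` class, its `state_lock`, and the two handlers are invented for illustration and say nothing about fly-proxy's internals): one REPL handler answers without touching shared state, the other takes the same lock the main path uses.

```python
import queue
import threading


class ToyProcess:
    """Stand-in for a server whose main path is guarded by one lock."""

    def __init__(self) -> None:
        self.state_lock = threading.Lock()
        self.routes = {"app": "10.0.0.1"}

    def repl_isolated(self, _cmd: str) -> str:
        # Touches nothing shared, so it answers even if the main path is wedged.
        return "ok"

    def repl_state_touching(self, _cmd: str) -> str:
        # Takes the same lock as the main path, so a deadlock there blocks it.
        with self.state_lock:
            return f"ok routes={len(self.routes)}"


def probe(handler, timeout: float = 0.2) -> bool:
    """Run the handler on a worker thread; True if it answers within the timeout."""
    out: queue.Queue = queue.Queue()
    threading.Thread(target=lambda: out.put(handler("status")), daemon=True).start()
    try:
        out.get(timeout=timeout)
        return True
    except queue.Empty:
        return False


proc = ToyProcess()
proc.state_lock.acquire()  # simulate a wedged main path that never releases the lock

print(probe(proc.repl_isolated))        # True: the probe is blind to the deadlock
print(probe(proc.repl_state_touching))  # False: the probe fails, as intended
```

Only the state-touching handler turns the wedged lock into a visible probe failure.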
Fly.io's fly-proxy implementation¶
From the 2025-05-28 post:
"fly-proxy has an internal control channel (it drives a REPL operators can run from our servers). During a deadlock (or dead-loop or exhaustion), that channel becomes nonresponsive. We watch for that and bounce the proxy when it happens." (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)
The watchdog sees the REPL timeout, kills the proxy, and snaps a core dump on the way out. The REPL is thus doing triple duty: operator debugging surface + liveness probe + core-dump trigger. See patterns/watchdog-bounce-on-deadlock for the broader pattern.
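The post does not say how the watchdog kills the proxy or captures the dump, so the following is only a sketch under stated assumptions: the watchdog sends SIGABRT (whose default action on Linux is to terminate the process and write a core dump, if core dumps are enabled), a supervisor restarts the process afterwards, and a few consecutive misses are required before acting. The function name `watch_once`, the miss threshold, and the injectable `kill` parameter are hypothetical.

```python
import os
import signal
from typing import Callable


def watch_once(probe: Callable[[], bool], pid: int, misses: int,
               max_misses: int = 3,
               kill: Callable[[int, int], None] = os.kill) -> int:
    """Run one watchdog tick and return the updated consecutive-miss count.

    A successful probe resets the count. After max_misses consecutive failures
    the target is sent SIGABRT (assumption: this yields the core dump) and the
    count resets, on the assumption that a supervisor restarts the process.
    """
    if probe():
        return 0
    misses += 1
    if misses >= max_misses:
        kill(pid, signal.SIGABRT)
        return 0
    return misses
```

A driver would call `watch_once` on a fixed interval, passing in a probe such as a REPL round-trip with a timeout. Requiring several consecutive misses trades a slightly slower bounce for fewer restarts caused by a single slow reply.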
Coverage and gaps¶
- Catches: classical deadlocks, livelocks, resource exhaustion (e.g. FD exhaustion starving the REPL's accept path), busy-loops that starve out the REPL thread.
- Misses: partial degradation where the REPL is fine but N% of requests fail (needs error-rate monitoring); bugs where the REPL only exercises a cold path independent of the failing hot path (so the REPL can't see what's broken); subtle latency degradation (depends on whether the watchdog's timeout is tight).
- Artificial deadlocks (see concepts/bitwise-double-free): caught fine — the symptom from the outside is indistinguishable from a conventional deadlock.
Seen in¶
- sources/2025-05-28-flyio-parking-lot-ffffffffffffffff — Canonical wiki instance. Installed after Fly.io's 2024 global Anycast deadlock to convert future deadlocks from "asystole" (fleet-wide cardiac arrest) to "arrhythmia" (second-or-two hiccup with core-dump collection). Proved load-bearing during the 2025 wave of lockups because Fly was able to recover even though they didn't yet understand the bug.
Related¶
- systems/fly-proxy — Where the REPL-channel watchdog runs.
- patterns/watchdog-bounce-on-deadlock — The pattern this concept grounds.
- concepts/deadlock-vs-lock-contention — The watchdog treats both as failures but can't distinguish them.
- companies/flyio — Fly.io.