Skip to content

CONCEPT Cited by 1 source

Watchdog REPL-channel liveness probe

A REPL-channel liveness probe is the technique of instrumenting a long-running process with an in-process control REPL (read-eval-print loop accessible via local socket, admin CLI, or similar), and monitoring the response latency of that REPL externally. If the REPL stops responding, the process is presumed wedged — even if its external traffic still looks superficially healthy.

Why it beats process-state probes

pid existence, kill -0, /proc/<pid>/status reading — all of these can look healthy on a deadlocked process. The process is running, the threads are alive, epoll is sitting on an accept call. But no request is making progress.

A REPL-channel probe is work-shaped: the probe requires the process to run a small piece of internal logic (parse a command, produce a response). If any component the REPL needs is deadlocked, the probe fails.

Coverage depends on the REPL's design: if the REPL runs on a dedicated thread that doesn't touch the process's main data structures, it will stay responsive even when the main path is wedged, which defeats the purpose. Fly.io's fly-proxy's REPL is specifically designed to touch the proxy's state — which is why it becomes nonresponsive during a Catalog deadlock.

Fly.io's fly-proxy implementation

From the 2025-05-28 post:

"fly-proxy has an internal control channel (it drives a REPL operators can run from our servers). During a deadlock (or dead-loop or exhaustion), that channel becomes nonresponsive. We watch for that and bounce the proxy when it happens." (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)

The watchdog sees the REPL timeout, kills the proxy, and snaps a core dump on the way out. The REPL is thus doing triple duty: operator debugging surface + liveness probe + core-dump trigger. See patterns/watchdog-bounce-on-deadlock for the broader pattern.

Coverage and gaps

  • Catches: classical deadlocks, livelocks, resource exhaustion (e.g. FD exhaustion starving the REPL's accept path), busy-loops that starve out the REPL thread.
  • Misses: partial degradation where the REPL is fine but N% of requests fail (needs error-rate monitoring); bugs where the REPL is touched by a cold path independent of the failing hot path (so the REPL can't see what's broken); subtle latency degradation (depends on whether the watchdog's timeout is tight).
  • Artificial deadlocks (see concepts/bitwise-double-free): caught fine — the symptom from the outside is indistinguishable from a conventional deadlock.

Seen in

  • sources/2025-05-28-flyio-parking-lot-ffffffffffffffff — Canonical wiki instance. Installed after Fly.io's 2024 global Anycast deadlock to convert future deadlocks from "asystole" (fleet-wide cardiac arrest) to "arrhythmia" (second-or-two hiccup with core-dump collection). Proved load-bearing during the 2025 wave of lockups because Fly was able to recover even though they didn't yet understand the bug.
Last updated · 200 distilled / 1,178 read