Fly.io — parking_lot: ffffffffffffffff…
Summary
A Fly.io long-form debugging retrospective (2025-05-28, Thomas
Ptacek, Tier 3) on a weeks-long hunt for why proxies in European
regions — especially WAW (Warsaw) — were locking up after Fly
started broadening lazy-loading in fly-proxy's
Catalog. After auditing every if let-over-lock scope bug,
replacing
RAII lock guards with explicit closures for visibility,
instrumenting lock timings, enabling
parking_lot's deadlock detector, inspecting core dumps under
gdb, running under miri, setting up guard pages, and flailing
at read_recursive out of
desperation — Fly finally discovered a bitwise double-free bug
in parking_lot's RwLock::try_write_for wake-up path that
corrupted the 64-bit lock word to 0xFFFFFFFFFFFFFFFF under a
specific reader-release + writer-timeout timing race. The
signaling bits stayed set and the 60-bit reader counter was
maxed out, producing an "artificial deadlock" where every thread
waits for a lock no thread holds.
The fix landed upstream as parking_lot PR #466,
a canonical patterns/upstream-the-fix instance. The core
narrative shape is "lazy-loading refactor exposes writer
contention, contention reveals an upstream bug, upstream bug
proves the watchdog was
the right move".
Key takeaways
- Anycast routing protocol context: fly-proxy is Fly's Rust-written Anycast router. It manages millions of connections for millions of apps across 30+ regions. The hard part isn't proxying — it's state distribution: knowing where a Fly Machine is (potentially starting in <1 s, terminating instantly) so traffic lands on the right worker. "It's a lot of state to manage, and it's in constant flux. We refer to this as the 'state distribution problem', but really, it quacks like a routing protocol."
- Corrosion is the RIB; Catalog is the FIB. [[systems/corrosion-swim|Corrosion2]] is the globally replicated CRDT-SQLite SWIM-gossip system of record for routing information. fly-proxy keeps an in-memory aggregation called the Catalog — "a record of everything in Corrosion a proxy might need to know about to forward requests" — for fast decisions. "In somewhat the same sense as a router works both with a RIB and a FIB, there is in fly-proxy a system of record for routing information (Corrosion), and then an in-memory aggregation of that information used to make fast decisions."
- Last year's outage, Round 0: a bug in an if let over self.load.read().get() caused the read-lock to be held across both the if arm and the else arm — "you can think of if let expressions as being rewritten to the equivalent match expression, where that lifespan is much clearer". A Corrosion update about an unused app propagated fleet-wide in ms and deadlocked the entire Anycast routing layer — global consensus that Anycast should be down. Canonical if-let lock-scope bug.
- Watchdog + REPL as safety net. Post-outage, Fly made deadlocks "nonlethal" with a watchdog system that monitors fly-proxy's internal REPL control channel. When the channel becomes nonresponsive (deadlock / dead-loop / exhaustion), the watchdog bounces the proxy. "A deadlock is still bad! But it's a second-or-two-length arrhythmia, not asystole." Canonical watchdog-bounce-on-deadlock + REPL-channel liveness probe instance. Also snaps core dumps on kill, which later proved load-bearing.
- Regionalization as the long-term fix. The outage-causing update pertained to an app nobody used — "there wasn't any real reason for any fly-proxy to receive it in the first place". Fly is mid-migration to regionalize routing state so most updates stay within the region (Sydney, Frankfurt, Dallas) where they originate. "It's a huge lift. It's a lift we're still making!" Lazy-loading the Catalog is a key step.
- Round 1 — lazy-loading triggers lockups, but not the outage-style kind. Broadening lazy-loading changes the read/write pattern on the Catalog RWLocks. Proxies start locking up, the watchdog bounces them, and the change is rolled back. Two suspects: lock contention from the new write pressure, and a suspicious new if let. Canonical example of deadlock-vs-contention confusion — they look identical via the watchdog signal.
- Round 2 — lock refactor for visibility + timeouts. Fly (a) eliminated the if let, (b) switched every Catalog write lock from RAII-style lock acquisition to explicit closures — "you can look at the code and see precisely the interval in which the lock is held" — (c) used parking_lot's try_write_for with a Duration timeout so blocked writes fail and emit telemetry rather than hang, and (d) instrumented with labeled logs + metrics. Canonical lock-timeout-for-contention-telemetry instance. The proxies still lock up, especially in WAW.
- Round 3 — instrumentation returns nonsense. Lock-timeout logs spam just before the watchdog bounce. parking_lot's deadlock detector (runs on its own thread tracking a waiting-for dependency graph) detects nothing. Slow locks appear — but only right before the freeze, in benign quiet applications.
- Round 4 — core dumps defy theory. Pavel Zakharov, reading the WAW cores: "First, there is no thread that's running inside the critical section. Yet, there is a thread that's waiting to acquire write lock and a bunch of threads waiting to acquire a read lock." Every single stack trace says the same thing: everything wants the Catalog lock; nobody has it. This breaks both the slow-reader and the missed-deadlock theories. Canonical descent-into-madness moment — the bug is incompatible with every working model.
- Round 5 — read_recursive as a desperation probe.
parking_lot's writer-preference logic prevents new
readers from acquiring while a writer is waiting, to avoid
writer-starvation. read_recursive
sidesteps that logic and lets re-entering readers grab the
lock regardless. Not the right primitive for this code, but
a probe: if there's some slow reader poisoning the
lock, maybe read_recursive cuts through. Canonical
read-recursive-as-desperation-probe instance. It doesn't
fix anything, but it produces new evidence —
RwLock reader count overflow messages start appearing in
logs. Lots of them.
- The bitwise double-free in parking_lot. parking_lot's
RWLock state is a single 64-bit word — 4 signaling bits
(PARKED, WRITER_PARKED, WRITER, UPGRADEABLE) and a
60-bit reader counter. Clearing specific bits atomically is
implemented by adding the two's-complement inverse of
those bits to the word (a
self-synchronizing
atomic state update): if the bits you expect to be set
really are set, the add zeroes them; if they're not, you've
added a very large value to an uncontrolled word.
- Bits are 0b1010 (WRITER | WRITER_PARKED); prev_value
is 0, so prev_value.wrapping_sub(0b1010) = 0xFFFFFFFFFFFFFFF6 and the
fetch_add cancels the word to 0. Works.
- Bits are 0b1000 (only WRITER, because WRITER_PARKED
already got cleared by someone else): the add doesn't cancel
and the state becomes 0xFFFFFFFFFFFFFFFE. The reader counter is
saturated (can't decrement) and three of the four signal bits are set.
- "As a grace note, the locking code manages to set that
one last unset bit before the proxy deadlocks."
Canonical [bitwise double-free
on a tightly-encoded state word](<../concepts/bitwise-double-free.md>) instance.
- The race sequence:
  1. Thread 1 grabs a read lock.
  2. Thread 2 tries to grab a write lock with a try_write_for timeout; it parks, setting WRITER + WRITER_PARKED.
  3. Thread 1 releases its read lock, unparking the waiter, which unsets WRITER_PARKED (as part of wake).
  4. Thread 2 wakes — "not for the reason it thinks" — the timing looks to Thread 2 like a timeout, so it tries to clear both WRITER and WRITER_PARKED. WRITER_PARKED was already unset by step 3. Bitwise double-free. Lock word → 0xFFFFFFFFFFFFFFFE → 0xFFFFFFFFFFFFFFFF once any thread sets the last bit. Catalog is eternally "owned by all of them and none of them". Everything waits forever; watchdog kills proxy.
- Fix: parking_lot PR #466 — the writer bit is cleared separately in the same wakeup queue as the reader bit so the two clear operations can't race. "The fix is deployed, the lockups never recur." Canonical patterns/upstream-the-fix instance — Fly.io's fifth since 2024 after the V8 / Go-arm64 / rustls / Web Streams cases.
- The WAW mystery remains unresolved: why overwhelmingly in Warsaw? "Some kind of crazy regional timing thing? Something to do with the Polish kreska diacritic that makes L's sound like W's? The wax and wane of caribou populations? … We'll never know because we fixed the bug." — an honest admission that scale-surfaced race conditions often have regional correlations nobody ever fully explains.
- Debugging arc gifts that outlast the bug: (a) all if let-over-locks audited out; (b) RAII lock guards replaced fleet-wide with explicit closures, giving lock timing metrics; (c) labeled logs on slow writes with context data (app IDs); (d) last + current holder of the write lock tracked with context — "next time we have a deadlock, we should have all the information we need to identify the actors without gdb stack traces."
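
The add-the-inverse bit clearing and its failure mode, described in the bitwise double-free takeaway above, can be reproduced with plain integer arithmetic. This is a minimal sketch and not parking_lot's actual code: it uses a std AtomicU64 as the lock word, the constant and function names are illustrative, and only the two bit values quoted above (0b1000 for WRITER, 0b0010 for WRITER_PARKED) come from the source.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Bit values as quoted in the takeaway above; names are illustrative.
const WRITER_PARKED: u64 = 0b0010;
const WRITER: u64 = 0b1000;

/// Clears WRITER | WRITER_PARKED by adding their two's-complement inverse.
/// This is only correct when both bits are actually set in `state`.
/// Returns the lock word as it stands after the add.
fn clear_writer_bits(state: &AtomicU64) -> u64 {
    let prev_value: u64 = 0;
    let delta = prev_value.wrapping_sub(WRITER | WRITER_PARKED);
    // fetch_add wraps on overflow and returns the old value, so the new
    // value is recomputed here with the same wrapping add.
    state.fetch_add(delta, Ordering::Relaxed).wrapping_add(delta)
}
```

With both bits set the add cancels to zero. With WRITER_PARKED already cleared by another thread, the same add leaves 0xFFFFFFFFFFFFFFFE in the word, which is the corrupt state the core dumps showed one bit short of.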
Systems
- systems/fly-proxy — Fly.io's Rust Anycast router; the system exhibiting the lockups; owns the Catalog RWLock refactor.
- systems/corrosion-swim — System-of-record for routing info; Fly's SWIM-gossip CRDT-SQLite database. Its updates are what fly-proxy's Catalog aggregates.
- systems/parking-lot-rust — NEW: Amanieu's well-regarded replacement for std::sync's locks. Source of the bug. 64-bit compact lock representation + try_write_for + deadlock detector + read_recursive.
- systems/tokio — fly-proxy is a Tokio application; the post briefly considers an async-lock-in-sync-context theory before rejecting it. "parking_lot locks are synchronous, but we're a Tokio application; something somewhere could be taking an async lock that's confusing the runtime. Alas, no."
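
The compact 64-bit lock representation noted for parking_lot above can be made concrete with a small decoder. This is a sketch under stated assumptions: the WRITER and WRITER_PARKED positions follow the values quoted on this page, while the PARKED and UPGRADEABLE positions, the LockWord struct, and the decode function are assumptions introduced for illustration.

```rust
// WRITER and WRITER_PARKED match the values quoted on this page.
// PARKED and UPGRADEABLE positions are assumed for illustration.
const PARKED: u64 = 0b0001;
const WRITER_PARKED: u64 = 0b0010;
const UPGRADEABLE: u64 = 0b0100;
const WRITER: u64 = 0b1000;
// The reader counter occupies the remaining 60 bits above the 4 flags.
const FLAG_BITS: u32 = 4;

/// Hypothetical decoded view of the lock word.
#[derive(Debug, PartialEq)]
struct LockWord {
    parked: bool,
    writer_parked: bool,
    upgradeable: bool,
    writer: bool,
    readers: u64,
}

/// Splits a raw lock word into its four flags and its reader count.
fn decode(word: u64) -> LockWord {
    LockWord {
        parked: word & PARKED != 0,
        writer_parked: word & WRITER_PARKED != 0,
        upgradeable: word & UPGRADEABLE != 0,
        writer: word & WRITER != 0,
        readers: word >> FLAG_BITS,
    }
}
```

Decoding 0xFFFFFFFFFFFFFFFF under this layout yields every flag set and a reader count of 2^60 - 1, which is why the lock looks held by everyone while no thread is inside the critical section.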
Concepts
- concepts/anycast — Fly.io runs anycast via fly-proxy; the 2024 outage's blast radius was global because routing state has a global broadcast domain (the motivating force behind regionalization).
- concepts/deadlock-vs-lock-contention — NEW: they present identically under a watchdog-bounce signal; the refactor to add try_write_for is specifically how Fly separated them.
- concepts/if-let-lock-scope-bug — NEW: the previous year's outage; an if let holding a read lock across both the if arm and the else arm.
- concepts/bitwise-double-free — NEW: double-free on the bits of a tightly-packed state word rather than on a memory allocation. The root cause of this post.
- concepts/lock-state-self-synchronizing — NEW: clearing specific bits of a word atomically by adding the inverse — elegant when the invariant holds, catastrophic when it doesn't.
- concepts/watchdog-repl-channel — NEW: liveness probe via an internal REPL control channel; if it stops responding, the process is presumed wedged.
- concepts/descent-into-madness-debugging — NEW: when every working model is incompatible with the evidence, and you're now suspecting the compiler. Named after Ptacek's "Descent Into Madness" section header.
- concepts/read-recursive-lock — NEW: re-entrant read lock that sidesteps writer-preference starvation avoidance. Used here as a probe, not a fix.
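
The if-let lock-scope concept above is easiest to see in the equivalent match form that the post's quote refers to. The sketch below is illustrative only: the Catalog struct, its load field, and both method names are hypothetical, and it uses std::sync::RwLock with try_write so the held lock shows up as a failed acquisition instead of a hang. A match scrutinee keeps its temporaries alive through every arm in all Rust editions; the 2024 edition narrowed this scope for if let specifically.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical stand-in for the proxy's Catalog.
struct Catalog {
    load: RwLock<HashMap<String, u32>>,
}

impl Catalog {
    fn new() -> Self {
        Catalog { load: RwLock::new(HashMap::new()) }
    }

    /// Buggy shape: the read guard is a temporary in the scrutinee, so it
    /// stays alive through every arm. Returns whether the miss arm could
    /// take the write lock. A blocking write() here would never return.
    fn miss_arm_can_write_buggy(&self, app: &str) -> bool {
        let can_write = match self.load.read().unwrap().get(app) {
            Some(_) => true,
            None => self.load.try_write().is_ok(),
        };
        can_write
    }

    /// Fixed shape: copy the value out first, so the read guard is dropped
    /// at the end of the `let` statement, before any arm runs.
    fn miss_arm_can_write_fixed(&self, app: &str) -> bool {
        let hit = self.load.read().unwrap().get(app).copied();
        match hit {
            Some(_) => true,
            None => self.load.try_write().is_ok(),
        }
    }
}
```

On a lookup miss the buggy method cannot take the write lock because its own read guard is still alive, while the fixed method can.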
Patterns
- patterns/watchdog-bounce-on-deadlock — NEW: a separate supervisor that bounces a wedged process based on liveness-probe failure, converting a production-killing deadlock into a few-second hiccup + a core dump.
- patterns/raii-to-explicit-closure-for-lock-visibility — NEW: replace implicit scope-based lock release (RAII) with explicit closures around the critical section so the held interval is visible in code and instrumentable.
- patterns/lock-timeout-for-contention-telemetry — NEW: use try_write_for-style bounded lock acquisition to fail under contention and emit telemetry rather than block indefinitely.
- patterns/read-recursive-as-desperation-probe — NEW: swap to a re-entrant read lock not to fix the bug but to change the symptom — new log messages reveal the pathology.
- patterns/upstream-the-fix — fifth Fly.io instance (after the 2025-02-26 rustls fix covered in sources/2025-02-26-flyio-taming-a-voracious-rust-proxy). The fix is parking_lot PR #466; reported as issue #465.
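
The explicit-closure and lock-timeout patterns above combine naturally into one helper. The following is a minimal sketch, not Fly's code: with_write_for is a hypothetical name, and because std::sync::RwLock has no timed write acquisition, the function polls try_write as a stand-in for parking_lot's try_write_for.

```rust
use std::sync::RwLock;
use std::time::{Duration, Instant};

/// Runs `f` with the write lock held, giving up after `timeout`.
/// Returns the closure result plus how long the lock was held,
/// or None if the lock could not be acquired in time.
fn with_write_for<T, R>(
    lock: &RwLock<T>,
    timeout: Duration,
    f: impl FnOnce(&mut T) -> R,
) -> Option<(R, Duration)> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(mut guard) = lock.try_write() {
            // The critical section is exactly the closure call, so the
            // held interval is visible in code and easy to time.
            let start = Instant::now();
            let result = f(&mut *guard);
            return Some((result, start.elapsed()));
        }
        if Instant::now() >= deadline {
            // A caller would log and bump a contention metric here.
            return None;
        }
        std::thread::sleep(Duration::from_millis(1));
    }
}
```

A None result distinguishes a slow or stuck lock from a hung process, and the returned hold duration is the kind of value that could feed the lock timing metrics described in this page.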
Operational numbers
- 30+ regions, "thousands of servers", "millions of connections for millions of apps" scale context.
- Fly Machine start latency: "potentially start in less than a second" — motivates routing-state freshness bound.
- Corrosion update propagation: "millisecond intervals of time" host-to-host.
- parking_lot RWLock state: 64-bit word; 4 signaling bits (PARKED, WRITER_PARKED, WRITER, UPGRADEABLE); 60-bit reader counter.
- Corrupt state observed: 0xFFFFFFFFFFFFFFFF (all 4 signal bits set + reader count saturated).
- Watchdog bounce: "second-or-two-length arrhythmia".
- Lockups geography: "all in Europe, especially in WAW".
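
The watchdog bounce listed above depends on a liveness probe against a control channel. The source does not describe how Fly's watchdog or REPL channel is implemented, so the sketch below is purely illustrative: it models the control channel with std::sync::mpsc, and spawn_control_loop, is_responsive, and the Ping type are all hypothetical names.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;
use std::time::Duration;

/// A ping carries a reply channel; a healthy control loop answers each one.
type Ping = Sender<()>;

/// Spawns a hypothetical control loop. If `wedged` is true it accepts
/// pings but never replies, imitating a process stuck on a lock.
fn spawn_control_loop(wedged: bool) -> Sender<Ping> {
    let (tx, rx) = channel::<Ping>();
    thread::spawn(move || {
        let mut unanswered = Vec::new();
        for reply in rx {
            if wedged {
                // Keep the reply channel open but silent.
                unanswered.push(reply);
            } else {
                let _ = reply.send(());
            }
        }
    });
    tx
}

/// Returns true if the control loop answers a ping within `timeout`.
fn is_responsive(control: &Sender<Ping>, timeout: Duration) -> bool {
    let (reply_tx, reply_rx) = channel();
    if control.send(reply_tx).is_err() {
        return false;
    }
    reply_rx.recv_timeout(timeout).is_ok()
}
```

A supervisor built on this shape would call is_responsive on a schedule and, on failure, capture a core dump and restart the process, which is what turns a deadlock into a brief interruption.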
Caveats
- Tier-3 source (Fly.io blog); ingested for architectural + production-incident content (lock-contention debugging, parking_lot bit-level state representation, watchdog + core-dump discipline, upstream-fix pattern instance) per AGENTS.md Tier-3 guidance.
- The WAW-specific timing was never root-caused; the post offers only joking guesses and declines to investigate further because the bug is fixed.
- Post doesn't give traffic-loss / customer-impact numbers for this round of lockups (only context on the 2024 outage, which was different).
- The parking_lot bit-trick code shown is reformatted for readability ("let's rephrase that code to see what it's actually doing") — the actual upstream source is linked.
- parking_lot's deadlock detector explicitly missed this case, because it tracks waiter-graph cycles — the bug produces an artificial deadlock (no owner, so no cycle). Useful documentation of what the detector does and doesn't catch.
- miri found UB in tests; fixing that UB did not fix the lockup. Guard pages were set up but never tripped. Useful negative results on debugging primitives.
- Not an async-Rust bug — Tokio cleared. This is a straightforward synchronous-lock concurrency bug in a widely trusted crate.
- Character: Thomas Ptacek writing in the Fly.io blog voice, dense with pop-culture and literary references ("Dramatis Personae", Cook's "How Complex Systems Fail", "Ex Insania, Claritas", Chernobyl RBMK, the Pavel-a-genius framing). Substance is architectural despite tone.
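
The caveat about the deadlock detector can be illustrated with a toy wait-for graph. This is not parking_lot's detector: it is a simplified sketch in which each lock has at most one owner, each thread waits on at most one lock, and LockState, wait_for_edges, and has_cycle are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified lock record: an optional owning thread id plus waiting thread ids.
struct LockState {
    owner: Option<u32>,
    waiters: Vec<u32>,
}

/// Builds waiter -> owner edges. A lock with no owner contributes no edges.
fn wait_for_edges(locks: &[LockState]) -> HashMap<u32, u32> {
    let mut edges = HashMap::new();
    for lock in locks {
        if let Some(owner) = lock.owner {
            for &waiter in &lock.waiters {
                edges.insert(waiter, owner);
            }
        }
    }
    edges
}

/// Follows edges from every waiter and reports whether any walk revisits a node.
fn has_cycle(edges: &HashMap<u32, u32>) -> bool {
    for &start in edges.keys() {
        let mut seen = HashSet::new();
        let mut current = start;
        while let Some(&next) = edges.get(&current) {
            if !seen.insert(current) {
                return true;
            }
            current = next;
        }
    }
    false
}
```

A classic two-lock deadlock produces a cycle in this graph. A corrupted lock with many waiters and no owner produces no edges at all, so a cycle check has nothing to find.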
Source
- Original: https://fly.io/blog/parking-lot-ffffffffffffffff/
- Raw markdown: raw/flyio/2025-05-28-parking_lot-ffffffffffffffff-1221df2f.md
Related
- systems/fly-proxy — The system under debug.
- systems/corrosion-swim — Source of the routing-state updates whose read/write pattern exposed the bug.
- systems/parking-lot-rust — The crate with the bug.
- systems/tokio — Fly-proxy's async runtime (not at fault here).
- concepts/anycast — The service model whose blast radius motivates regionalization.
- concepts/deadlock-vs-lock-contention — Why Round 1 looked like a deadlock.
- concepts/if-let-lock-scope-bug — The 2024 Round 0 outage class.
- concepts/bitwise-double-free — The root cause class.
- concepts/lock-state-self-synchronizing — The optimisation that enables the bug class.
- concepts/watchdog-repl-channel — The liveness probe that made deadlocks nonlethal.
- concepts/descent-into-madness-debugging — The debugging phase where every model breaks.
- concepts/read-recursive-lock — Used here as a probe, not a fix.
- patterns/watchdog-bounce-on-deadlock — Safety net for concurrency bugs.
- patterns/raii-to-explicit-closure-for-lock-visibility — Critical-section-as-closure refactor.
- patterns/lock-timeout-for-contention-telemetry — What try_write_for was for.
- patterns/read-recursive-as-desperation-probe — The desperation-probe pattern.
- patterns/upstream-the-fix — Fifth Fly.io instance.
- sources/2025-02-26-flyio-taming-a-voracious-rust-proxy — Sibling Fly-proxy Rust concurrency incident (spurious-wakeup busy-loop, upstreamed to rustls). Same proxy, different bug class.
- companies/flyio — Fly.io the company.