FLYIO 2025-05-28 Tier 3

Fly.io — parking_lot: ffffffffffffffff…

Summary

A Fly.io long-form debugging retrospective (2025-05-28, Thomas Ptacek, Tier 3) on a weeks-long hunt for why proxies in European regions — especially WAW (Warsaw) — were locking up after Fly started broadening lazy-loading in fly-proxy's Catalog. After auditing every if let-over-lock scope bug, replacing RAII lock guards with explicit closures for visibility, instrumenting lock timings, enabling parking_lot's deadlock detector, inspecting core dumps under gdb, running under miri, setting up guard pages, and flailing at read_recursive out of desperation — Fly finally discovered a bitwise double-free bug in parking_lot's RwLock::try_write_for wake-up path that corrupted the 64-bit lock word to 0xFFFFFFFFFFFFFFFF under a specific reader-release + writer-timeout timing race. The signaling bits stayed set and the 60-bit reader counter was maxed out, producing an "artificial deadlock" where every thread waits for a lock no thread holds. parking_lot PR #466 — canonical patterns/upstream-the-fix instance. The core narrative shape is "lazy-loading refactor exposes writer contention, contention reveals an upstream bug, upstream bug proves the watchdog was the right move".

Key takeaways

  1. Anycast routing protocol context: fly-proxy is Fly's Rust-written Anycast router. It manages millions of connections for millions of apps across 30+ regions. The hard part isn't proxying — it's state distribution: knowing where a Fly Machine is (potentially starting in <1 s, terminating instantly) so traffic lands on the right worker. "It's a lot of state to manage, and it's in constant flux. We refer to this as the 'state distribution problem', but really, it quacks like a routing protocol."
  2. Corrosion is the RIB; Catalog is the FIB. [[systems/corrosion-swim|Corrosion2]] is the globally replicated CRDT-SQLite SWIM-gossip system of record for routing information. fly-proxy keeps an in-memory aggregation called the Catalog — "a record of everything in Corrosion a proxy might need to know about to forward requests" — for fast decisions. "In somewhat the same sense as a router works both with a RIB and a FIB, there is in fly-proxy a system of record for routing information (Corrosion), and then an in-memory aggregation of that information used to make fast decisions."
  3. Last year's outage, Round 0: a bug in an if let over self.load.read().get() caused the read-lock to be held across both the if arm and the else arm — "you can think of if let expressions as being rewritten to the equivalent match expression, where that lifespan is much clearer". A Corrosion update about an unused app propagated fleet-wide in ms and deadlocked the entire Anycast routing layer — global consensus that Anycast should be down. Canonical if-let lock-scope bug.
  4. Watchdog + REPL as safety net. Post-outage, Fly made deadlocks "nonlethal" with a watchdog system that monitors fly-proxy's internal REPL control channel. When the channel becomes nonresponsive (deadlock / dead-loop / exhaustion), the watchdog bounces the proxy. "A deadlock is still bad! But it's a second-or-two-length arrhythmia, not asystole." Canonical watchdog-bounce-on-deadlock + REPL-channel liveness probe instance. Also snaps core dumps on kill, which later proved load-bearing.
  5. Regionalization as the long-term fix. The outage-causing update pertained to an app nobody used — "there wasn't any real reason for any fly-proxy to receive it in the first place". Fly is mid-migration to regionalize routing state so most updates stay within the region (Sydney, Frankfurt, Dallas) where they originate. "It's a huge lift. It's a lift we're still making!" Lazy-loading the Catalog is a key step.
  6. Round 1 — lazy-loading triggers lockups, but not the outage-style kind. Broadening lazy-loading changes the read/write pattern on the Catalog RWLocks. Proxies start locking up, watchdog bounces them, rolled back. Two suspects: lock contention from the new write pressure, and a suspicious new if let. Canonical example of deadlock-vs-contention confusion — they look identical via the watchdog signal.
  7. Round 2 — lock refactor for visibility + timeouts. Fly (a) eliminated the if let, (b) switched every Catalog write lock from RAII-style lock acquisition to explicit closures — "you can look at the code and see precisely the interval in which the lock is held" — (c) used parking_lot's try_write_for with a Duration timeout so blocked writes fail and emit telemetry rather than hang, and (d) instrumented with labeled logs + metrics. Canonical lock-timeout-for-contention-telemetry instance. Still locks up, especially in WAW.
  8. Round 3 — instrumentation returns nonsense. Lock-timeout logs spam just before the watchdog bounce. parking_lot's deadlock detector (runs on its own thread tracking a waiting-for dependency graph) detects nothing. Slow locks appear — but only right before the freeze, in benign quiet applications.
  9. Round 4 — core dumps defy theory. Pavel Zakharov reading the WAW cores:

    "First, there is no thread that's running inside the critical section. Yet, there is a thread that's waiting to acquire write lock and a bunch of threads waiting to acquire a read lock."

Every single stack trace: everything wants the Catalog lock; nobody has it. This breaks both the slow-reader and the missed-deadlock theories. Canonical descent-into-madness moment — the bug is incompatible with every working model.

  10. Round 5 — read_recursive as a desperation probe. parking_lot's writer-preference logic prevents new readers from acquiring while a writer is waiting, to avoid writer starvation. read_recursive sidesteps that logic and lets re-entering readers grab the lock regardless. Not the right primitive for this code, but a probe: if some slow reader is poisoning the lock, maybe read_recursive cuts through. Canonical read-recursive-as-desperation-probe instance. It doesn't fix anything, but it produces new evidence — RwLock reader count overflow messages start appearing in logs. Lots of them.
  11. The bitwise double-free in parking_lot. parking_lot's RWLock state is a single 64-bit word — 4 signaling bits (PARKED, WRITER_PARKED, WRITER, UPGRADEABLE) and a 60-bit reader counter. Clearing specific bits atomically is implemented by adding the two's-complement inverse of those bits to the word (a self-synchronizing atomic state update): if the bits you expect to be set really are set, the add zeroes them; if they're not, you've added a very large value to an uncontrolled word.
      - Bits are 0b1010 (WRITER | WRITER_PARKED); prev_value is 0; .wrapping_sub(0b1010) = 0xFFFFFFFFFFFFFFF6; fetch_add cancels to 0. Works.
      - Bits are 0b1000 (only WRITER, because WRITER_PARKED already got cleared by someone else): the add doesn't cancel — state becomes 0xFFFFFFFFFFFFFFFE. The reader counter is saturated (can't decrement), and all signal bits are set.
      - "As a grace note, the locking code manages to set that one last unset bit before the proxy deadlocks."

      Canonical [bitwise double-free on a tightly-encoded state word](<../concepts/bitwise-double-free.md>) instance.
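A minimal Rust model of the self-synchronizing bit-clear and its failure mode. This is a sketch of the trick the post describes, not parking_lot's actual code; the constant names and values are assumed to mirror the post's examples:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Assumed bit values, mirroring the post's 0b1010 / 0b1000 examples.
const WRITER_PARKED_BIT: u64 = 0b0010;
const WRITER_BIT: u64 = 0b1000;

// "Clear these bits" implemented as "add their two's-complement inverse":
// correct only if every bit in `bits` is actually set in the word.
fn clear_bits(word: &AtomicU64, bits: u64) -> u64 {
    word.fetch_add(0u64.wrapping_sub(bits), Ordering::Relaxed)
}

fn main() {
    // Happy path: both bits set, the add cancels them exactly to 0.
    let word = AtomicU64::new(WRITER_BIT | WRITER_PARKED_BIT); // 0b1010
    clear_bits(&word, WRITER_BIT | WRITER_PARKED_BIT);
    assert_eq!(word.load(Ordering::Relaxed), 0);

    // Double-free path: WRITER_PARKED was already cleared by someone else,
    // so the add underflows into the reader counter.
    let word = AtomicU64::new(WRITER_BIT); // 0b1000
    clear_bits(&word, WRITER_BIT | WRITER_PARKED_BIT);
    assert_eq!(word.load(Ordering::Relaxed), 0xFFFF_FFFF_FFFF_FFFE);

    // The "grace note": one more stray bit-set and the word is all ones.
    word.fetch_add(1, Ordering::Relaxed);
    assert_eq!(word.load(Ordering::Relaxed), u64::MAX);
}
```

The same arithmetic that makes the clear branch-free and atomic is what turns a redundant clear into wholesale corruption: there is no check that the bits were set.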
  12. The race sequence:
    1. Thread 1 grabs a read lock.
    2. Thread 2 tries to grab a write lock with try_write_for timeout; it parks, setting WRITER + WRITER_PARKED.
    3. Thread 1 releases its read lock, unparking the waiter, which unsets WRITER_PARKED (as part of wake).
    4. Thread 2 wakes — "not for the reason it thinks" — the timing looks to Thread 2 like a timeout, so it tries to clear both WRITER and WRITER_PARKED. WRITER_PARKED was already unset by step 3. Bitwise double-free. Lock word → 0xFFFFFFFFFFFFFFFE, then → 0xFFFFFFFFFFFFFFFF once any thread sets the last bit. The Catalog is eternally "owned by all of them and none of them". Everything waits forever; the watchdog kills the proxy.
  13. Fix: parking_lot PR #466 — the writer bit is cleared separately, in the same wakeup queue as the reader bit, so the two clear operations can't race. "The fix is deployed, the lockups never recur." Canonical patterns/upstream-the-fix instance — Fly.io's fifth since 2024, after the V8 / Go-arm64 / rustls / Web Streams cases.
  14. The WAW mystery remains unresolved: why overwhelmingly in Warsaw? "Some kind of crazy regional timing thing? Something to do with the Polish kreska diacritic that makes L's sound like W's? The wax and wane of caribou populations? … We'll never know because we fixed the bug." — an honest admission that scale-surfaced race conditions often have regional correlations nobody ever fully explains.
  15. Debugging arc gifts that outlast the bug: (a) all if let-over-locks audited out; (b) RAII lock guards replaced fleet-wide with explicit closures, giving lock timing metrics; (c) labeled logs on slow writes with context data (app IDs); (d) last + current holder of the write lock tracked with context — "next time we have a deadlock, we should have all the information we need to identify the actors without gdb stack traces."
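The if let-over-lock hazard (takeaway 3 and the 2024 outage) comes down to temporary lifetimes. It can be demonstrated with a Drop-logging stand-in for a lock guard — a sketch, not fly-proxy code. In Rust editions before 2024, `if let Some(v) = guard().get()` gives the guard temporary the same lifetime as the equivalent `match` shown here, spanning both arms:

```rust
use std::cell::RefCell;

thread_local! {
    // Event log so we can observe exactly when the "lock" is released.
    static EVENTS: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

// Stand-in for a read-lock guard that records when it is dropped.
struct ReadGuard;
impl ReadGuard {
    fn get(&self) -> Option<u32> { None } // simulate "key not found"
}
impl Drop for ReadGuard {
    fn drop(&mut self) { EVENTS.with(|e| e.borrow_mut().push("read-unlock")); }
}

fn read_lock() -> ReadGuard {
    EVENTS.with(|e| e.borrow_mut().push("read-lock"));
    ReadGuard
}

fn main() {
    // The `match` an `if let` desugars to (pre-2024 editions): the scrutinee
    // temporary — the guard — lives through BOTH arms.
    match read_lock().get() {
        Some(_v) => {}
        None => {
            EVENTS.with(|e| e.borrow_mut().push("else-arm"));
            // Taking a write lock here would self-deadlock: the read
            // guard has not been dropped yet.
        }
    }
    EVENTS.with(|e| {
        assert_eq!(*e.borrow(), vec!["read-lock", "else-arm", "read-unlock"]);
    });
    println!("read guard dropped only AFTER the else arm ran");
}
```

Restructuring so the guard is dropped before the else-arm work (e.g. binding the `Option` to a local first) is the shape of the audit fix the note describes.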

Systems

  • systems/fly-proxy — Fly.io's Rust Anycast router; the system exhibiting the lockups; owns the Catalog RWLock refactor.
  • systems/corrosion-swim — System-of-record for routing info; Fly's SWIM-gossip CRDT-SQLite database. Its updates are what fly-proxy's Catalog aggregates.
  • systems/parking-lot-rust — NEW: Amanieu's well-regarded replacement for std::sync's locks. Source of the bug. 64-bit compact lock representation + try_write_for + deadlock detector + read_recursive.
  • systems/tokio — Fly.io is a Tokio application; the post briefly considers an async-lock-in-sync-context theory before rejecting it. "parking_lot locks are synchronous, but we're a Tokio application; something somewhere could be taking an async lock that's confusing the runtime. Alas, no."

Concepts

  • concepts/anycast — Fly.io runs anycast via fly-proxy; the 2024 outage's blast radius was global because routing state has a global broadcast domain (the motivating force behind regionalization).
  • concepts/deadlock-vs-lock-contention — NEW: they present identically under a watchdog-bounce signal; the refactor to add try_write_for is specifically how Fly separated them.
  • concepts/if-let-lock-scope-bug — NEW: the previous year's outage; an if let holding a read lock across both the if arm and the else arm.
  • concepts/bitwise-double-free — NEW: a double-free on the bits of a tightly-packed state word rather than on a memory allocation. The root cause of this post.
  • concepts/lock-state-self-synchronizing — NEW: clearing specific bits of a word atomically by adding their inverse — elegant when the invariant holds, catastrophic when it doesn't.
  • concepts/watchdog-repl-channel — NEW: a liveness probe via an internal REPL control channel; if it stops responding, the process is presumed wedged.
  • concepts/descent-into-madness-debugging — NEW: when every working model is incompatible with the evidence and you're now suspecting the compiler. Named after Ptacek's "Descent Into Madness" section header.
  • concepts/read-recursive-lock — NEW: a re-entrant read lock that sidesteps writer-preference starvation avoidance. Used here as a probe, not a fix.
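The try_write_for discipline that separates contention from deadlock can be sketched with std primitives alone. `try_lock_for` here is a hypothetical helper standing in for parking_lot's timed acquisition, not a std method:

```rust
use std::sync::{Mutex, MutexGuard, TryLockError};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for a timed lock acquisition: spin on try_lock
// until a deadline, so a blocked writer fails visibly instead of hanging.
fn try_lock_for<T>(m: &Mutex<T>, timeout: Duration) -> Option<MutexGuard<'_, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        match m.try_lock() {
            Ok(guard) => return Some(guard),
            Err(TryLockError::WouldBlock) if Instant::now() < deadline => thread::yield_now(),
            Err(TryLockError::WouldBlock) => return None, // timed out: log it, don't hang
            Err(TryLockError::Poisoned(p)) => return Some(p.into_inner()),
        }
    }
}

fn main() {
    let catalog = Mutex::new(vec!["app-1"]); // stand-in for the Catalog
    let reader = catalog.lock().unwrap();    // someone is holding the lock...

    // ...so the timed write attempt fails, and we emit telemetry instead of wedging:
    match try_lock_for(&catalog, Duration::from_millis(50)) {
        Some(_) => unreachable!("lock is held"),
        None => println!("catalog write timed out after 50ms; emitting contention metric"),
    }
    drop(reader);

    // Uncontended, the same call succeeds immediately.
    assert!(try_lock_for(&catalog, Duration::from_millis(50)).is_some());
}
```

The telemetry on the `None` branch is the whole point: a deadlock produces an unbroken stream of timeouts, while ordinary contention produces occasional ones with identifiable holders.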

Patterns

  • patterns/upstream-the-fix — parking_lot PR #466; Fly.io's fifth upstream-fix instance since 2024 (after the V8, Go-arm64, rustls, and Web Streams cases).

Operational numbers

  • 30+ regions, "thousands of servers", "millions of connections for millions of apps" scale context.
  • Fly Machine start latency: "potentially start in less than a second" — motivates routing-state freshness bound.
  • Corrosion update propagation: "millisecond intervals of time" host-to-host.
  • parking_lot RWLock state: 64-bit word; 4 signaling bits (PARKED, WRITER_PARKED, WRITER, UPGRADEABLE); 60-bit reader counter.
  • Corrupt state observed: 0xFFFFFFFFFFFFFFFF (all 4 signal bits set + reader count saturated).
  • Watchdog bounce: "second-or-two-length arrhythmia".
  • Lockups geography: "all in Europe, especially in WAW".
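The corrupt word decodes mechanically. A sketch of that decode; the bit positions (signal bits low, counter above) are assumptions modeled on the post's description of 4 signal bits plus a 60-bit counter:

```rust
// Decode the observed corrupt lock word under the assumed layout:
// 4 low signal bits, 60-bit reader counter in the bits above them.
const SIGNAL_BITS: u64 = 0b1111; // PARKED | WRITER_PARKED | UPGRADEABLE | WRITER
const READERS_SHIFT: u32 = 4;
const MAX_READERS: u64 = (1u64 << 60) - 1;

fn main() {
    let observed: u64 = 0xFFFF_FFFF_FFFF_FFFF;
    // Every signal bit set: a writer, a parked writer, an upgradeable
    // reader, and parked threads, all at once.
    assert_eq!(observed & SIGNAL_BITS, SIGNAL_BITS);
    // Reader counter saturated at its 60-bit maximum: can't decrement.
    assert_eq!(observed >> READERS_SHIFT, MAX_READERS);
    println!(
        "signal bits: {:04b}, readers: {}",
        observed & SIGNAL_BITS,
        observed >> READERS_SHIFT
    );
}
```

Every field reads as "maximally held", which is why every thread waits on a lock no thread owns.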

Caveats

  • Tier-3 source (Fly.io blog); ingested for architectural + production-incident content (lock-contention debugging, parking_lot bit-level state representation, watchdog + core-dump discipline, upstream-fix pattern instance) per AGENTS.md Tier-3 guidance.
  • The WAW-specific timing was never root-caused — post explicitly declines to speculate.
  • Post doesn't give traffic-loss / customer-impact numbers for this round of lockups (only context on the 2024 outage, which was different).
  • The parking_lot bit-trick code shown is reformatted for readability ("let's rephrase that code to see what it's actually doing") — the actual upstream source is linked.
  • parking_lot deadlock detector explicitly missed this case, because it tracks waiter-graph cycles — the bug produces an artificial deadlock (no owner, so no cycle). Useful documentation of what the detector does and doesn't catch.
  • miri found UB in tests that got fixed but didn't fix the lockup. Guard pages set up but never tripped. Useful negative results on debugging primitives.
  • Not an async-Rust bug — Tokio cleared. This is a straightforward synchronous-lock concurrency bug in a widely trusted crate.
  • Character: Thomas Ptacek writing Fly.io blog voice, dense with pop-culture and literary references ("Dramatis Personae", Cook's "How Complex Systems Fail", "Ex Insania, Claritas", Chernobyl RBMK, the Pavel-a-genius framing). Substance is architectural despite tone.

Source

200 distilled / 1,178 read