PATTERN Cited by 1 source
Read-recursive as desperation probe¶
Problem¶
Every theory you have about a concurrency bug has been refuted
by evidence. The code isn't deadlocked (core dumps show no
owner). Lock timings don't reveal a slow holder. parking_lot's
deadlock detector reports nothing. You have started to suspect a
widely used library or (worse) the compiler. You're past
instrumentation: each new piece you add confirms that you still
don't understand the failure mode. You need new evidence, any new
evidence, and you don't care whether the probe itself makes sense.
Pattern¶
Switch parking_lot's
RwLock::read() calls to
read_recursive() throughout
the contested code path — not because the code is re-entrant
or because you believe writer-preference is the bug, but
because:
- It bypasses the writer-preference logic, which is a distinct
code path in parking_lot's raw RwLock implementation.
- If your hypothesis was "a slow reader somewhere is poisoning
the lock", read_recursive lets you see whether new readers can
still acquire despite that.
- Most importantly: if there's lock-state corruption, the
different code path will probably hit a different internal
assertion and produce a different log message than plain
read(). Even if the bug isn't fixed, the error messages change,
and that change is evidence.
Why the probe is productive even though it isn't a fix¶
read_recursive can take branches through parking_lot's lock-word
manipulation that read() does not reach under the same lock
state, including reader-count increments that detect counter
saturation (the RwLock reader count overflow message). Running
the workload with read_recursive is therefore a differential
probe: anything that code path does differently from read() is
new evidence.
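The differential effect can be illustrated with a toy lock word. The sketch below is a simplified model, not parking_lot's implementation: the bit layout (WRITER_BIT, ONE_READER), the ToyLock type, and the Acquire outcomes are hypothetical names chosen for illustration. It assumes that a recursive read skips the writer check whenever readers are already counted, so a corrupted all-ones word makes the plain path wait quietly while the recursive path runs into the overflow check.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical bit layout for this sketch only; not copied from parking_lot.
const WRITER_BIT: usize = 0b1000; // a writer holds or is waiting for the lock
const ONE_READER: usize = 0b1_0000; // reader count lives above the flag bits
const READERS_MASK: usize = !(ONE_READER - 1);

#[derive(Debug, PartialEq)]
enum Acquire {
    Granted,
    WouldBlock,      // a real lock would park the calling thread here
    CounterOverflow, // a real lock might panic or log here
}

struct ToyLock {
    state: AtomicUsize,
}

impl ToyLock {
    /// Build a lock with an arbitrary raw word, e.g. to simulate corruption.
    fn with_state(raw: usize) -> Self {
        ToyLock { state: AtomicUsize::new(raw) }
    }

    /// One acquisition attempt. `recursive` models the read_recursive path.
    fn try_read(&self, recursive: bool) -> Acquire {
        let mut state = self.state.load(Ordering::Relaxed);
        loop {
            if state & WRITER_BIT != 0 {
                // Plain reads always defer to a writer. Recursive reads defer
                // only when no reader is counted yet.
                if !recursive || state & READERS_MASK == 0 {
                    return Acquire::WouldBlock;
                }
            }
            // The increment is where counter saturation becomes visible.
            let new = match state.checked_add(ONE_READER) {
                Some(n) => n,
                None => return Acquire::CounterOverflow,
            };
            match self.state.compare_exchange_weak(
                state,
                new,
                Ordering::Acquire,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Acquire::Granted,
                Err(actual) => state = actual, // lost a race; retry
            }
        }
    }
}
```

In this model, ToyLock::with_state(usize::MAX).try_read(false) returns WouldBlock, which looks like an ordinary wait, while try_read(true) on the same word returns CounterOverflow. The same corrupted state produces two different symptoms depending on the path taken.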
Result shape¶
You may get:
- Silence: probe behaves exactly like read(). The bug isn't in that code path. Rule it out; move on.
- Different error messages: you've surfaced something the read() path was hiding. Investigate.
- A fix: pure luck. Don't count on it.
- A new bug: read_recursive introduces writer starvation and re-entrance risks in production code; running the probe in prod for long is not advised.
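Telling "silence" apart from "different error messages" amounts to diffing the distinct messages of a baseline run against those of a probe run. The sketch below is one minimal way to do that, assuming log lines have already been stripped of timestamps and other per-line noise; ProbeOutcome and classify are hypothetical names, not part of any library.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
enum ProbeOutcome {
    /// The probe run produced no message the baseline did not.
    Silence,
    /// Distinct messages seen only under the probe, in sorted order.
    NewMessages(Vec<String>),
}

/// Compare normalized log messages from a baseline run and a probe run.
fn classify(baseline: &[&str], probe: &[&str]) -> ProbeOutcome {
    let seen: BTreeSet<&str> = baseline.iter().copied().collect();
    // A set collapses repeats: a flood of one message is still one finding.
    let fresh: BTreeSet<&str> = probe
        .iter()
        .copied()
        .filter(|msg| !seen.contains(msg))
        .collect();
    if fresh.is_empty() {
        ProbeOutcome::Silence
    } else {
        ProbeOutcome::NewMessages(fresh.into_iter().map(String::from).collect())
    }
}
```

A fix or a new bug will not show up in this diff on its own. Those outcomes still have to be judged from whether the original symptom disappears or a new one appears.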
When to use (and when not to)¶
Use when:
- You're in the descent-into-madness phase — all theories refuted, all instrumentation saturated.
- You have a watchdog safety net so bad probes don't cause extended outages.
- You're willing to swap the probe back out once it's produced (or failed to produce) new evidence.
Don't use when:
- You have any well-defined hypothesis left to test with lower-risk instrumentation.
- The probe would run in production long enough to introduce a different class of bug.
- You're in a system without a watchdog — if the probe wedges differently, you've just introduced a harder bug.
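The watchdog requirement can be sketched as a deadline around the probed work. This is a minimal in-process illustration using only the standard library; Verdict and run_with_watchdog are hypothetical names, and a production watchdog would more likely be an external supervisor that restarts the process or reverts the probe rather than just reporting a verdict.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Verdict {
    Completed,
    Wedged,  // no completion signal before the deadline
    Crashed, // worker exited (e.g. panicked) without signalling
}

/// Run `work` on its own thread and give up waiting after `limit`.
fn run_with_watchdog<F>(work: F, limit: Duration) -> Verdict
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel::<()>();
    thread::spawn(move || {
        work();
        // The receiver may have stopped listening; that is fine.
        let _ = done_tx.send(());
    });
    match done_rx.recv_timeout(limit) {
        Ok(()) => Verdict::Completed,
        Err(RecvTimeoutError::Timeout) => Verdict::Wedged,
        // The sender was dropped without sending: the worker died early.
        Err(RecvTimeoutError::Disconnected) => Verdict::Crashed,
    }
}
```

Note that a wedged worker thread is left running here. The sketch only detects the hang; recovering from it is the job of whatever sits above the watchdog.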
Canonical instance — Fly.io parking_lot bug, Round 5¶
From the 2025-05-28 parking_lot post:
*"Fuck it, we'll switch to read_recursive. A recursive read lock is an eldritch rite invoked when you need to grab a read lock deep in a call tree where you already grabbed that lock, but can't be arsed to structure the code properly to reflect that. When you ask for a recursive lock, if you already hold the lock, you get the lock again, instead of a deadlock. Our theory: parking_lot goes through some trouble to make sure a stampede of readers won't starve writers, who are usually outnumbered. It prefers writers by preventing readers from acquiring locks when there's at least one waiting writer. And read_recursive sidesteps that logic. Maybe there's some ultra-slow reader somehow not showing up in our traces, and maybe switching to recursive locks will cut through that. This does not work. At least, not how we hoped it would. It does generate a new piece of evidence: RwLock reader count overflow log messages, and lots of them."* (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)
The probe didn't fix the bug. But RwLock reader count
overflow was the first direct evidence that the lock word
itself was being corrupted — the evidence that led, via
close-reading the parking_lot source, to the
bitwise double-free root
cause. Ptacek's "Ex Insania, Claritas" section header names
the moment.
Seen in¶
- sources/2025-05-28-flyio-parking-lot-ffffffffffffffff — Canonical wiki instance.
Related¶
- systems/parking-lot-rust — The library exposing read_recursive.
- concepts/read-recursive-lock — The primitive itself.
- concepts/descent-into-madness-debugging — The phase that makes this pattern necessary.
- concepts/bitwise-double-free — The bug class the probe ultimately helped surface.
- companies/flyio — Fly.io.