PATTERN Cited by 1 source

Read-recursive as desperation probe

Problem

Every theory you have about a concurrency bug has been refuted by evidence. The code isn't deadlocked (core dumps show no thread owning the lock). Lock timings don't reveal a slow holder. parking_lot's deadlock detector reports nothing. You have started to suspect a widely used library or, worse, the compiler. You're past instrumentation: each new piece you add confirms you still don't understand the failure mode. You need new evidence, any new evidence, and you don't care whether the probe itself makes any sense.

Pattern

Switch parking_lot's RwLock::read() calls to read_recursive() throughout the contested code path — not because the code is re-entrant or because you believe writer-preference is the bug, but because:

  • It bypasses the writer-preference logic, which is a distinct code path in parking_lot's raw RwLock implementation.
  • If your hypothesis is "a slow reader somewhere is wedging the lock", read_recursive lets you see whether new readers can still acquire despite that.
  • Most importantly: if there's lock-state corruption, the different code path will probably hit a different internal assertion and produce a different log message than plain read(). Even if the bug isn't fixed, the error messages change, and that change is evidence.

Why the probe is productive even though it isn't a fix

read_recursive can reach parts of parking_lot's lock-word manipulation that read() does not reach while a writer appears to be pending, including the reader-count increment that detects counter saturation (the RwLock reader count overflow message). Running the workload with read_recursive is a differential probe: anything the code path does differently from read() is new evidence.

Result shape

You may get:

  • Silence: the probe behaves exactly like read(). Whatever differs between the two paths is probably not where the bug lives. Rule it out; move on.
  • Different error messages: you've surfaced something the read() path was hiding. Investigate.
  • A fix: pure luck. Don't count on it.
  • A new bug: read_recursive introduces writer starvation and re-entrance risks in production code; running the probe in prod for long is not advised.
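Triage of a probe run can be partly mechanized by diffing its messages against a baseline run. This is a minimal sketch, assuming log lines have already been reduced to bare message text (timestamps and thread IDs stripped). The `ProbeOutcome` type, the `classify` function, and the "proxy stalled" line are hypothetical; only the overflow message comes from the source.

```rust
use std::collections::BTreeSet;

/// Hypothetical outcome categories, loosely mirroring the list above.
#[derive(Debug, PartialEq)]
enum ProbeOutcome {
    /// Probe run produced nothing the baseline did not.
    Silence,
    /// Messages seen only in the probe run, in first-seen order.
    NewMessages(Vec<String>),
    /// The original symptom vanished and nothing new appeared.
    SymptomGone,
}

fn classify(baseline: &[&str], probe: &[&str], symptom: &str) -> ProbeOutcome {
    let seen: BTreeSet<&str> = baseline.iter().copied().collect();

    // Collect messages that appear only in the probe run, without duplicates.
    let mut fresh: Vec<String> = Vec::new();
    for line in probe {
        if !seen.contains(line) && !fresh.iter().any(|f| f.as_str() == *line) {
            fresh.push(line.to_string());
        }
    }
    if !fresh.is_empty() {
        return ProbeOutcome::NewMessages(fresh);
    }

    let before = baseline.iter().any(|l| l.contains(symptom));
    let after = probe.iter().any(|l| l.contains(symptom));
    if before && !after {
        ProbeOutcome::SymptomGone
    } else {
        ProbeOutcome::Silence
    }
}

fn main() {
    let baseline = ["proxy stalled"]; // hypothetical symptom line
    let probe = ["proxy stalled", "RwLock reader count overflow"];
    println!("{:?}", classify(&baseline, &probe, "proxy stalled"));
}
```

A diff like this cannot tell "the probe surfaced hidden corruption" apart from "the probe introduced a new bug"; both show up as new messages, and deciding between them still takes a human reading the code.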

When to use (and when not to)

Use when:

  • You're in the descent-into-madness phase — all theories refuted, all instrumentation saturated.
  • You have a watchdog safety net so bad probes don't cause extended outages.
  • You're willing to swap the probe back out once it's produced (or failed to produce) new evidence.

Don't use when:

  • You have any well-defined hypothesis left to test with lower-risk instrumentation.
  • The probe would run in production long enough to introduce a different class of bug.
  • You're in a system without a watchdog — if the probe wedges differently, you've just introduced a harder bug.
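The watchdog precondition can be as simple as a heartbeat that the contested code path updates and a monitor that reacts when it goes stale. The following is a sketch only, using the standard library; `Heartbeat`, `watch`, and the thresholds are hypothetical, and a real watchdog would restart the process or revert the probe where this one just prints.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical heartbeat shared between the worker and the watchdog.
struct Heartbeat {
    start: Instant,
    last_ms: AtomicU64, // milliseconds since `start` at the last beat
}

impl Heartbeat {
    fn new() -> Self {
        Heartbeat { start: Instant::now(), last_ms: AtomicU64::new(0) }
    }

    /// Called by the worker each time it makes progress.
    fn beat(&self) {
        let now = self.start.elapsed().as_millis() as u64;
        self.last_ms.store(now, Ordering::Relaxed);
    }

    /// True if no beat has been recorded for longer than `limit`.
    fn stalled(&self, limit: Duration) -> bool {
        let now = self.start.elapsed().as_millis() as u64;
        let last = self.last_ms.load(Ordering::Relaxed);
        now.saturating_sub(last) > limit.as_millis() as u64
    }
}

/// Polls the heartbeat; returns true if it stalls before `max_wait` elapses.
fn watch(hb: &Heartbeat, limit: Duration, max_wait: Duration) -> bool {
    let deadline = Instant::now() + max_wait;
    while Instant::now() < deadline {
        if hb.stalled(limit) {
            return true;
        }
        thread::sleep(Duration::from_millis(5));
    }
    false
}

fn main() {
    let hb = Arc::new(Heartbeat::new());
    let worker = {
        let hb = Arc::clone(&hb);
        thread::spawn(move || {
            // Makes progress a few times, then stops beating (simulated wedge).
            for _ in 0..5 {
                hb.beat();
                thread::sleep(Duration::from_millis(10));
            }
        })
    };

    let stalled = watch(&hb, Duration::from_millis(100), Duration::from_secs(2));
    worker.join().unwrap();
    if stalled {
        eprintln!("watchdog: heartbeat stalled; restart or revert the probe here");
    }
}
```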

Canonical instance — Fly.io parking_lot bug, Round 5

From the 2025-05-28 parking_lot post:

*"Fuck it, we'll switch to read_recursive. A recursive read lock is an eldritch rite invoked when you need to grab a read lock deep in a call tree where you already grabbed that lock, but can't be arsed to structure the code properly to reflect that. When you ask for a recursive lock, if you already hold the lock, you get the lock again, instead of a deadlock. Our theory: parking_lot goes through some trouble to make sure a stampede of readers won't starve writers, who are usually outnumbered. It prefers writers by preventing readers from acquiring locks when there's at least one waiting writer. And read_recursive sidesteps that logic. Maybe there's some ultra-slow reader somehow not showing up in our traces, and maybe switching to recursive locks will cut through that.

This does not work. At least, not how we hoped it would. It does generate a new piece of evidence: RwLock reader count overflow log messages, and lots of them."* (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)

The probe didn't fix the bug. But RwLock reader count overflow was the first direct evidence that the lock word itself was being corrupted — the evidence that led, via close-reading the parking_lot source, to the bitwise double-free root cause. Ptacek's "Ex Insania, Claritas" section header names the moment.
