
PATTERN Cited by 1 source

Diagnose via heap-dump lock introspection

Problem

Thread dumps tell you who is waiting for what, but not always who holds what. Several failure modes make the thread dump silent about the lock owner:

  1. The primary dumping tool drops the metadata. In Java 21, jcmd <pid> Thread.dump_to_file omits the "- locked" and "Locked ownable synchronizers" entries, so the VT-capable thread dump carries no lock-owner information.
  2. The "owner" does not actually exist. When a lock is corrupted (stale AQS state, bitwise double-free on the lock word) or when the owner has released via a Condition.awaitNanos path and is requeued, no thread shows as the current owner.
  3. The thread dump is sampled at a moment when the owner is in an uninstrumented transition (between release and requeue).

Without the owner, you can't reason about why waiters don't progress.

Pattern

When the thread dump is exhausted as an evidence source:

  1. Capture a heap dump of the same process. jcmd <pid> GC.heap_dump /path/heap.hprof for JVMs; core dumps via gcore / gdb / kill -SIGQUIT for native runtimes.
  2. Identify the lock object on the heap by walking from a known waiter's stack-local references. Eclipse MAT's "inspector" pane is excellent for this.
  3. Read the lock's internal state fields directly — the AQS state word, exclusiveOwnerThread, waiter queue pointers — and cross-reference with the thread dump's thread IDs.
  4. Reverse-engineer what you need against the lock implementation source code (Java: AbstractQueuedSynchronizer.java; Rust: parking_lot's raw_rwlock.rs; etc.). You don't need to understand every line — just enough to interpret the observed state.
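Step 1 is easy to script so that the thread dump and the heap dump come from the same process within moments of each other. The following is a minimal sketch, assuming jcmd is on the PATH and the target is a JDK 21 JVM; the function names and output file names are hypothetical.

```python
import subprocess
from pathlib import Path


def thread_dump_command(pid: int, out_file: str) -> list[str]:
    # Thread.dump_to_file is the virtual-thread-aware dump in JDK 21.
    return ["jcmd", str(pid), "Thread.dump_to_file", "-format=json", out_file]


def heap_dump_command(pid: int, out_file: str) -> list[str]:
    # GC.heap_dump writes an hprof file that Eclipse MAT can open.
    return ["jcmd", str(pid), "GC.heap_dump", out_file]


def capture(pid: int, out_dir: str) -> tuple[Path, Path]:
    """Take both dumps back to back so thread names and IDs line up."""
    out = Path(out_dir).resolve()
    threads = out / f"threads-{pid}.json"   # hypothetical naming scheme
    heap = out / f"heap-{pid}.hprof"
    subprocess.run(thread_dump_command(pid, str(threads)), check=True)
    subprocess.run(heap_dump_command(pid, str(heap)), check=True)
    return threads, heap
```

Absolute paths are used because the dump files are written by the target JVM, not by the jcmd client, so a relative path would resolve against the target's working directory.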

Why it works

  • All lock state is data. Every synchronization primitive represents its state as memory that can be read, however clever the encoding (bitpacked words, AQS queue links, per-thread parkBlocker references).
  • Heap dumps are complete snapshots. Thread dumps are metadata about threads; heap dumps are state about the actual runtime objects, including the locks.
  • Cross-referencing with the thread dump turns an ambiguous scene into a fully reconstructed one: which threads are in the waiter queue, in what order, with what stack traces.
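As an illustration of a bitpacked word: ReentrantLock stores its hold count directly in the AQS state int, while ReentrantReadWriteLock splits the same 32-bit int into a shared (reader) count in the upper 16 bits and an exclusive (writer) hold count in the lower 16. The sketch below decodes a state value copied by hand from the heap dump; the function name and the result keys are hypothetical.

```python
SHARED_SHIFT = 16                        # same split as ReentrantReadWriteLock.Sync
EXCLUSIVE_MASK = (1 << SHARED_SHIFT) - 1


def decode_rw_state(state: int) -> dict[str, int]:
    """Split a ReentrantReadWriteLock AQS state int into its two counts."""
    state &= 0xFFFFFFFF                  # the tool may display the int as signed
    return {
        "read_holds": state >> SHARED_SHIFT,
        "write_holds": state & EXCLUSIVE_MASK,
    }


print(decode_rw_state(0x00030000))  # three read holds, no writer
print(decode_rw_state(0x00000002))  # one writer that has re-entered once
```

The same habit applies to any primitive: find the constants in the implementation source, then apply them to the raw value seen in the dump.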

Canonical wiki instance — Netflix 2024-07-29

Netflix had a JVM hung with pinned virtual threads. Thread dump:

  • 4 VTs blocked on ReentrantLock.lock inside synchronized on the Brave span-finish path.
  • 1 more VT blocked on the same lock via a different (non-synchronized) path.
  • 1 platform thread (AsyncReporter flusher) blocked on the same lock, in the AQS.acquire post-awaitNanos reacquire path.
  • No thread showing as lock owner.

The heap dump, inspected via Eclipse MAT, revealed:

  • The ReentrantLock's AQS state showed no exclusiveOwnerThread.
  • The AQS waiter queue contained all 6 threads, in a FIFO order that put the flusher behind the pinned VTs.
  • The Condition's internal queue confirmed the flusher's recent awaitNanos release.

Interpretation: the flusher had the lock, released it via awaitNanos, timed out, and was queued behind the already-waiting pinned VTs. The pinned VTs can't release their carriers (they're pinned), so they can't be the next acquirer, and the flusher sits behind them in FIFO order. The result is a starvation deadlock, visible only from the heap.
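The reconstruction can be mimicked on paper or in a few lines of code once the fields have been copied out of the heap-dump tool. The sketch below is a hand-built model, not a reader for hprof files: the Node class, the object ids, and the thread names are all hypothetical, and the queue contents are illustrative rather than the real Netflix data. It relies only on the AQS queue being a linked list reached from a dummy head node, with each node carrying its waiting thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One AQS queue node, transcribed by hand from the heap dump."""
    waiter: Optional[str]   # thread name; None for the dummy head node
    next: Optional[str]     # object id of the next node; None at the tail


def queued_waiters(nodes: dict[str, Node], head: str) -> list[str]:
    """Follow head -> next links and return the waiters in FIFO order."""
    order: list[str] = []
    seen: set[str] = set()
    cur = nodes[head].next
    while cur is not None and cur not in seen:   # guard against cyclic links
        seen.add(cur)
        if nodes[cur].waiter is not None:
            order.append(nodes[cur].waiter)
        cur = nodes[cur].next
    return order


def describe(state: int, owner: Optional[str], waiters: list[str]) -> str:
    """Summarize a ReentrantLock from its state, owner field, and queue."""
    if owner is not None:
        return f"held by {owner} (hold count {state})"
    if state == 0 and waiters:
        return f"free, yet {len(waiters)} waiters queued; next in line: {waiters[0]}"
    if state == 0:
        return "free and uncontended"
    return "non-zero state without an owner: transient or corrupted"


# Illustrative data only: invented ids and names.
nodes = {
    "n0": Node(None, "n1"),        # dummy head
    "n1": Node("vt-1", "n2"),
    "n2": Node("vt-2", "n3"),
    "n3": Node("vt-3", "n4"),
    "n4": Node("vt-4", "n5"),
    "n5": Node("vt-5", "n6"),
    "n6": Node("flusher", None),
}
waiters = queued_waiters(nodes, "n0")
print(describe(0, None, waiters))
```

A lock that reports "free" while waiters are queued is the signal to look at why the thread at the front of the queue cannot run, which is where the thread dump's stack traces come back in.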

Same spirit, different substrate. Fly.io's 2025-05-28 investigation (sources/2025-05-28-flyio-parking-lot-ffffffffffffffff) read the 64-bit parking_lot lock word from a Rust core dump using gdb, identifying a concepts/bitwise-double-free corruption pattern. The lesson is the same: when the thread dump is silent, the heap or core dump isn't.
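Decoding such a word is mechanical once the bit layout is known. The sketch below assumes the flag constants as they appear in parking_lot's raw_rwlock.rs (four low flag bits, reader count above them); treat them as an assumption and check them against the crate version in the binary under investigation. The function name and result keys are hypothetical.

```python
# Bit layout assumed from parking_lot's raw_rwlock.rs; verify for your version.
PARKED_BIT = 0b0001
WRITER_PARKED_BIT = 0b0010
UPGRADABLE_BIT = 0b0100
WRITER_BIT = 0b1000
ONE_READER = 0b10000


def decode_rwlock_word(word: int) -> dict[str, object]:
    """Split a 64-bit lock word (e.g. as printed by gdb's x/gx) into fields."""
    word &= (1 << 64) - 1
    return {
        "parked": bool(word & PARKED_BIT),
        "writer_parked": bool(word & WRITER_PARKED_BIT),
        "upgradable": bool(word & UPGRADABLE_BIT),
        "writer": bool(word & WRITER_BIT),
        "readers": word // ONE_READER,
    }


print(decode_rwlock_word(0xFFFFFFFFFFFFFFFF))
```

Under this layout, an all-ones word decodes to every flag set plus a reader count of 2^60 - 1, which reads as corrupted state rather than as contention.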

Caveats

  • Heap dumps are large (hundreds of MB to several GB), and capturing one pauses the JVM for seconds or longer. Not suitable as a first-touch tool on every ticket.
  • Reading AQS state requires familiarity with the source code. Investment is one-time but non-trivial.
  • Eclipse MAT (or equivalent) is necessary — raw hprof inspection is impractical.
  • Not a substitute for better tooling. The right long-term fix for Netflix's case is for the JDK to add lock metadata back to the jcmd output. Heap-dump introspection is the fallback, not the workflow.
  • Heap dumps may not be allowed in some environments (PII in memory, export-control-sensitive data).

Seen in

Last updated · 319 distilled / 1,201 read