
Zipkin Reporter

The Zipkin Reporter Java library (zipkin2.reporter.*) is the OpenZipkin client for asynchronously shipping span data to a Zipkin-compatible collector. It's used as the transport layer underneath the higher-level Brave tracer and, transitively, underneath the Spring-world Micrometer Tracing stack.

Key classes

  • BoundedAsyncReporter — non-blocking span reporter with a bounded in-memory queue.
  • CountBoundedQueue — the queue implementation; it mediates the producer-consumer handoff between span-finishing threads (via offer) and the flusher (via drainTo), using a single ReentrantLock plus a Condition — see source.
  • AsyncReporter.Flusher — background platform thread that loops calling CountBoundedQueue.drainTo, blocking via Condition.awaitNanos when the queue is empty.
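The handoff described above can be sketched as follows. This is a minimal, hypothetical simplification — not the real zipkin2.reporter source — assuming a fixed-capacity array guarded by one ReentrantLock, with offer dropping spans when full (bounded, never blocking producers) and drainTo parking the flusher on a Condition:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Illustrative stand-in for CountBoundedQueue; names and structure are assumptions.
class BoundedSpanQueue<S> {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition available = lock.newCondition();
  private final Object[] elements;
  private int count;

  BoundedSpanQueue(int maxSize) {
    this.elements = new Object[maxSize];
  }

  /** Called by span-finishing threads. Drops the span when the queue is full. */
  boolean offer(S next) {
    lock.lock(); // the contention point: every finished span acquires this
    try {
      if (count == elements.length) return false; // bounded: drop, don't block
      elements[count++] = next;
      available.signal(); // wake the flusher if it is parked in awaitNanos
      return true;
    } finally {
      lock.unlock();
    }
  }

  /** Called by the flusher. Waits up to nanos for data, then drains everything. */
  @SuppressWarnings("unchecked")
  int drainTo(Consumer<S> consumer, long nanos) throws InterruptedException {
    lock.lock();
    try {
      // awaitNanos releases the lock while waiting, then reacquires it
      if (count == 0) available.awaitNanos(nanos);
      int drained = count;
      for (int i = 0; i < count; i++) {
        consumer.accept((S) elements[i]);
        elements[i] = null;
      }
      count = 0;
      return drained;
    } finally {
      lock.unlock();
    }
  }
}
```

The key property for what follows: both the producer path (offer) and the consumer path (drainTo) funnel through the same ReentrantLock, and the flusher's awaitNanos release-and-reacquire cycle puts it back into that lock's wait queue alongside the producers.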

The lock the Netflix incident surfaced

The CountBoundedQueue's ReentrantLock is the structural contention point in Netflix's 2024-07-29 VT-pinning bug:

  • Every span finishing via the Brave path calls CountBoundedQueue.offer, which acquires the lock.
  • The AsyncReporter$Flusher holds the lock while draining, releases it via Condition.awaitNanos, and reacquires it after the wait.
  • If callers to offer run on virtual threads inside a synchronized block, those VTs get pinned to their carrier threads while blocking on ReentrantLock.lock(). On a 4-vCPU host, 4 such pinned VTs exhaust all carrier threads.

Zipkin Reporter itself is not at fault — the ReentrantLock is a correct, efficient primitive. The pinning is a property of callers that block on it from inside a synchronized block while running on a virtual thread.
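The caller-side hazard can be shown in isolation. This sketch (Java 21+; all names are illustrative, not from the Netflix code or Zipkin Reporter) contrasts the pinning pattern — a virtual thread blocking on a ReentrantLock while inside a synchronized block, which keeps the VT mounted on its carrier — with the safe pattern, where the VT unmounts while waiting:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class PinningSketch {
  // Stands in for CountBoundedQueue's internal lock.
  static final ReentrantLock reporterLock = new ReentrantLock();
  static final Object monitor = new Object();

  // Hazardous on a VT: the monitor frame prevents unmounting, so blocking
  // in reporterLock.lock() occupies the carrier thread (pinning, pre-JDK 24).
  static void pinnedOffer(Runnable report) {
    synchronized (monitor) {
      reporterLock.lock();
      try { report.run(); } finally { reporterLock.unlock(); }
    }
  }

  // Safe: no monitor held around the blocking acquire; a waiting VT unmounts
  // and frees its carrier for other virtual threads.
  static void unpinnedOffer(Runnable report) {
    reporterLock.lock();
    try { report.run(); } finally { reporterLock.unlock(); }
  }

  public static void main(String[] args) throws InterruptedException {
    CountDownLatch done = new CountDownLatch(2);
    Thread.ofVirtual().start(() -> pinnedOffer(done::countDown));
    Thread.ofVirtual().start(() -> unpinnedOffer(done::countDown));
    done.await();
    System.out.println("both offers completed");
  }
}
```

This demo completes because the lock is only briefly contended; the incident scenario is the same pattern under load, where enough simultaneously pinned VTs exhaust the carrier pool before any of them can acquire the lock. Running the pinned variant with -Djdk.tracePinnedThreads=full surfaces the pinning stack at runtime.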

Seen in

  • sources/2024-07-29-netflix-java-21-virtual-threads-dude-wheres-my-lock — Netflix Java 21 + Spring Boot 3 microservices: 4 pinned VTs and 1 non-pinned VT, plus the platform-thread AsyncReporter flusher, all waiting on the same CountBoundedQueue ReentrantLock. The flusher owned the lock, released it via awaitNanos, timed out, and the AQS FIFO queue placed it behind the pinned VTs. None could run — a fleet-wide starvation deadlock.