Zipkin Reporter¶
The Zipkin Reporter Java library
(zipkin2.reporter.*) is the OpenZipkin client for asynchronously
shipping span data to a Zipkin-compatible collector. It's used as
the transport layer underneath the higher-level Brave tracer and,
transitively, underneath the Spring-world
Micrometer Tracing stack.
Key classes¶
- BoundedAsyncReporter: non-blocking span reporter with a bounded in-memory queue.
- CountBoundedQueue: the queue implementation. Producer-consumer handoff between span-finishing threads (via offer) and the flusher (via drainTo). The implementation uses a single ReentrantLock + Condition (see source).
- AsyncReporter.Flusher: background platform thread that calls CountBoundedQueue.drainTo → Condition.awaitNanos in a loop.
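The handoff described above can be sketched as a small queue guarded by one ReentrantLock and one Condition. This is a simplified illustration, not the Zipkin Reporter source: the class name SketchBoundedQueue, its fields, and the exact drop-when-full behavior are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified stand-in for a count-bounded span queue.
final class SketchBoundedQueue<S> {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition available = lock.newCondition();
    private final List<S> items = new ArrayList<>();
    private final int maxSize;

    SketchBoundedQueue(int maxSize) {
        this.maxSize = maxSize;
    }

    // Producer side: called by span-finishing threads. It never waits for
    // space; when the queue is full the span is dropped and false is returned.
    boolean offer(S span) {
        lock.lock();
        try {
            if (items.size() == maxSize) {
                return false;
            }
            items.add(span);
            available.signal(); // wake the flusher if it is waiting
            return true;
        } finally {
            lock.unlock();
        }
    }

    // Consumer side: called by the flusher. Waits up to timeoutNanos for at
    // least one item, then moves everything queued into 'out'.
    int drainTo(List<S> out, long timeoutNanos) {
        lock.lock();
        try {
            long remaining = timeoutNanos;
            while (items.isEmpty()) {
                if (remaining <= 0) {
                    return 0; // timed out with nothing to drain
                }
                try {
                    // awaitNanos releases the lock while waiting and
                    // reacquires it before returning.
                    remaining = available.awaitNanos(remaining);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return 0;
                }
            }
            int drained = items.size();
            out.addAll(items);
            items.clear();
            return drained;
        } finally {
            lock.unlock();
        }
    }
}
```

Both sides go through the same lock, so every producer and the single flusher end up in the same wait queue whenever the lock is contended.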
The lock the Netflix incident surfaced¶
The CountBoundedQueue's ReentrantLock is the structural
contention point in Netflix's 2024-07-29 virtual-thread (VT) pinning bug:
- Every span finishing via the Brave path calls CountBoundedQueue.offer, which acquires the lock.
- The AsyncReporter$Flusher holds the lock while draining, releases it via Condition.awaitNanos, and reacquires it after the wait.
- If callers to offer run on virtual threads inside a synchronized block, those VTs get pinned to their carrier threads while blocking on ReentrantLock.lock(). On a 4-vCPU host, 4 such pinned VTs exhaust all carrier threads.
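The caller pattern in the last bullet can be sketched as follows. This is an illustration only, assuming Java 21 or later for Thread.ofVirtual; the class PinningPatternSketch and the methods finishSpan and offer are hypothetical stand-ins, not Brave, Micrometer Tracing, or Zipkin Reporter code. The sketch is arranged so that it always completes: every waiter is inside synchronized, so the lock holder is always mounted and can release the lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of the caller pattern; all names are invented.
final class PinningPatternSketch {
    // Stands in for the single lock inside the reporter's queue.
    static final ReentrantLock queueLock = new ReentrantLock();
    static final AtomicInteger reported = new AtomicInteger();

    // Stand-in for the queue's offer: takes the shared lock briefly.
    static void offer() {
        queueLock.lock();
        try {
            reported.incrementAndGet();
        } finally {
            queueLock.unlock();
        }
    }

    // Stand-in for a span-finish path that reports inside synchronized.
    static void finishSpan(Object spanState) {
        synchronized (spanState) {
            // On Java 21, a virtual thread that has to wait in
            // queueLock.lock() here cannot unmount: it stays pinned to
            // its carrier thread until it gets the lock.
            offer();
        }
    }

    // Finishes 'spans' spans, one virtual thread each, and returns how
    // many reached the queue.
    static int run(int spans) {
        reported.set(0);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < spans; i++) {
            Object spanState = new Object(); // one monitor per span
            threads.add(Thread.ofVirtual().start(() -> finishSpan(spanState)));
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }
        return reported.get();
    }
}
```

Calling PinningPatternSketch.run(8) finishes eight spans. The incident differs from this sketch in one respect: there, waiters that were not pinned (a virtual thread outside synchronized and the platform-thread flusher) shared the wait queue with the pinned ones.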
Zipkin Reporter itself is not at fault — the ReentrantLock is
a correct, efficient primitive. The pinning is a property of
the caller using synchronized on a VT.
Seen in¶
- sources/2024-07-29-netflix-java-21-virtual-threads-dude-wheres-my-lock:
Netflix Java 21 + Spring Boot 3 microservices. 4 pinned VTs, 1
non-pinned VT, and the platform-thread AsyncReporter flusher were all
waiting on the same CountBoundedQueue ReentrantLock. The flusher had
owned the lock, released it via awaitNanos, timed out, and was placed
behind the pinned VTs in the AQS FIFO queue. None could run, which
produced a fleet-wide starvation deadlock.
Related¶
- systems/micrometer-tracing: The upstream Spring observability
abstraction that wraps span-finish in synchronized on its Brave bridge path.
- systems/spring-boot: The framework stack the Netflix incident runs on.
- concepts/virtual-thread-pinning: The failure class that surfaced at this lock.
- companies/netflix: Incident adopter.