
Spurious-wakeup busy-loop

Definition

A spurious-wakeup busy-loop is the pathology where an async-Rust (or more generally, any poll-driven) state machine signals readiness to its executor without actually having anything new to do, causing the executor to re-enter poll in a tight loop and burn a CPU core at near 100%. The symptom is "high CPU, low work" — "samples that almost terminate in libc, but spend next to no time in the kernel doing actual I/O" (Source: sources/2025-02-26-flyio-taming-a-voracious-rust-proxy).

The two shapes

Per Fly.io's 2025-02-26 post, the pathology takes two shapes in async Rust:

  1. Pending that wakes itself. A Future returns Pending and fires its own Waker — telling the executor "not ready, poll me again soon" — before anything has changed. Executor re-polls; same state; same spurious wake. Cycle.
  2. Ready that doesn't progress. An AsyncRead returns Ready without actually consuming data / advancing its state machine. The caller — faithfully looping poll_read until it stops being Ready per the contract — spins on it.

Both collapse to the same pathology: the Future looks alive to the executor but isn't making progress, and nothing external is going to wake it up and reset the loop.

Canonical diagnosis signal

Flamegraph profiling turns the pathology inside out: if the trace is dominated by low-level runtime or span-bookkeeping infrastructure (tracing::Subscriber entering/exiting spans, tokio polling machinery, libc syscalls that return immediately) with almost nothing in the actual business-logic leaves, that's the signature. As Fly.io puts it, entering/exiting a tracing span in Tokio is supposed to be very fast, so if it dominates the profile, the code being traced must be doing essentially nothing, which means something is calling poll a lot and getting nothing back.

The Future's fully-qualified type in the flamegraph then identifies the guilty layer. Fly.io's 2025-02 case:

&mut fp_io::copy::Duplex<&mut fp_io::reusable_reader::ReusableReader<
  fp_tcp::peek::PeekableReader<
    tokio_rustls::server::TlsStream<…>>>, …>

Fly.io audited its own wrappers (Duplex, ReusableReader, PeekableReader, MeteredIo, PermittedTcpStream) first; the one third-party layer, tokio_rustls::server::TlsStream, turned out to be the guilty one, via the concepts/tls-close-notify edge case.

Why this is insidious

  • The bug is in one layer but the symptom is consumption of a whole CPU. Every wrapper between the bug and the executor is a false suspect.
  • CPU-pegging incidents present as "platform degradation" but the platform is fine; one or two tasks are just refusing to yield.
  • Routine mitigation (bouncing the process) clears the stuck tasks but not the trigger condition — it comes back as soon as traffic hits the right state again.
  • Cheap instrumentation would help but isn't default — patterns/spurious-wakeup-metric is Fly.io's explicit follow-up: "Spurious wakeups should be easy to spot, and triggering a metric when they happen should be cheap, because they're not supposed to happen often."
