PATTERN Cited by 1 source
Spurious-wakeup metric¶
Intent¶
Emit a cheap metric — a counter or an event — whenever a
poll-driven state machine wakes up but makes no forward
progress, so that
spurious-wakeup busy-loops
are detected before they show up as a CPU-pegging incident.
The premise is simple:
"Spurious wakeups should be easy to spot, and triggering a metric when they happen should be cheap, because they're not supposed to happen often." (Source: sources/2025-02-26-flyio-taming-a-voracious-rust-proxy)
If the metric ticks above noise level, something is wrong: either a bug in a state machine or a misbehaving dependency.
Mechanics (async Rust)¶
In the Tokio / Future-driven shape:
- Every poll call that returns Pending immediately after being Waker-woken, without any observable input having arrived, is a spurious wakeup.
- The corresponding AsyncRead shape: every poll_read that returns Ready without advancing the read cursor or consuming any bytes from the underlying buffer is a spurious Ready.
Either can be instrumented:
- Middleware counter: wrap the outer Tokio task with a decorator that tracks (wakeups_received, polls_that_made_progress) and exports the ratio or the absolute spurious count.
- Per-Future counter: specific hot Futures (Duplex-shaped stream-stream proxies, TLS sessions) record their own (poll count, work done) tuples.
- Sampled profiling trigger: if CPU % on a core crosses a threshold, auto-capture a flamegraph; this is cheap when rare because the threshold isn't usually tripped.
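The middleware-counter option can be sketched with the standard library alone. This is a minimal illustration, not Fly.io's implementation: the names PollStats, CountSpurious, Flaky and demo are hypothetical, and it assumes the wrapped future cooperates by bumping a shared progress marker whenever it does real work. A poll that returns Pending without moving that marker is counted as spurious; the first poll is exempt because it is not a wakeup.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical per-task counters; a real service would export these
/// through whatever metrics library it already uses.
#[derive(Default)]
pub struct PollStats {
    /// Bumped by the inner future whenever it does real work.
    pub progress: AtomicU64,
    /// Total polls of the wrapped future.
    pub polls: AtomicU64,
    /// Re-polls that returned Pending without bumping `progress`.
    pub spurious: AtomicU64,
}

/// Decorator that classifies every poll of `inner`.
pub struct CountSpurious<F> {
    inner: Pin<Box<F>>,
    stats: Arc<PollStats>,
    polled_once: bool,
}

impl<F: Future> CountSpurious<F> {
    pub fn new(inner: F, stats: Arc<PollStats>) -> Self {
        Self { inner: Box::pin(inner), stats, polled_once: false }
    }
}

impl<F: Future> Future for CountSpurious<F> {
    type Output = F::Output;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // The very first poll is not a wakeup, so it is never counted.
        let first = !self.polled_once;
        self.polled_once = true;

        let before = self.stats.progress.load(Relaxed);
        let result = self.inner.as_mut().poll(cx);
        self.stats.polls.fetch_add(1, Relaxed);

        let progressed = self.stats.progress.load(Relaxed) != before;
        if !first && result.is_pending() && !progressed {
            self.stats.spurious.fetch_add(1, Relaxed);
        }
        result
    }
}

/// Hypothetical misbehaving future: wakes itself `idle_polls` times with
/// nothing to do, then records progress and completes.
struct Flaky {
    idle_polls: u32,
    stats: Arc<PollStats>,
}

impl Future for Flaky {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.idle_polls > 0 {
            self.idle_polls -= 1;
            cx.waker().wake_by_ref(); // self-wake without any new input
            return Poll::Pending;
        }
        self.stats.progress.fetch_add(1, Relaxed);
        Poll::Ready(())
    }
}

/// Waker that does nothing; the demo loop below simply re-polls.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Drives the wrapped future to completion and returns (polls, spurious).
pub fn demo() -> (u64, u64) {
    let stats = Arc::new(PollStats::default());
    let inner = Flaky { idle_polls: 3, stats: stats.clone() };
    let mut task = CountSpurious::new(inner, stats.clone());

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    while Pin::new(&mut task).poll(&mut cx).is_pending() {}

    (stats.polls.load(Relaxed), stats.spurious.load(Relaxed))
}
```

In this sketch demo() returns (4, 2): four polls in total, of which the second and third are re-polls that made no progress. The comparison only works if each task has its own PollStats; sharing one progress marker across tasks would let another task's work mask a spurious poll.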
The target ratio is "near-zero always." Any sustained non-zero rate is worth investigating.
Why it's cheap¶
- One atomic counter increment per poll is orders of magnitude cheaper than any real work the Future would do if it were making progress.
- In a healthy system the metric sits at zero almost all of the time, so it adds negligible normal-operation cost; it exists only to surface pathologies.
- Alerting is one-sided: the ceiling is low, the floor is zero, so threshold selection is trivial.
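The one-sided alert can be sketched as a rate check over successive samples of the spurious-wakeup counter. The SpuriousAlert type and the ceiling value used below are hypothetical, chosen only for illustration; in practice the same rule would more likely live in the alerting system than in the service itself.

```rust
use std::time::Duration;

/// Hypothetical alert rule: fire when the spurious-wakeup counter grows
/// faster than `max_per_sec` between two samples.
pub struct SpuriousAlert {
    max_per_sec: f64,
    last: u64,
}

impl SpuriousAlert {
    pub fn new(max_per_sec: f64) -> Self {
        Self { max_per_sec, last: 0 }
    }

    /// Feed the current counter value and the time since the previous
    /// sample. Returns true when the rate exceeds the ceiling.
    pub fn observe(&mut self, counter: u64, elapsed: Duration) -> bool {
        // saturating_sub yields 0 if the counter was reset, so a restart
        // is not misread as a huge burst.
        let delta = counter.saturating_sub(self.last);
        self.last = counter;

        let secs = elapsed.as_secs_f64();
        secs > 0.0 && (delta as f64) / secs > self.max_per_sec
    }
}
```

With a ceiling of 10 per second, a sample of 5 spurious wakeups over 10 seconds stays quiet, while a jump of 5,000 over the next 10 seconds fires. Because the healthy floor is zero, there is only one threshold to choose.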
Caveat¶
The pattern is about detecting the pathology, not diagnosing it. Once the metric fires, the workflow is still patterns/flamegraph-to-upstream-fix — pull the profile, identify the guilty layer, fix it upstream. The metric shortens the incident-to-diagnosis window; it doesn't skip the diagnosis.
Seen in¶
- sources/2025-02-26-flyio-taming-a-voracious-rust-proxy —
Fly.io's closing lesson: the post explicitly commits to adding this instrumentation to fly-proxy. Canonical wiki statement, reached by retrospective after an incident that took "some number of hours" to diagnose the hard way. As of 2025-02-26 the pattern is aspirational for Fly.io rather than a shipped system, but the framing is load-bearing and the cost/benefit math is sound.
Related¶
- concepts/spurious-wakeup-busy-loop — the pathology the metric detects.
- concepts/cpu-busy-loop-incident — the operational shape the metric pre-empts.
- concepts/rust-waker — the primitive whose misuse is what gets counted.
- patterns/flamegraph-to-upstream-fix — the diagnostic workflow the metric fires into.
- systems/fly-proxy — the target service for this instrumentation.
- systems/tokio — the substrate.
- companies/flyio