Skip to content

PATTERN Cited by 1 source

Spurious-wakeup metric

Intent

Emit a cheap metric — a counter or an event — whenever a poll-driven state machine wakes up but makes no forward progress, so that spurious-wakeup busy-loops are detected before they show up as a CPU-pegging incident. The premise is simple:

"Spurious wakeups should be easy to spot, and triggering a metric when they happen should be cheap, because they're not supposed to happen often." (Source: sources/2025-02-26-flyio-taming-a-voracious-rust-proxy)

If the metric ticks more than noise level, something is wrong — either a bug in a state machine, or a misbehaving dependency.

Mechanics (async Rust)

In the Tokio / Future-driven shape:

  • Every poll call that returns Pending immediately after being Waker-woken, without any observable input having arrived, is a spurious wakeup.
  • The corresponding AsyncRead shape: every poll_read that returns Ready without advancing the read cursor or consuming any bytes from the underlying buffer is a spurious Ready.

Either can be instrumented:

  • Middleware counter: wrap the outer Tokio task with a decorator that tracks (wakeups_received, polls_that_made_progress) and exports the ratio or the absolute spurious count.
  • Per-Future counter: specific hot Futures (Duplex-shaped stream-stream proxies, TLS sessions) record their own (poll count, work done) tuples.
  • Sampled profiling trigger: if CPU % on a core crosses a threshold, auto-capture a flamegraph — this is cheap when rare because the threshold isn't usually tripped.

The target ratio is "near-zero always." Any sustained non-zero rate is worth investigating.

Why it's cheap

  • One atomic counter increment per poll is orders of magnitude cheaper than any real work the Future would do if it were making progress.
  • The metric is zero 99.99% of the time — it doesn't add to normal-operation cost, it only exists to surface pathologies.
  • Alerting is one-sided: the ceiling is low, the floor is zero, so threshold selection is trivial.

Caveat

The pattern is about detecting the pathology, not diagnosing it. Once the metric fires, the workflow is still patterns/flamegraph-to-upstream-fix — pull the profile, identify the guilty layer, fix it upstream. The metric shortens the incident-to-diagnosis window; it doesn't skip the diagnosis.

Seen in

  • sources/2025-02-26-flyio-taming-a-voracious-rust-proxy — Fly.io's closing lesson: the post explicitly commits to adding this instrumentation to fly-proxy. Canonical wiki statement, reached by retrospective after an incident that took "some number of hours" to diagnose the hard way. Pattern is aspirational-for-Fly.io-as-of-2025-02-26 rather than a shipped system — but the framing is load-bearing and the cost/benefit math is sound.
Last updated · 200 distilled / 1,178 read