CONCEPT Cited by 1 source
CPU busy-loop incident¶
Definition¶
A CPU busy-loop incident is the recurring operational shape where one or more processes on production hosts peg a core at near-100% with little corresponding useful work, usually because some piece of state-machine code is trapped in a tight poll cycle. The typical incident timeline:
- Monitoring tripwires fire on CPU % and usually a user-visible downstream metric (e.g. HTTP error rate).
- Platform is otherwise fine — other hosts aren't affected, other services aren't affected, the hardware is healthy.
- Bouncing the affected process clears the symptom immediately.
- Symptom comes back ("we're in an annoying steady-state of getting paged and bouncing proxies" — Source: sources/2025-02-26-flyio-taming-a-voracious-rust-proxy), because the triggering condition (a specific input pattern) is still producing the trapped state.
- Only deep diagnosis (flamegraph / eBPF profile / core dump) surfaces the actual stuck state machine.
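The definition's symptom pair (a pegged core with little useful work) can be sketched as a simple check over metric samples. This is a minimal illustration, not a tripwire described in the source: the `Sample` shape, the `looks_like_busy_loop` name, and the thresholds are all hypothetical.

```rust
/// One metrics sample for a single process over a sampling window.
/// Hypothetical shape, chosen only for this sketch.
#[derive(Debug, Clone, Copy)]
pub struct Sample {
    /// Fraction of one core used during the window, in 0.0..=1.0.
    pub cpu_fraction: f64,
    /// Units of useful work completed in the window (e.g. requests served).
    pub work_done: u64,
}

/// Returns true when every sample in a non-empty window shows a core at or
/// above `cpu_threshold` while completing at most `max_work` units of work.
pub fn looks_like_busy_loop(window: &[Sample], cpu_threshold: f64, max_work: u64) -> bool {
    !window.is_empty()
        && window
            .iter()
            .all(|s| s.cpu_fraction >= cpu_threshold && s.work_done <= max_work)
}
```

Requiring the condition across the whole window, rather than on a single sample, is an assumption meant to avoid flagging short legitimate CPU spikes; a process that is busy and productive does not match.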
Why bounce-and-hope doesn't work¶
Restarting the process flushes the stuck tasks but leaves the upstream trigger in place. If the trigger is deterministic on an input pattern (a specific TLS close sequence, a specific payload shape, a specific client behaviour), any resumed traffic will hit the same bug and produce the same pathology. At best, bounce-and-hope buys enough time for the upstream trigger to finish (if it is transient) or for a real fix to ship (if it is persistent).
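The argument above can be made concrete with a toy model. This is a sketch under stated assumptions, not fly-proxy's structure: `ToyProxy`, `Conn`, and `replay` are hypothetical names, and the "bug" is modelled as any early-closing connection deterministically ending up stuck.

```rust
/// State of one connection's state machine in this toy model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Conn {
    Healthy,
    /// Trapped in a tight poll cycle.
    Stuck,
}

/// Hypothetical stand-in for a proxy process.
#[derive(Debug, Default)]
pub struct ToyProxy {
    conns: Vec<Conn>,
}

impl ToyProxy {
    /// Accept one connection. Assumed bug: a connection that closes early
    /// always traps its state machine.
    pub fn accept(&mut self, closes_early: bool) {
        self.conns
            .push(if closes_early { Conn::Stuck } else { Conn::Healthy });
    }

    /// Number of connections currently stuck.
    pub fn stuck(&self) -> usize {
        self.conns.iter().filter(|c| **c == Conn::Stuck).count()
    }

    /// "Bounce" the process: all in-memory state is dropped.
    pub fn restart(&mut self) {
        self.conns.clear();
    }
}

/// Replays a traffic pattern into the proxy (true = early-closing connection).
pub fn replay(proxy: &mut ToyProxy, traffic: &[bool]) {
    for &closes_early in traffic {
        proxy.accept(closes_early);
    }
}
```

Replaying a pattern with two early-closing connections leaves two stuck connections; `restart` brings the count to zero; replaying the same pattern brings it straight back to two. The restart changed the process state but not the input that produces the stuck state.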
Canonical diagnostic move¶
Take a flamegraph profile from an angry host. The giveaway in an async-state-machine busy-loop case is infrastructure dominating the profile (entering and exiting Tokio tracing spans, raw poll calls, short libc syscalls that return immediately) with the actual business logic invisible. Fly.io's flamegraph framed the whole diagnosis.
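To see why such a profile is all infrastructure, consider a minimal sketch of a future that never makes progress but always asks to be polled again. This is not the tokio-rustls bug and not Tokio's executor: `SpuriousWake`, `CountingWaker`, and `run_bounded` are hypothetical names, and the executor is a std-only toy with a poll budget so the example terminates.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A future that does no work, wakes itself, and returns Pending forever.
pub struct SpuriousWake;

impl Future for SpuriousWake {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        // Request another poll immediately, with no progress made.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Waker that only counts how many times it was woken.
struct CountingWaker {
    wakes: AtomicUsize,
}

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.wakes.fetch_add(1, Ordering::Relaxed);
    }
    fn wake_by_ref(self: &Arc<Self>) {
        self.wakes.fetch_add(1, Ordering::Relaxed);
    }
}

/// Polls `fut` at most `max_polls` times, re-polling only if it was woken
/// since the previous poll. Returns (polls performed, whether it completed).
pub fn run_bounded<F: Future>(fut: F, max_polls: usize) -> (usize, bool) {
    let counter = Arc::new(CountingWaker { wakes: AtomicUsize::new(0) });
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    let mut polls = 0;
    let mut seen_wakes = 0;
    while polls < max_polls {
        polls += 1;
        if fut.as_mut().poll(&mut cx).is_ready() {
            return (polls, true);
        }
        let wakes = counter.wakes.load(Ordering::Relaxed);
        if wakes == seen_wakes {
            // Not woken: a real executor would park the task; this toy stops.
            break;
        }
        seen_wakes = wakes;
    }
    (polls, false)
}
```

`run_bounded(SpuriousWake, 1000)` spends its entire budget on polls and wake bookkeeping and never completes, while a future that is already ready finishes on the first poll and one that returns Pending without waking is polled once and then left alone. Every cycle in the first case is executor and waker machinery, which is the shape the flamegraph exposes.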
Seen in¶
- sources/2025-02-26-flyio-taming-a-voracious-rust-proxy: two IAD edge hosts running systems/fly-proxy pegged CPU and spiked HTTP errors intermittently over "some number of hours". Bouncing fly-proxy cleared the condition each time; the trigger was a partner (Tigris) load test producing early-closing TLS connections that exposed a Waker bug in tokio-rustls. Bounce-and-hope was the tactical response; flamegraph + upstream fix was the actual resolution.
Related¶
- concepts/spurious-wakeup-busy-loop — the async-Rust / poll-driven sub-pathology this concept generalises.
- concepts/flamegraph-profiling — the diagnostic move.
- patterns/flamegraph-to-upstream-fix — the end-to-end arc.
- patterns/spurious-wakeup-metric — cheap instrumentation that shortens time-to-diagnosis for this class of bug.
- systems/fly-proxy
- companies/flyio