CONCEPT Cited by 1 source

CPU busy-loop incident

Definition

A CPU busy-loop incident is the recurring operational shape where one or more processes on production hosts peg a core at near-100% with little corresponding useful work, usually because some piece of state-machine code is trapped in a tight poll cycle. The typical incident timeline:

  1. Monitoring tripwires fire on CPU % and usually a user-visible downstream metric (e.g. HTTP error rate).
  2. Platform is otherwise fine — other hosts aren't affected, other services aren't affected, the hardware is healthy.
  3. Bouncing the affected process clears the symptom immediately.
  4. Symptom comes back ("we're in an annoying steady-state of getting paged and bouncing proxies" — Source: sources/2025-02-26-flyio-taming-a-voracious-rust-proxy), because the triggering condition (a specific input pattern) is still producing the trapped state.
  5. Only deep diagnosis (flamegraph / eBPF profile / core dump) surfaces the actual stuck state machine.
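The "trapped state machine" shape can be sketched in miniature. Below is a hypothetical, self-contained Rust example (no Tokio dependency; `BusySpin`, `noop_waker`, and `spin_until` are illustrative names, not anything from fly-proxy): a future that never makes progress but re-schedules itself on every poll, so the driving loop spins at full CPU doing nothing useful — exactly the shape step 5's deep diagnosis surfaces.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Hypothetical future modelling the pathology: it has no real work,
// but asks to be polled again every time, so the executor spins.
struct BusySpin { polls: u32, cap: u32 }

impl Future for BusySpin {
    type Output = u32;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        let this = self.get_mut();
        this.polls += 1;
        if this.polls >= this.cap {
            return Poll::Ready(this.polls); // demo cap; the real bug never resolves
        }
        // The bug in miniature: "wake me again" with no readiness event behind it.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

// No-op waker: wake-ups go nowhere, so the loop below never parks.
fn noop_waker() -> Waker {
    fn raw() -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    unsafe fn clone(_: *const ()) -> RawWaker { raw() }
    unsafe fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(raw()) }
}

// Poll to completion. With a self-waking future this is a pure busy-loop:
// near-100% CPU, zero useful work done per iteration.
fn spin_until(cap: u32) -> u32 {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = BusySpin { polls: 0, cap };
    loop {
        if let Poll::Ready(n) = Pin::new(&mut fut).poll(&mut cx) {
            return n;
        }
    }
}

fn main() {
    println!("polled {} times with no useful work", spin_until(1_000_000));
}
```

A real runtime parks the thread between wake-ups, but a future that wakes itself inside `poll` defeats the parking, which is why the symptom reads as a pegged core rather than a hang.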

Why bounce-and-hope doesn't work

Restarting the process flushes the stuck tasks but leaves the trigger upstream. If the trigger is deterministic on input pattern — a specific TLS close sequence, a specific payload shape, a specific client behaviour — any resumed traffic will hit the same bug and reproduce the same pathology. The most bounce-and-hope buys you is time: enough for the upstream trigger to finish (if it's transient) or for you to ship a real fix (if it's persistent).
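The restart-can't-win dynamic can be made concrete with a toy model. This is a hypothetical sketch (the `Proxy` type, the `"early-close"` trigger string, and `bounce_and_replay` are all invented for illustration), assuming the trigger is deterministic on input pattern as in the Fly.io case:

```rust
// Toy model: a restart resets process state, but the same upstream
// traffic deterministically re-creates the stuck condition.
#[derive(Default)]
struct Proxy { stuck: bool }

impl Proxy {
    fn handle(&mut self, input: &str) {
        // Stand-in for "a specific TLS close sequence traps the state machine".
        if input == "early-close" {
            self.stuck = true;
        }
    }
}

// One bounce-and-hope cycle: fresh process, then replay the live traffic.
fn bounce_and_replay(traffic: &[&str]) -> bool {
    let mut proxy = Proxy::default(); // restart: stuck tasks flushed
    for input in traffic {
        proxy.handle(input); // upstream traffic resumes unchanged
    }
    proxy.stuck // same input pattern, same pathology
}

fn main() {
    let traffic = ["normal", "early-close", "normal"];
    for restart in 1..=3 {
        println!("after restart {}: stuck = {}", restart, bounce_and_replay(&traffic));
    }
}
```

Every cycle ends stuck again, which is the "annoying steady-state of getting paged and bouncing proxies" in code form.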

Canonical diagnostic move

Flamegraph profiling from an angry host. The giveaway in an async-state-machine busy-loop case is infrastructure dominating the profile — entering / exiting Tokio tracing spans, raw poll calls, short libc syscalls that return immediately — with the actual business logic invisible. Fly.io's flamegraph framed the whole diagnosis.
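A common way to get that profile is Linux `perf` plus Brendan Gregg's FlameGraph scripts. A sketch, assuming `perf`, `stackcollapse-perf.pl`, and `flamegraph.pl` are on PATH and using fly-proxy as a stand-in for whatever process is angry (this is the generic workflow, not necessarily the exact commands Fly.io ran):

```shell
# Sample on-CPU stacks of the pegged process at 99 Hz for 30 seconds,
# then fold the stacks and render an interactive SVG flamegraph.
if PID=$(pgrep -o fly-proxy); then
  perf record -F 99 -g -p "$PID" -- sleep 30
  perf script | stackcollapse-perf.pl | flamegraph.pl > busyloop.svg
else
  echo "fly-proxy not running; nothing to profile"
fi
```

In a busy-loop case the resulting SVG is wide towers of poll/waker/tracing-span frames with the request-handling code nowhere visible — the giveaway described above.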

Seen in

  • sources/2025-02-26-flyio-taming-a-voracious-rust-proxy — two IAD edge hosts running systems/fly-proxy pegged CPU and spiked HTTP errors intermittently over "some number of hours". Bouncing fly-proxy cleared the condition each time; the trigger was a partner (Tigris) load test producing early-closing TLS connections that exposed a Waker bug in tokio-rustls. Bounce-and-hope was the tactical response; flamegraph + upstream fix was the actual resolution.