CONCEPT Cited by 1 source

CPU busy-loop incident

Definition

A CPU busy-loop incident is the recurring operational shape where one or more processes on production hosts peg a core at near-100% with little corresponding useful work, usually because some piece of state-machine code is trapped in a tight poll cycle. The typical incident timeline:

  1. Monitoring tripwires fire on CPU % and usually a user-visible downstream metric (e.g. HTTP error rate).
  2. Platform is otherwise fine — other hosts aren't affected, other services aren't affected, the hardware is healthy.
  3. Bouncing the affected process clears the symptom immediately.
  4. Symptom comes back ("we're in an annoying steady-state of getting paged and bouncing proxies" — Source: sources/2025-02-26-flyio-taming-a-voracious-rust-proxy), because the triggering condition (a specific input pattern) is still producing the trapped state.
  5. Only deep diagnosis (flamegraph / eBPF profile / core dump) surfaces the actual stuck state machine.
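The "trapped state machine" shape can be sketched in miniature. Below is a hypothetical, self-contained Rust example (no Tokio dependency; `BusySpin`, `noop_waker`, and `spin_until` are illustrative names, not anything from fly-proxy): a future that never makes progress but re-schedules itself on every poll, so the driving loop spins at full CPU doing nothing useful — exactly the shape step 5's deep diagnosis surfaces.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Hypothetical future modelling the pathology: it has no real work,
// but asks to be polled again every time, so the executor spins.
struct BusySpin { polls: u32, cap: u32 }

impl Future for BusySpin {
    type Output = u32;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        let this = self.get_mut();
        this.polls += 1;
        if this.polls >= this.cap {
            return Poll::Ready(this.polls); // demo cap; the real bug never resolves
        }
        // The bug in miniature: "wake me again" with no readiness event behind it.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

// No-op waker: wake-ups go nowhere, so the loop below never parks.
fn noop_waker() -> Waker {
    fn raw() -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    unsafe fn clone(_: *const ()) -> RawWaker { raw() }
    unsafe fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(raw()) }
}

// Poll to completion. With a self-waking future this is a pure busy-loop:
// near-100% CPU, zero useful work done per iteration.
fn spin_until(cap: u32) -> u32 {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = BusySpin { polls: 0, cap };
    loop {
        if let Poll::Ready(n) = Pin::new(&mut fut).poll(&mut cx) {
            return n;
        }
    }
}

fn main() {
    println!("polled {} times with no useful work", spin_until(1_000_000));
}
```

A real runtime parks the thread between wake-ups, but a future that wakes itself inside `poll` defeats the parking, which is why the symptom reads as a pegged core rather than a hang.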

Why bounce-and-hope doesn't work

Restarting the process flushes the stuck tasks but leaves the trigger upstream. If the trigger is deterministic on input pattern — a specific TLS close sequence, a specific payload shape, a specific client behaviour — any resumed traffic will hit the same bug and reproduce the same pathology. The most bounce-and-hope buys you is time: enough for the upstream trigger to finish (if it's transient) or for you to ship a real fix (if it's persistent).
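The restart-can't-win dynamic can be made concrete with a toy model. This is a hypothetical sketch (the `Proxy` type, the `"early-close"` trigger string, and `bounce_and_replay` are all invented for illustration), assuming the trigger is deterministic on input pattern as in the Fly.io case:

```rust
// Toy model: a restart resets process state, but the same upstream
// traffic deterministically re-creates the stuck condition.
#[derive(Default)]
struct Proxy { stuck: bool }

impl Proxy {
    fn handle(&mut self, input: &str) {
        // Stand-in for "a specific TLS close sequence traps the state machine".
        if input == "early-close" {
            self.stuck = true;
        }
    }
}

// One bounce-and-hope cycle: fresh process, then replay the live traffic.
fn bounce_and_replay(traffic: &[&str]) -> bool {
    let mut proxy = Proxy::default(); // restart: stuck tasks flushed
    for input in traffic {
        proxy.handle(input); // upstream traffic resumes unchanged
    }
    proxy.stuck // same input pattern, same pathology
}

fn main() {
    let traffic = ["normal", "early-close", "normal"];
    for restart in 1..=3 {
        println!("after restart {}: stuck = {}", restart, bounce_and_replay(&traffic));
    }
}
```

Every cycle ends stuck again, which is the "annoying steady-state of getting paged and bouncing proxies" in code form.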

Canonical diagnostic move

Flamegraph profiling from an angry host. The giveaway in an async-state-machine busy-loop case is infrastructure dominating the profile — entering / exiting Tokio tracing spans, raw poll calls, short libc syscalls that return immediately — with the actual business logic invisible. Fly.io's flamegraph framed the whole diagnosis.
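A common way to get that profile is Linux `perf` plus Brendan Gregg's FlameGraph scripts. A sketch, assuming `perf`, `stackcollapse-perf.pl`, and `flamegraph.pl` are on PATH and using fly-proxy as a stand-in for whatever process is angry (this is the generic workflow, not necessarily the exact commands Fly.io ran):

```shell
# Sample on-CPU stacks of the pegged process at 99 Hz for 30 seconds,
# then fold the stacks and render an interactive SVG flamegraph.
if PID=$(pgrep -o fly-proxy); then
  perf record -F 99 -g -p "$PID" -- sleep 30
  perf script | stackcollapse-perf.pl | flamegraph.pl > busyloop.svg
else
  echo "fly-proxy not running; nothing to profile"
fi
```

In a busy-loop case the resulting SVG is wide towers of poll/waker/tracing-span frames with the request-handling code nowhere visible — the giveaway described above.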

Seen in

  • sources/2025-02-26-flyio-taming-a-voracious-rust-proxy — two IAD edge hosts running systems/fly-proxy pegged CPU and spiked HTTP errors intermittently over "some number of hours". Bouncing fly-proxy cleared the condition each time; the trigger was a partner (Tigris) load test producing early-closing TLS connections that exposed a Waker bug in tokio-rustls. Bounce-and-hope was the tactical response; flamegraph + upstream fix was the actual resolution.