CPU starvation of network driver threads¶
A class of incident where a network driver's kernel thread does not get scheduled onto a CPU core for multiple seconds, during which the hardware driver concludes the device is misbehaving and performs a self-healing reset. The network layer appears broken; the root cause lives in whichever userspace or kernel code consumed the core the network-driver thread was waiting on.
Mechanism¶
Linux network drivers bind Tx/Rx softirq / NAPI threads to specific
CPU cores (sometimes steerable via /proc/irq/*/smp_affinity or
ethtool). If the driver's work isn't executed within an expected
liveness window — for AWS ENA the threshold
is a hard-coded 5 s on Tx pause — the driver trips its self-heal
path and issues a device reset.
A reset is fast (<1 ms) but causes transient packet loss that TCP
usually recovers transparently. (Source:
sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)
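The affinity half of that mechanism is inspectable directly. A minimal sketch, assuming standard Linux procfs paths; the `ena|eth` name pattern is an assumption — adjust it for your NIC:

```shell
# Which cores may service each NIC interrupt? /proc/interrupts and
# /proc/irq/*/smp_affinity_list are standard Linux; the name filter
# ('ena|eth') is a guess -- adjust for your interface.
for irq in $(awk -F: 'tolower($0) ~ /ena|eth/ {print $1+0}' /proc/interrupts); do
  printf 'IRQ %s -> cores %s\n' "$irq" \
    "$(cat /proc/irq/"$irq"/smp_affinity_list 2>/dev/null)"
done
# Kernel threads that run the driver's deferred (softirq/NAPI) work,
# with the core (psr) each is currently on:
ps -eLo pid,psr,comm | awk '$3 ~ /ksoftirqd|napi/' | head
```

If the cores printed here overlap with cores your workload is allowed to saturate, the liveness window described above is at risk.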
Why this shows up as network failures¶
The symptom — "lost network connectivity", Ray
ObjectFetchTimedOutError, ActorDied, failed health checks — points
at the network stack. The cause is CPU scheduling. Investigators
who chase network-layer hypotheses (MTU, connection limits, driver
versions) spin for weeks. The disciplined debugging move is to
treat the ENA reset log line as a CPU-starvation metric and start
looking for whoever is burning CPU on the saturated core at reset
time.
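A hedged starting point for that move: the exact ENA reset message varies across driver and kernel versions, so the grep pattern below is an assumption to check against your own kernel logs before trusting the count.

```shell
# Count suspected driver self-heal resets in the kernel log.
# The pattern is an assumption -- verify it against the actual ENA
# messages on your hosts before treating the count as a metric.
hits=$(dmesg -T 2>/dev/null \
  | grep -ciE 'ena.*(reset|tx.*(stuck|timeout|paused))' || true)
echo "suspected ENA resets in kernel log: ${hits:-0}"
```

Each hit marks a moment when some core starved the driver thread; correlate its timestamp with per-core CPU samples from the same window.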
Why aggregate CPU metrics hide it¶
One saturated core can starve a network-driver thread scheduled onto
it even when the whole-machine CPU utilisation is benign. On a
96-vCPU box, a single core at 100% represents ~1% aggregate — hidden
by any dashboard that averages across cores.
Per-core visibility via
mpstat -P ALL 1 is the canonical triage instrument.
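When sysstat isn't installed, the same per-core view can be derived straight from /proc/stat. A minimal sampler sketch (the 1 s window is arbitrary; field positions follow proc(5)):

```shell
# Per-core busy% over a 1 s window, straight from /proc/stat
# (field 5 = idle, field 6 = iowait; busy = everything else).
snap() { awk '/^cpu[0-9]/ {idle=$5+$6; tot=0; for(i=2;i<=NF;i++) tot+=$i; print $1, idle, tot}' /proc/stat; }
snap > /tmp/cpu_t0; sleep 1; snap > /tmp/cpu_t1
awk 'NR==FNR {i0[$1]=$2; t0[$1]=$3; next}
     {printf "%-6s %5.1f%% busy\n", $1, 100*(1 - ($2-i0[$1])/($3-t0[$1]))}' \
    /tmp/cpu_t0 /tmp/cpu_t1
```

Any single core persistently near 100% here, while the machine-wide average looks healthy, is exactly the needle the averaged dashboards hide.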
Common causes¶
- Zombie kernel state iterated inline: zombie memory cgroups iterated by mem_cgroup_nr_lru_pages (Pinterest's canonical 2025 incident).
- Hot-path kernel syscalls over large bookkeeping structures: any O(N) loop in a kernel code path where N is a user-controlled kernel-state count.
- Userspace workloads with runaway single-threaded behaviour. Compression, crypto, JIT warmup, etc., pinned to a core without CPU quota isolation.
- Misconfigured interrupt affinity. Driver IRQs pinned to cores the workload also wants.
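The runaway-userspace variant is easy to reproduce on a scratch box. A minimal sketch (core number and timings are arbitrary; taskset is from util-linux):

```shell
# Pin a pure busy-loop to core 0; that core saturates while the
# machine-wide average on a many-core box barely moves.
taskset -c 0 sh -c 'while :; do :; done' &
HOG=$!
sleep 1   # give per-core dashboards a moment to register it
# A per-core view (mpstat -P ALL 1) now shows core 0 at ~100%.
kill "$HOG"
```

Run the per-core sampler in a second terminal while the hog is alive: the aggregate number stays flat, the per-core view does not, and anything pinned to core 0 starves.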
Mitigations that don't fix the real problem¶
Pinterest tried the usual textbook CPU-starvation mitigations before the root cause was found, and none of them helped; that failure is itself diagnostic:
- Transparent Huge Pages to cut page-faulting.
- jemalloc in place of glibc to reduce allocator contention.
- taskset CPU affinity to give training workloads pinned cores.
- Interrupt pinning of ENA IRQs onto other cores.
When these standard levers don't move the symptom, the real culprit
is something consuming CPU in a way those levers can't steer — in
Pinterest's case, %sys (not %user) in a kernel syscall.
Fix¶
Find and kill the CPU consumer. This usually requires
temporal profiling to catch the spike
window, since the starvation is sporadic and a random perf sample
rarely catches it.
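One shape that catch-the-window loop can take, sketched under assumptions: the 5% idle trigger and 5 s capture length are arbitrary, mpstat's column layout shifts with locale, and both sysstat and perf must be installed.

```shell
# Camp on the box; fire the profiler only when a core saturates.
# Thresholds and column positions are illustrative assumptions.
profile_spike() {
  while :; do
    # per-core mpstat rows: CPU id in column 3 (12-hour-clock locale),
    # %idle in the last column
    hot=$(mpstat -P ALL 1 1 | awk '$3 ~ /^[0-9]+$/ && $NF+0 < 5 {print $3; exit}')
    if [ -n "$hot" ]; then
      echo "core $hot saturated; capturing stacks"
      perf record -C "$hot" -g -- sleep 5   # catches %sys as well as %usr
      break
    fi
  done
}
```

Profiling by core (-C) rather than by process matters here, since the culprit may be kernel-side work that no PID-scoped profile would attribute correctly.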
Seen in¶
- sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks
— canonical production instance. ENA Tx-paused-5 s resets on
Pinterest's Kubernetes GPU fleet resolved to kubelet +
mem_cgroup_nr_lru_pages + zombie memcgs + crash-looping ecs-agent + Deep Learning AMI default systemd unit. Three months from symptom to root cause.
Related¶
- concepts/zombie-memory-cgroup — the Pinterest-incident cause
- concepts/network-driver-reset — the symptom shape
- concepts/per-core-cpu-visibility — the triage discipline
- concepts/noisy-neighbor — the general family
- systems/aws-ena-driver — the 5 s-threshold driver