CPU starvation of network driver threads¶
A class of incident where a network driver's kernel thread does not get scheduled onto a CPU core for multiple seconds, during which the hardware driver concludes the device is misbehaving and performs a self-healing reset. The network layer appears broken; the root cause lives in whichever userspace or kernel code consumed the core the network-driver thread was waiting on.
Mechanism¶
Linux network drivers bind Tx/Rx softirq / NAPI threads to specific
CPU cores (sometimes steerable via /proc/irq/*/smp_affinity or
ethtool). If the driver's work isn't executed within an expected
liveness window — for AWS ENA the threshold
is a hard-coded 5 s on Tx pause — the driver trips its self-heal
path and issues a device reset.
A reset is fast (<1 ms) but causes transient packet loss that TCP
usually recovers transparently. (Source:
sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)
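The affinity half of that mechanism is inspectable directly. A minimal sketch, assuming standard Linux procfs paths; the `ena|eth` name pattern is an assumption — adjust it for your NIC:

```shell
# Which cores may service each NIC interrupt? /proc/interrupts and
# /proc/irq/*/smp_affinity_list are standard Linux; the name filter
# ('ena|eth') is a guess -- adjust for your interface.
for irq in $(awk -F: 'tolower($0) ~ /ena|eth/ {print $1+0}' /proc/interrupts); do
  printf 'IRQ %s -> cores %s\n' "$irq" \
    "$(cat /proc/irq/"$irq"/smp_affinity_list 2>/dev/null)"
done
# Kernel threads that run the driver's deferred (softirq/NAPI) work,
# with the core (psr) each is currently on:
ps -eLo pid,psr,comm | awk '$3 ~ /ksoftirqd|napi/' | head
```

If the cores printed here overlap with cores your workload is allowed to saturate, the liveness window described above is at risk.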
Why this shows up as network failures¶
The symptom — "lost network connectivity", Ray
ObjectFetchTimedOutError, ActorDied, failed health checks — points
at the network stack. The cause is CPU scheduling. Investigators
who chase network-layer hypotheses (MTU, connection limits, driver
versions) spin for weeks. The disciplined debugging move is to
treat the ENA reset log line as a CPU-starvation metric and start
looking for whoever is burning CPU on the saturated core at reset
time.
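A hedged starting point for that move: the exact ENA reset message varies across driver and kernel versions, so the grep pattern below is an assumption to check against your own kernel logs before trusting the count.

```shell
# Count suspected driver self-heal resets in the kernel log.
# The pattern is an assumption -- verify it against the actual ENA
# messages on your hosts before treating the count as a metric.
hits=$(dmesg -T 2>/dev/null \
  | grep -ciE 'ena.*(reset|tx.*(stuck|timeout|paused))' || true)
echo "suspected ENA resets in kernel log: ${hits:-0}"
```

Each hit marks a moment when some core starved the driver thread; correlate its timestamp with per-core CPU samples from the same window.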
Why aggregate CPU metrics hide it¶
One saturated core can starve a network-driver thread scheduled onto
it even when the whole-machine CPU utilisation is benign. On a
96-vCPU box, a single core at 100% represents ~1% aggregate — hidden
by any dashboard that averages across cores.
Per-core visibility via
mpstat -P ALL 1 is the canonical triage instrument.
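When sysstat isn't installed, the same per-core view can be derived straight from /proc/stat. A minimal sampler sketch (the 1 s window is arbitrary; field positions follow proc(5)):

```shell
# Per-core busy% over a 1 s window, straight from /proc/stat
# (field 5 = idle, field 6 = iowait; busy = everything else).
snap() { awk '/^cpu[0-9]/ {idle=$5+$6; tot=0; for(i=2;i<=NF;i++) tot+=$i; print $1, idle, tot}' /proc/stat; }
snap > /tmp/cpu_t0; sleep 1; snap > /tmp/cpu_t1
awk 'NR==FNR {i0[$1]=$2; t0[$1]=$3; next}
     {printf "%-6s %5.1f%% busy\n", $1, 100*(1 - ($2-i0[$1])/($3-t0[$1]))}' \
    /tmp/cpu_t0 /tmp/cpu_t1
```

Any single core persistently near 100% here, while the machine-wide average looks healthy, is exactly the needle the averaged dashboards hide.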
Common causes¶
- Zombie kernel state iterated inline: zombie memory cgroups iterated by mem_cgroup_nr_lru_pages (Pinterest's canonical 2025 incident).
- Hot-path kernel syscalls over large bookkeeping structures: any O(N) loop in a kernel code path where N is a user-controlled kernel-state count.
- Userspace workloads with runaway single-threaded behaviour. Compression, crypto, JIT warmup, etc., pinned to a core without CPU quota isolation.
- Misconfigured interrupt affinity. Driver IRQs pinned to cores the workload also wants.
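The runaway-userspace variant is easy to reproduce on a scratch box. A minimal sketch (core number and timings are arbitrary; taskset is from util-linux):

```shell
# Pin a pure busy-loop to core 0; that core saturates while the
# machine-wide average on a many-core box barely moves.
taskset -c 0 sh -c 'while :; do :; done' &
HOG=$!
sleep 1   # give per-core dashboards a moment to register it
# A per-core view (mpstat -P ALL 1) now shows core 0 at ~100%.
kill "$HOG"
```

Run the per-core sampler in a second terminal while the hog is alive: the aggregate number stays flat, the per-core view does not, and anything pinned to core 0 starves.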
Mitigations that don't fix the real problem¶
Pinterest tried the usual textbook CPU-starvation mitigations before the root cause was found, and none of them helped; that failure is itself diagnostic:
- Transparent Huge Pages to cut page-faulting.
- jemalloc in place of glibc to reduce allocator contention.
- taskset CPU affinity to give training workloads pinned cores.
- Interrupt pinning of ENA IRQs onto other cores.
When these standard levers don't move the symptom, the real culprit
is something consuming CPU in a way those levers can't steer — in
Pinterest's case, %sys (not %user) in a kernel syscall.
Fix¶
Find and kill the CPU consumer. This usually requires
temporal profiling to catch the spike
window, since the starvation is sporadic and a random perf sample
rarely catches it.
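One shape that catch-the-window loop can take, sketched under assumptions: the 5% idle trigger and 5 s capture length are arbitrary, mpstat's column layout shifts with locale, and both sysstat and perf must be installed.

```shell
# Camp on the box; fire the profiler only when a core saturates.
# Thresholds and column positions are illustrative assumptions.
profile_spike() {
  while :; do
    # per-core mpstat rows: CPU id in column 3 (12-hour-clock locale),
    # %idle in the last column
    hot=$(mpstat -P ALL 1 1 | awk '$3 ~ /^[0-9]+$/ && $NF+0 < 5 {print $3; exit}')
    if [ -n "$hot" ]; then
      echo "core $hot saturated; capturing stacks"
      perf record -C "$hot" -g -- sleep 5   # catches %sys as well as %usr
      break
    fi
  done
}
```

Profiling by core (-C) rather than by process matters here, since the culprit may be kernel-side work that no PID-scoped profile would attribute correctly.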
Seen in¶
- sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks
— canonical production instance. ENA Tx-paused-5 s resets on
Pinterest's Kubernetes GPU fleet resolved to kubelet +
mem_cgroup_nr_lru_pages + zombie memcgs + crash-looping ecs-agent + Deep Learning AMI default systemd unit. Three months from symptom to root cause.
Related¶
- concepts/zombie-memory-cgroup — the Pinterest-incident cause
- concepts/network-driver-reset — the symptom shape
- concepts/per-core-cpu-visibility — the triage discipline
- concepts/noisy-neighbor — the general family
- systems/aws-ena-driver — the 5 s-threshold driver