CPU starvation of network driver threads

A class of incident where a network driver's kernel thread does not get scheduled onto a CPU core for multiple seconds, during which the hardware driver concludes the device is misbehaving and performs a self-healing reset. The network layer appears broken; the root cause lives in whichever userspace or kernel code consumed the core the network-driver thread was waiting on.

Mechanism

Linux network drivers service Tx/Rx work in softirq/NAPI context pinned to specific CPU cores (steerable via /proc/irq/*/smp_affinity or ethtool). If that work isn't executed within an expected liveness window — for AWS ENA, a hard-coded 5 s without Tx progress — the driver trips its self-heal path and issues a device reset. The reset itself is fast (<1 ms), but it causes transient packet loss that TCP usually recovers from transparently. (Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)
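The self-heal path can be sketched as a watchdog that compares the last time the Tx path made progress against a hard deadline. This is an illustrative model, not the ENA driver's actual code: the 5-second window comes from the source, but the class, method names, and structure below are invented for clarity.

```python
import time

TX_TIMEOUT_S = 5.0  # ENA's hard-coded Tx liveness window (per the source)

class QueueWatchdog:
    """Illustrative model of a driver's liveness check (not real ENA code)."""

    def __init__(self, now=time.monotonic):
        self.now = now
        self.last_tx_progress = now()
        self.resets = 0

    def on_tx_completion(self):
        # Called whenever the Tx thread actually got CPU time and reaped work.
        self.last_tx_progress = self.now()

    def check(self):
        # Periodic health check: if the Tx path has made no progress for
        # longer than the window -- e.g. because its core was starved --
        # the driver concludes the device is wedged and resets it.
        if self.now() - self.last_tx_progress > TX_TIMEOUT_S:
            self.resets += 1
            self.last_tx_progress = self.now()  # the reset clears the stall
            return "reset"
        return "ok"
```

The key property the note describes falls out directly: the watchdog cannot distinguish "device wedged" from "my thread never got scheduled", so CPU starvation looks exactly like hardware failure.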

Why this shows up as network failures

The symptoms — "lost network connectivity", Ray ObjectFetchTimedOutError, ActorDied, failed health checks — all point at the network stack, but the cause is CPU scheduling. Investigators who chase network-layer hypotheses (MTU, connection limits, driver versions) can spin for weeks. The disciplined debugging move is to treat the ENA reset log line as a CPU-starvation metric and look for whatever is burning CPU on the saturated core at reset time.
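Treating the reset log line as a metric can be as simple as scraping timestamps out of dmesg. A minimal sketch: the exact message text varies by driver version ("Trigger reset" appears in the upstream ena driver, but verify against your own kernel log), so the regex below is an assumption to be adjusted, not a stable interface.

```python
import re

# ASSUMPTION: reset messages contain "ena" and "reset" with a kernel
# timestamp prefix. Confirm the exact wording against your dmesg output.
ENA_RESET_RE = re.compile(r"\[\s*(\d+\.\d+)\].*ena.*reset", re.IGNORECASE)

def ena_reset_timestamps(dmesg_lines):
    """Extract kernel timestamps of ENA reset events from dmesg output.

    Each hit is a data point for the CPU-starvation metric: correlate
    these times with per-core CPU usage, not with the network stack.
    """
    hits = []
    for line in dmesg_lines:
        m = ENA_RESET_RE.search(line)
        if m:
            hits.append(float(m.group(1)))
    return hits
```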

Why aggregate CPU metrics hide it

One saturated core can starve a network-driver thread scheduled onto it even while whole-machine CPU utilisation looks healthy. On a 96-vCPU box, a single core at 100% adds only ~1% to the aggregate — invisible on any dashboard that averages across cores. Per-core visibility via mpstat -P ALL 1 is the canonical triage instrument.
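The arithmetic behind "one hot core hides in the average" is easy to make concrete. This sketch computes per-core busy fractions from two /proc/stat-style snapshots (the same data mpstat reports) and flags cores that are saturated while the aggregate still looks low; the snapshot encoding and thresholds are choices made for illustration.

```python
def per_core_busy(prev, cur):
    """Per-core busy fraction between two /proc/stat-style samples.

    prev and cur map core id -> (busy_ticks, total_ticks), as you would
    accumulate from the per-cpu lines of /proc/stat.
    """
    out = {}
    for cpu, (b1, t1) in cur.items():
        b0, t0 = prev[cpu]
        dt = t1 - t0
        out[cpu] = (b1 - b0) / dt if dt else 0.0
    return out

def hidden_hot_cores(prev, cur, core_thresh=0.95, agg_thresh=0.20):
    """Cores that are saturated while the machine-wide average looks fine."""
    busy = per_core_busy(prev, cur)
    aggregate = sum(busy.values()) / len(busy)
    if aggregate >= agg_thresh:
        return []  # visible on any dashboard; nothing is hidden
    return [cpu for cpu, b in busy.items() if b >= core_thresh]
```

With 96 cores, one pegged core and 95 near-idle ones yield an aggregate around 2% — exactly the regime where an averaged dashboard shows nothing wrong.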

Common causes

  • Zombie kernel state iterated inline. Zombie memory cgroups iterated by mem_cgroup_nr_lru_pages (Pinterest's canonical 2025 incident).
  • Hot-path kernel syscalls over large bookkeeping structures. Any O(N) loop in a kernel code path where N is user-controlled kernel-state count.
  • Userspace workloads with runaway single-threaded behaviour. Compression, crypto, JIT warmup, etc., pinned to a core without CPU quota isolation.
  • Misconfigured interrupt affinity. Driver IRQs pinned to cores the workload also wants.
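The last cause above is mechanically checkable: compare the driver's IRQ affinity against the cores the workload is pinned to. A small sketch, assuming the kernel's cpulist string format used by /proc/irq/<n>/smp_affinity_list and cpuset.cpus; the function names are hypothetical.

```python
def parse_cpu_list(s):
    """Parse a kernel cpulist string like '0-3,8,10-11' (the format of
    /proc/irq/<n>/smp_affinity_list and cpuset.cpus) into a set of core ids."""
    cpus = set()
    for part in s.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def irq_workload_overlap(irq_affinity, workload_cpus):
    """Cores where driver IRQs and a pinned workload will fight for CPU."""
    return parse_cpu_list(irq_affinity) & parse_cpu_list(workload_cpus)
```

A non-empty overlap means a CPU-hungry workload can delay the driver's work on those cores, which is precisely the starvation setup this note describes.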

Mitigations that don't fix the real problem

Pinterest tried the usual textbook CPU-starvation mitigations before finding the root cause, and none helped — which is itself diagnostic:

  • Transparent Huge Pages to cut page-fault overhead.
  • jemalloc in place of glibc to reduce allocator contention.
  • taskset CPU affinity to give training workloads pinned cores.
  • Interrupt pinning of ENA IRQs onto other cores.

When these standard levers don't move the symptom, the real culprit is consuming CPU in a way those levers can't steer — in Pinterest's case, %sys time (not %user) burned inside a kernel syscall.

Fix

Find and kill the CPU consumer. This usually requires temporal profiling to catch the spike window, since the starvation is sporadic and a random perf sample rarely catches it.
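Temporal profiling here means arming the profiler on a trigger rather than sampling at random. A sketch of that pattern: a pure decision function picks the saturated core, and a capture step invokes perf record -C (which profiles a single CPU) while the spike is still live. The polling harness, thresholds, and function names are illustrative choices, not a standard tool.

```python
import subprocess

def starved_core(busy_by_core, thresh=0.95):
    """Return the hottest saturated core, or None.

    busy_by_core maps core id -> busy fraction over the last sampling
    interval (e.g. computed from /proc/stat deltas).
    """
    hot = {c: b for c, b in busy_by_core.items() if b >= thresh}
    if not hot:
        return None
    return max(hot, key=hot.get)

def capture_profile(core, seconds=5):
    """Profile one saturated CPU while the spike is live.

    perf record -C <core> samples that CPU system-wide, so it catches
    %sys time in kernel code paths as well as userspace consumers.
    """
    subprocess.run(
        ["perf", "record", "-C", str(core), "-g", "--", "sleep", str(seconds)],
        check=True,
    )
```

Run the decision function every second and call capture_profile the moment it returns a core; the resulting flame graph names the consumer that a whole-machine average could never surface.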
