Network driver reset¶
A network driver reset (or "device reset") is a self-healing path in a Linux network driver — the driver concludes the hardware or its own kernel threads have stopped making expected forward progress and re-initialises the device. The reset typically takes <1 ms and causes transient packet loss that TCP recovers from transparently for well-behaved transport workloads. But the cause of the reset is almost always worth investigating — network driver resets are a symptom surface, not a bug class.
Canonical triggers (AWS ENA reference)¶
From the AWS ENA Linux driver's best-practices documentation, the driver resets on:
- Unresponsive device. Hardware MMIO read/write timing out.
- Missing keep-alive events. The device-to-driver liveness heartbeat lapses.
- Tx completion timeouts. The Tx thread hasn't completed its queued work in the expected window — hardcoded to 5 s for ENA. This is the trigger Pinterest's 2025 incident hit. (Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)
- netdev timeout. Kernel-layer watchdog on the network device.
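A first-pass check for these triggers is to count reset events in a saved kernel log. A minimal sketch, assuming a hypothetical log file at `/tmp/sample_kern.log` — the sample lines are taken from the ENA log signature documented in this note, and the grep pattern matches the driver's completion message:

```shell
# Hypothetical sample log; in production this would be dmesg or /var/log/kern.log.
cat > /tmp/sample_kern.log <<'EOF'
ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000).
ena 0000:20:03.0 eth0: napi handler hasn't been called for a long time but is scheduled
ena 0000:20:03.0: Device reset completed successfully
EOF

# Count completed device resets — each one is a trigger worth root-causing.
resets=$(grep -c 'Device reset completed' /tmp/sample_kern.log)
echo "resets: $resets"
```

On a live host, the same pattern against `dmesg` output gives a quick reset count per boot.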
Kernel log signature (ENA example)¶
ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000). Time since last napi 6596000 usec. napi scheduled: 1
ena 0000:20:03.0 eth0: napi handler hasn't been called for a long time but is scheduled
...
ena 0000:20:03.0: Device reset completed successfully, Driver info: Elastic Network Adapter (ENA) v2.11.0g
Key fields: `TX q N is paused for too long` (threshold in μs), `Time since last napi` (how long the NAPI poll was delayed), `napi scheduled: 1` (work was queued; something prevented it from running).
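Those fields can be pulled straight out of the log line for alerting or correlation. A minimal sketch, assuming the sample line reproduces the ENA example above (all values in microseconds):

```shell
# Sample reset log line, copied from the ENA signature above.
line='ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000). Time since last napi 6596000 usec. napi scheduled: 1'

# Extract the reset threshold and the observed NAPI delay.
threshold=$(echo "$line" | sed -n 's/.*threshold \([0-9]*\).*/\1/p')
delay=$(echo "$line" | sed -n 's/.*Time since last napi \([0-9]*\) usec.*/\1/p')

# The reset fired because the observed delay exceeded the threshold.
echo "delay=${delay}us threshold=${threshold}us"
```

Here the delay (6596000 μs ≈ 6.6 s) exceeds the 5 s threshold, which is exactly the Tx completion timeout trigger described above.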
Why resets are interesting despite "self-healing"¶
- <1 ms downtime is small for TCP, but large for latency-tight protocols. Pinterest's Ray distributed training saw `ObjectFetchTimedOutError`, `ActorDiedError`, and node health-check failures because Ray's gRPC-over-TCP control plane had short liveness windows and stateful actor invariants that didn't gracefully survive the reset window.
- Reset cadence is diagnostic. Frequent resets on one host (especially one AZ or one instance family) point at a real issue — CPU starvation, hardware degradation, misconfigured interrupts.
- The reset absorbs the bug. If you don't have `/var/log/kern.log` grep discipline, the only visible symptom is a brief network glitch — the cause gets normalized away.
Resetting is not a fix; chase the cause¶
Most resets in production fleets are CPU-starvation-induced: the driver's NAPI thread didn't get CPU time, rather than the hardware misbehaving. Triage discipline:
- `dmesg | grep -i 'ena\|driver reset'` — find recent resets.
- Correlate reset timestamps with per-core CPU metrics (concepts/per-core-cpu-visibility).
- If one core is saturated at reset time, start temporal profiling to identify the consumer.
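The correlation step above needs reset timestamps in a joinable form. A minimal sketch, assuming hypothetical `dmesg`-style output with `[seconds-since-boot]` prefixes saved to `/tmp/dmesg_sample.txt` — the sample lines and timestamps are made up for illustration:

```shell
# Hypothetical dmesg-style capture; timestamps are seconds since boot.
cat > /tmp/dmesg_sample.txt <<'EOF'
[12034.5] ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000).
[12035.1] ena 0000:20:03.0: Device reset completed successfully
EOF

# Pull the timestamp of each completed reset, ready to join against
# per-core CPU samples from the same window.
ts=$(sed -n 's/^\[\([0-9.]*\)\].*Device reset completed.*/\1/p' /tmp/dmesg_sample.txt)
echo "$ts"
```

With timestamps in hand, the question becomes: was any single core pegged at 100% in the seconds before each reset?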
Seen in¶
- sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks — canonical case study. ENA resets every few hours on Pinterest's Kubernetes GPU fleet were the starting symptom of a 3-month investigation that ended at zombie memory cgroups leaked by a crash-looping `ecs-agent` systemd unit on the Deep Learning AMI. Caveat: the driver was never at fault; the reset log line is what first made the incident observable.
Related¶
- concepts/cpu-starvation-network-driver — the dominant cause
- concepts/zombie-memory-cgroup — the Pinterest-incident sub-cause
- systems/aws-ena-driver — the ENA-specific reset mechanism