Network driver reset

A network driver reset (or "device reset") is a self-healing path in a Linux network driver: the driver concludes that the hardware or its own kernel threads have stopped making expected forward progress and re-initialises the device. The reset typically completes in under 1 ms and causes transient packet loss that TCP recovers from transparently for well-behaved transport workloads. But the cause of the reset is almost always worth investigating: network driver resets are a symptom surface, not a bug class.

Canonical triggers (AWS ENA reference)

From the AWS ENA Linux driver's best-practices documentation, the driver resets on:

  • Unresponsive device. Hardware MMIO read/write timing out.
  • Missing keep-alive events. The device-to-driver liveness heartbeat lapses.
  • Tx completion timeouts. The Tx thread hasn't completed its queued work in the expected window — hardcoded to 5 s for ENA. This is the trigger Pinterest's 2025 incident hit. (Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)
  • netdev timeout. Kernel-layer watchdog on the network device.
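A reset line can be bucketed into one of these triggers by substring-matching the kernel message. A minimal POSIX sh sketch, assuming the ENA wording shown in the log signature below plus the kernel's generic `NETDEV WATCHDOG` message; exact strings vary across driver versions:

```shell
# Classify an ENA reset trigger from a kernel-log line.
# Matched substrings are assumptions based on the signatures in this note.
classify_reset_trigger() {
  case "$1" in
    *"paused for too long"*)       echo "tx-completion-timeout" ;;
    *"keep alive"*|*"keep-alive"*) echo "missing-keep-alive" ;;
    *"NETDEV WATCHDOG"*)           echo "netdev-timeout" ;;
    *)                             echo "unknown" ;;
  esac
}

classify_reset_trigger \
  "ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000)."
```

Feeding it the Pinterest-style line above yields `tx-completion-timeout`; an unresponsive-device MMIO timeout has no single canonical log string, so it falls through to `unknown` here.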

Kernel log signature (ENA example)

ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000).
Time since last napi 6596000 usec. napi scheduled: 1
ena 0000:20:03.0 eth0: napi handler hasn't been called for a long time
but is scheduled
...
ena 0000:20:03.0: Device reset completed successfully, Driver info:
Elastic Network Adapter (ENA) v2.11.0g

Key fields:

  • TX q N is paused for too long. Identifies the stalled queue; the threshold is in μs.
  • Time since last napi. How long the NAPI poll was delayed.
  • napi scheduled: 1. Work was queued; something prevented it from running.
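The numeric fields can be pulled out of those lines and converted from μs to ms with standard tools. A sketch that assumes the exact message layout shown above (other driver versions may word it differently):

```shell
# Extract queue number, pause threshold, and NAPI delay from the
# ENA log signature, then convert μs to ms for readability.
line1='ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000).'
line2='Time since last napi 6596000 usec. napi scheduled: 1'

queue=$(echo "$line1" | sed -n 's/.*TX q \([0-9]*\) is paused.*/\1/p')
threshold_us=$(echo "$line1" | sed -n 's/.*threshold \([0-9]*\).*/\1/p')
napi_delay_us=$(echo "$line2" | sed -n 's/.*last napi \([0-9]*\) usec.*/\1/p')

echo "queue=$queue threshold_ms=$((threshold_us / 1000)) napi_delay_ms=$((napi_delay_us / 1000))"
```

For the example above this reports a 6596 ms NAPI delay against a 5000 ms threshold, which is exactly the condition that trips the Tx completion timeout.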

Why resets are interesting despite "self-healing"

  • Sub-millisecond downtime is negligible for TCP but disruptive for latency-tight protocols. Pinterest's Ray distributed training saw ObjectFetchTimedOutError, ActorDiedError, and node health-check failures because Ray's gRPC-over-TCP control plane had short liveness windows and stateful actor invariants that didn't gracefully survive the reset window.
  • Reset cadence is diagnostic. Frequent resets on one host (especially one AZ or one instance family) point at a real issue — CPU starvation, hardware degradation, misconfigured interrupts.
  • The reset absorbs the bug. If you don't have /var/log/kern.log grep discipline, the only visible symptom is a brief network glitch — the cause gets normalized away.

Resetting is not a fix; chase the cause

Most resets in production fleets are CPU-starvation-induced: the driver's NAPI thread didn't get CPU time, rather than the hardware misbehaving. Triage discipline:

  1. dmesg | grep -i 'ena\|driver reset' — find recent resets.
  2. Correlate reset timestamps with per-core CPU metrics (concepts/per-core-cpu-visibility).
  3. If one core is saturated at reset time, start temporal profiling to identify the consumer.
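Steps 2–3 can be sketched as a small filter over per-core samples taken around the reset timestamp (e.g. from mpstat -P ALL). The 5% idle cutoff is an arbitrary assumption, not a documented threshold:

```shell
# Flag cores that were effectively saturated at reset time.
# stdin: "<core> <idle%>" pairs, one per line (e.g. parsed from mpstat).
flag_saturated_cores() {
  awk '$2 < 5 { print "core " $1 " saturated (" $2 "% idle) - profile its consumers" }'
}

printf '0 87.2\n1 0.4\n2 91.0\n' | flag_saturated_cores
```

In this sample only core 1 is flagged; in the CPU-starvation scenario that core is the one pinning the NAPI thread, and its consumers are what the temporal profiling step should identify.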
