Network driver reset¶
A network driver reset (or "device reset") is a self-healing path in a Linux network driver — the driver concludes the hardware or its own kernel threads have stopped making expected forward progress and re-initialises the device. The reset typically takes <1 ms and causes transient packet loss that TCP recovers from transparently for well-behaved transport workloads. But the cause of the reset is almost always worth investigating — network driver resets are a symptom surface, not a bug class.
Canonical triggers (AWS ENA reference)¶
From the AWS ENA Linux driver's best-practices documentation, the driver resets on:
- Unresponsive device. Hardware MMIO read/write timing out.
- Missing keep-alive events. The device-to-driver liveness heartbeat lapses.
- Tx completion timeouts. The Tx thread hasn't completed its queued work in the expected window — hardcoded to 5 s for ENA. This is the trigger Pinterest's 2025 incident hit. (Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)
- netdev timeout. Kernel-layer watchdog on the network device.
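A first-pass check for these triggers is to count reset events in a saved kernel log. A minimal sketch, assuming a hypothetical log file at `/tmp/sample_kern.log` — the sample lines are taken from the ENA log signature documented in this note, and the grep pattern matches the driver's completion message:

```shell
# Hypothetical sample log; in production this would be dmesg or /var/log/kern.log.
cat > /tmp/sample_kern.log <<'EOF'
ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000).
ena 0000:20:03.0 eth0: napi handler hasn't been called for a long time but is scheduled
ena 0000:20:03.0: Device reset completed successfully
EOF

# Count completed device resets — each one is a trigger worth root-causing.
resets=$(grep -c 'Device reset completed' /tmp/sample_kern.log)
echo "resets: $resets"
```

On a live host, the same pattern against `dmesg` output gives a quick reset count per boot.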
Kernel log signature (ENA example)¶
ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000). Time since last napi 6596000 usec. napi scheduled: 1
ena 0000:20:03.0 eth0: napi handler hasn't been called for a long time but is scheduled
...
ena 0000:20:03.0: Device reset completed successfully, Driver info: Elastic Network Adapter (ENA) v2.11.0g
Key fields: `TX q N is paused for too long` (threshold in μs), `Time since last napi` (how long the NAPI poll was delayed), `napi scheduled: 1` (work was queued; something prevented it from running).
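Those fields can be pulled straight out of the log line for alerting or correlation. A minimal sketch, assuming the sample line reproduces the ENA example above (all values in microseconds):

```shell
# Sample reset log line, copied from the ENA signature above.
line='ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000). Time since last napi 6596000 usec. napi scheduled: 1'

# Extract the reset threshold and the observed NAPI delay.
threshold=$(echo "$line" | sed -n 's/.*threshold \([0-9]*\).*/\1/p')
delay=$(echo "$line" | sed -n 's/.*Time since last napi \([0-9]*\) usec.*/\1/p')

# The reset fired because the observed delay exceeded the threshold.
echo "delay=${delay}us threshold=${threshold}us"
```

Here the delay (6596000 μs ≈ 6.6 s) exceeds the 5 s threshold, which is exactly the Tx completion timeout trigger described above.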
Why resets are interesting despite "self-healing"¶
- <1 ms downtime is small for TCP, but large for latency-tight protocols. Pinterest's Ray distributed training saw `ObjectFetchTimedOutError`, `ActorDiedError`, and node health-check failures because Ray's gRPC-over-TCP control plane had short liveness windows and stateful actor invariants that didn't gracefully survive the reset window.
- Reset cadence is diagnostic. Frequent resets on one host (especially one AZ or one instance family) point at a real issue — CPU starvation, hardware degradation, misconfigured interrupts.
- The reset absorbs the bug. If you don't have `/var/log/kern.log` grep discipline, the only visible symptom is a brief network glitch — the cause gets normalized away.
Resetting is not a fix; chase the cause¶
Most resets in production fleets are CPU-starvation-induced: the driver's NAPI thread didn't get CPU time, rather than the hardware misbehaving. Triage discipline:
- `dmesg | grep -i 'ena\|driver reset'` — find recent resets.
- Correlate reset timestamps with per-core CPU metrics (concepts/per-core-cpu-visibility).
- If one core is saturated at reset time, start temporal profiling to identify the consumer.
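The correlation step above needs reset timestamps in a joinable form. A minimal sketch, assuming hypothetical `dmesg`-style output with `[seconds-since-boot]` prefixes saved to `/tmp/dmesg_sample.txt` — the sample lines and timestamps are made up for illustration:

```shell
# Hypothetical dmesg-style capture; timestamps are seconds since boot.
cat > /tmp/dmesg_sample.txt <<'EOF'
[12034.5] ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000).
[12035.1] ena 0000:20:03.0: Device reset completed successfully
EOF

# Pull the timestamp of each completed reset, ready to join against
# per-core CPU samples from the same window.
ts=$(sed -n 's/^\[\([0-9.]*\)\].*Device reset completed.*/\1/p' /tmp/dmesg_sample.txt)
echo "$ts"
```

With timestamps in hand, the question becomes: was any single core pegged at 100% in the seconds before each reset?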
Seen in¶
- sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks — canonical case study. ENA resets every few hours on Pinterest's Kubernetes GPU fleet were the starting symptom of a 3-month investigation that ended at zombie memory cgroups leaked by a crash-looping `ecs-agent` systemd unit on the Deep Learning AMI. Caveat: the driver was never at fault; the reset log line is what first made the incident observable.
Related¶
- concepts/cpu-starvation-network-driver — the dominant cause
- concepts/zombie-memory-cgroup — the Pinterest-incident sub-cause
- systems/aws-ena-driver — the ENA-specific reset mechanism