SYSTEM Cited by 1 source

AWS ENA driver (Elastic Network Adapter)

The ENA Linux kernel driver (amzn/amzn-drivers) is the standard Linux network driver for AWS EC2 instance types using Elastic Network Interfaces (ENIs). It sets up per-queue receive and transmit rings for packet buffering against the underlying ENI.

Self-healing reset mechanism

The ENA driver includes a device-reset self-healing path triggered when the driver detects unexpected device behaviour (from the upstream ENA Linux Best Practices doc):

  • Unresponsive device
  • Missing keep-alive events
  • Tx completion timeouts (the 2025 Pinterest-incident trigger)
  • netdev timeout

"The device reset is a rare event, lasts less than a millisecond and might incur loss of traffic during this time, which is expected to be recovered by the transport protocol in the instance kernel."

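The Tx-paused threshold is hard-coded to 5 s, which the driver logs in microseconds (threshold 5000000). If the driver's Tx queue thread doesn't get CPU time for 5 s, the driver concludes something is wrong and resets. In the excerpt that follows, 6596000 usec (about 6.6 s) had passed since the last NAPI run on Tx queue 5. The kernel log signature: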

ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000). Time since last napi 6596000 usec. napi scheduled: 1
ena 0000:20:03.0 eth0: napi handler hasn't been called for a long time but is scheduled
...
ena 0000:20:03.0: Device reset completed successfully, Driver info: Elastic Network Adapter (ENA) v2.11.0g
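
To spot this signature in collected kernel logs, a small parser can help. The following is a minimal sketch, not part of the driver or of any AWS tooling: the function name `scan_ena_log` and the regular expression are assumptions modelled only on the excerpt above, and other driver versions may word the messages differently.

```python
import re

# Pattern modelled on the log excerpt in this page (an assumption, not a
# stable interface): captures the Tx queue index, the threshold and the
# time since the last NAPI run, both in microseconds.
TX_PAUSED = re.compile(
    r"TX q (?P<queue>\d+) is paused for too long \(threshold (?P<threshold>\d+)\)\.\s*"
    r"Time since last napi (?P<since>\d+) usec"
)
RESET_DONE = "Device reset completed successfully"


def scan_ena_log(text):
    """Return (tx_pause_events, reset_count) found in a kernel log excerpt."""
    # Log excerpts are sometimes wrapped across lines, so collapse all
    # whitespace before matching.
    flat = " ".join(text.split())
    events = [
        {
            "queue": int(m["queue"]),
            "threshold_usec": int(m["threshold"]),
            "since_napi_usec": int(m["since"]),
        }
        for m in TX_PAUSED.finditer(flat)
    ]
    return events, flat.count(RESET_DONE)


SAMPLE = """\
ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000).
Time since last napi 6596000 usec. napi scheduled: 1
ena 0000:20:03.0: Device reset completed successfully, Driver info:
Elastic Network Adapter (ENA) v2.11.0g
"""

events, resets = scan_ena_log(SAMPLE)
for e in events:
    over = e["since_napi_usec"] - e["threshold_usec"]
    print(f"queue {e['queue']}: {over} usec over threshold; resets seen: {resets}")
```

Comparing `since_napi_usec` with `threshold_usec` shows how long the queue went unserviced past the limit, which is a rough measure of how severe the starvation was.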

Reset causes

The ENA documentation names CPU starvation explicitly — whenever the ENA driver's kernel threads don't get CPU time for several seconds, the reset fires. This is why the ENA reset is a symptom, not a root cause: it points at whatever is consuming CPU to the exclusion of the network-driver thread. See concepts/cpu-starvation-network-driver.
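
Because the reset points at a CPU consumer rather than at the network, one way to look for it is to compare per-core system time between two `/proc/stat` snapshots. The sketch below is illustrative only: the helper names `parse_proc_stat` and `sys_share` and the sample counters are made up for this example, and it assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal, ...).

```python
def parse_proc_stat(text):
    """Map 'cpuN' -> list of jiffy counters from /proc/stat-style text."""
    cpus = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-CPU rows look like 'cpu3 ...'; skip the aggregate 'cpu' row.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            cpus[fields[0]] = [int(v) for v in fields[1:]]
    return cpus


def sys_share(before, after):
    """Fraction of elapsed time each CPU spent in system (kernel) mode."""
    shares = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        delta = [n - o for n, o in zip(new, old)]
        # Sum user..steal only; guest time is already counted inside user.
        total = sum(delta[:8])
        shares[cpu] = delta[2] / total if total else 0.0
    return shares


# Hypothetical snapshots: cpu1 spends the whole interval in kernel mode.
BEFORE = """\
cpu  200 0 100 1600 0 0 0 0 0 0
cpu0 100 0 50 800 0 0 0 0 0 0
cpu1 100 0 50 800 0 0 0 0 0 0
"""
AFTER = """\
cpu  210 0 210 1680 0 0 0 0 0 0
cpu0 110 0 60 880 0 0 0 0 0 0
cpu1 100 0 150 800 0 0 0 0 0 0
"""

shares = sys_share(parse_proc_stat(BEFORE), parse_proc_stat(AFTER))
hot = max(shares, key=shares.get)
print(hot, shares[hot])
```

On a live host the two snapshots would come from reading `/proc/stat` a few seconds apart; a single core pinned near 100% system time is the kind of pattern described here.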

Operational implications

  • Transport-layer recovery expected. The <1 ms reset is small enough that TCP recovers transparently for most workloads.
  • Network-sensitive workloads can still fail. Long-lived gRPC streams and distributed training jobs with strict liveness windows (e.g. Ray's object-reference expiry) can see correlated failures across the reset window even when TCP retransmits succeed. See sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks: the Pinterest incident manifested as Ray training-job crashes with various Ray-level symptoms (ObjectFetchTimedOutError, ActorDiedError, node-health-check failures), not as generic network errors.
  • Interrupt pinning and CPU affinity mitigations (as Pinterest tried via taskset / ENA interrupt steering) can help if userspace load on the ENA thread's home core is the direct cause of the starvation. In Pinterest's case they did not help, because the starvation was driven by one core burning %sys in the kernel's memcg iteration loop, not by userspace workload on the ENA thread's home core.
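
Before or after trying interrupt steering, it can be useful to check which cores are actually servicing the ENA queue interrupts. The following is a rough sketch that parses `/proc/interrupts`-style text. The helper names and the sample rows are hypothetical, and the queue interrupt naming (`eth0-Tx-Rx-N`) is an assumption that may differ by driver version and interface name.

```python
def irq_cpu_counts(text, name_substr):
    """Map IRQ label -> per-CPU interrupt counts for rows matching name_substr."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        # Split on the first colon only; later columns may contain colons too.
        label, _, rest = line.partition(":")
        fields = rest.split()
        if name_substr not in rest or len(fields) < ncpu:
            continue
        result[label.strip()] = [int(v) for v in fields[:ncpu]]
    return result


def busiest_cpu(counts):
    """Index of the CPU that handled the most interrupts for one IRQ."""
    return max(range(len(counts)), key=counts.__getitem__)


# Hypothetical two-CPU sample; real output has one column per online CPU.
SAMPLE = """\
            CPU0       CPU1
  27:    1000000          0   PCI-MSI 81920-edge      eth0-Tx-Rx-0
  28:          0     500000   PCI-MSI 81921-edge      eth0-Tx-Rx-1
 NMI:          0          0   Non-maskable interrupts
"""

counts = irq_cpu_counts(SAMPLE, "Tx-Rx")
for irq, per_cpu in counts.items():
    print(f"IRQ {irq} is mostly handled on CPU{busiest_cpu(per_cpu)}")
```

If the core handling a queue's interrupts is the same one showing sustained kernel-mode load, moving the interrupt may help; if the load is in kernel code that runs regardless of affinity, as in the Pinterest case, it will not.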

Seen in
