CONCEPT Cited by 1 source

NCCL_IB_TIMEOUT

Definition

NCCL_IB_TIMEOUT is the NCCL environment variable that sets the InfiniBand transport-layer timeout. It controls how long a connection waits for a downed RDMA port to recover before the connection is torn down. With retries factored in, its effective default works out to approximately 7 seconds, much shorter than most teams realise.
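
The ~7 second figure can be reproduced with a little arithmetic. The sketch below assumes the formula given in the NCCL documentation (a per-attempt timeout of 4.096 µs × 2^NCCL_IB_TIMEOUT), a default NCCL_IB_TIMEOUT of 18, a default NCCL_IB_RETRY_CNT of 7, and that the effective window is simply the per-attempt timeout multiplied by the retry count. The function name is illustrative, and the defaults should be checked against the NCCL version in use.

```python
# Sketch: estimate the effective NCCL InfiniBand timeout from its two knobs.
# Assumptions: per-attempt timeout = 4.096 us * 2**NCCL_IB_TIMEOUT, and the
# effective window = per-attempt timeout * NCCL_IB_RETRY_CNT (defaults 18, 7).

IB_TIMEOUT_BASE_S = 4.096e-6  # InfiniBand base timeout unit, in seconds


def effective_ib_timeout_seconds(timeout_exp: int = 18, retry_cnt: int = 7) -> float:
    """Approximate seconds before the connection is declared dead."""
    per_attempt = IB_TIMEOUT_BASE_S * (2 ** timeout_exp)  # ~1.07 s at exponent 18
    return per_attempt * retry_cnt


if __name__ == "__main__":
    print(f"{effective_ib_timeout_seconds():.2f} s")  # prints "7.52 s"
```

Because the exponent is a power of two, each increment of NCCL_IB_TIMEOUT doubles the window under these assumptions.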

The two-timeout surprise

Most discussions of NCCL configuration focus on the PyTorch NCCL watchdog timeout (configurable via init_process_group(timeout=...), typically ~10 minutes). But NCCL_IB_TIMEOUT sits lower in the stack and fires long before the watchdog:

  • NCCL_IB_TIMEOUT ≈ 7s effective → tears connection → collective is dead
  • PyTorch watchdog ≈ 10min → only notices the hang much later

Once a single port-down window exceeds ~7s, the connection is irreversibly gone. By the time the watchdog fires, the run is already bound to crash. The NCCL watchdog timeout message in the logs is therefore a symptom of the earlier IB connection teardown, not the root cause.
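
To make the "single port-down window" condition concrete, here is a minimal sketch of a tracker that reports when one continuous down window has outlasted the budget. PortDownTracker is a hypothetical helper, the 7.0 s budget is the approximate figure from the text, and the caller is assumed to supply the port state and a timestamp (how the state is polled is outside this sketch). It only approximates the behaviour described here and does not reproduce NCCL's internal retry logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PortDownTracker:
    """Tracks how long a port has been continuously down (hypothetical helper)."""

    budget_s: float = 7.0               # assumed effective IB timeout from the text
    down_since: Optional[float] = None  # start of the current down window, if any

    def observe(self, port_up: bool, now: float) -> bool:
        """Record one sample; return True once the down window exceeds the budget."""
        if port_up:
            self.down_since = None      # port recovered in time, reset the window
            return False
        if self.down_since is None:
            self.down_since = now       # first sample of a new down window
        return (now - self.down_since) > self.budget_s


if __name__ == "__main__":
    tracker = PortDownTracker()
    samples = [(False, 0.0), (False, 5.0), (True, 6.0), (False, 10.0), (False, 18.0)]
    for port_up, t in samples:
        print(t, tracker.observe(port_up, t))
```

In this run only the last sample returns True: the first outage recovers after 6 s, while the second is still down 8 s after it began.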

(Source: sources/2026-07-01-databricks-gpu-reliability)

Operational implications

  • Tuning the PyTorch watchdog timeout alone does not help if the IB timeout already killed the connection.
  • Databricks tuned their NCCL_IB_TIMEOUT defaults to be more resilient to transient port-down events.
  • The same port-down signal can be used to proactively crash-and-restart from checkpoint rather than waiting for the slow watchdog path.
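
As an illustration of what tuning the IB timeout might involve, the sketch below picks the smallest NCCL_IB_TIMEOUT exponent whose effective window covers a target port-down duration. It assumes the same formula as above (4.096 µs × 2^NCCL_IB_TIMEOUT per attempt, multiplied by NCCL_IB_RETRY_CNT) and an accepted exponent range of 1 to 22. The helper names and the 30-second target are hypothetical; these are not the values Databricks chose, which the source does not state.

```python
# Sketch: size NCCL_IB_TIMEOUT for a target port-down window.
# Assumptions: effective window = 4.096 us * 2**exp * retry_cnt, exp in 1..22.

IB_TIMEOUT_BASE_S = 4.096e-6  # InfiniBand base timeout unit, in seconds


def min_ib_timeout_exponent(target_s: float, retry_cnt: int = 7, max_exp: int = 22) -> int:
    """Smallest exponent whose effective window is at least target_s seconds."""
    for exp in range(1, max_exp + 1):
        if IB_TIMEOUT_BASE_S * (2 ** exp) * retry_cnt >= target_s:
            return exp
    raise ValueError(f"no exponent up to {max_exp} covers {target_s} s")


def nccl_ib_env(target_s: float, retry_cnt: int = 7) -> dict:
    """Environment variables to export before NCCL is initialised."""
    return {
        "NCCL_IB_TIMEOUT": str(min_ib_timeout_exponent(target_s, retry_cnt)),
        "NCCL_IB_RETRY_CNT": str(retry_cnt),
    }


if __name__ == "__main__":
    print(nccl_ib_env(30.0))
```

The returned values would need to be in the environment of every rank before the process group is created, for example via os.environ.update(...) at the top of the training script or in the job launcher. A larger window trades faster failure detection for tolerance of transient port flaps, so the target is a judgment call for each cluster.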
