CONCEPT Cited by 1 source

NCCL_IB_TIMEOUT¶

Definition¶

NCCL_IB_TIMEOUT is the NCCL setting for the InfiniBand transport-layer timeout. It controls how long a connection waits for a downed RDMA port to recover before the connection is torn down. With retries factored in, its effective default works out to approximately 7 seconds, much shorter than most teams realise.
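The arithmetic behind a figure like this can be sketched as follows. InfiniBand encodes the timeout as an exponent: one attempt times out after roughly 4.096 µs × 2^NCCL_IB_TIMEOUT, and the transport retries a bounded number of times (NCCL_IB_RETRY_CNT). The values 18 and 7 used below are assumptions, chosen because they reproduce a figure of about 7 seconds; check the defaults for the NCCL version in use. The helper names are illustrative, and real hardware may round the per-attempt timeout.

```python
# Sketch: estimate the effective IB timeout from the two NCCL settings.
# The values passed in below are assumptions, not confirmed defaults.

IB_TIMEOUT_BASE_S = 4.096e-6  # InfiniBand timeout unit: 4.096 microseconds


def per_attempt_timeout_s(ib_timeout: int) -> float:
    """Timeout for a single attempt: 4.096 us * 2**ib_timeout."""
    return IB_TIMEOUT_BASE_S * (2 ** ib_timeout)


def effective_timeout_s(ib_timeout: int, retry_cnt: int) -> float:
    """Rough time until teardown, counting one timeout per retry."""
    return per_attempt_timeout_s(ib_timeout) * retry_cnt


if __name__ == "__main__":
    # Compare a few exponents at an assumed retry count of 7.
    for ib_timeout in (14, 18, 22):
        total = effective_timeout_s(ib_timeout, retry_cnt=7)
        print(f"NCCL_IB_TIMEOUT={ib_timeout}: ~{total:.1f}s before teardown")
```

Because the setting is an exponent, raising it by 1 doubles the window, so small changes to NCCL_IB_TIMEOUT have a large effect on how long a port may stay down.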

The two-timeout surprise¶

Most discussions of NCCL configuration focus on the PyTorch NCCL watchdog timeout (configurable via init_process_group(timeout=...), typically ~10 minutes). NCCL_IB_TIMEOUT, however, sits lower in the stack and fires long before the watchdog:

NCCL_IB_TIMEOUT ≈ 7s effective → tears connection → collective is dead
PyTorch watchdog ≈ 10min → only notices the hang much later

Once a single port-down window exceeds ~7s, the connection is irreversibly gone. By the time the watchdog fires, the run is already certain to crash. The NCCL watchdog timeout message in the logs is therefore a symptom of the earlier IB connection teardown, not the root cause.

(Source: sources/2026-07-01-databricks-gpu-reliability)
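To make the two layers concrete, here is a minimal sketch of where each timeout is set in a PyTorch job. The numeric values and the function names `configure_ib_timeout` and `init_distributed` are assumptions for illustration, not recommendations from the source. The environment variable has to be set before NCCL creates its communicators, which is why it is applied before `init_process_group`.

```python
import os
from datetime import timedelta

# Assumed values for illustration only; tune for your fabric and NCCL version.
ASSUMED_IB_TIMEOUT = 20                    # an exponent, not a number of seconds
ASSUMED_WATCHDOG = timedelta(minutes=10)   # PyTorch-level watchdog timeout


def configure_ib_timeout(ib_timeout: int = ASSUMED_IB_TIMEOUT) -> None:
    """Set the lower-level IB timeout; call before NCCL is initialised."""
    os.environ["NCCL_IB_TIMEOUT"] = str(ib_timeout)


def init_distributed(watchdog: timedelta = ASSUMED_WATCHDOG) -> None:
    """Set the higher-level watchdog timeout when creating the process group."""
    import torch.distributed as dist  # imported lazily; requires PyTorch

    configure_ib_timeout()
    dist.init_process_group(backend="nccl", timeout=watchdog)
```

The two knobs are independent: changing `watchdog` only changes how quickly the hang is reported, while NCCL_IB_TIMEOUT changes whether the connection survives the port-down window in the first place.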

Operational implications¶

Tuning the PyTorch watchdog timeout alone does not help if the IB timeout has already killed the connection.
Databricks tuned their NCCL_IB_TIMEOUT defaults to be more resilient to transient port-down events.
The same port-down signal can be used to proactively crash-and-restart from checkpoint rather than waiting for the slow watchdog path.
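The proactive restart idea in the last point could look roughly like the sketch below. It is not the source's implementation: the 7-second budget, the function names, and the sampling format are all assumptions. The only external interface used is the Linux sysfs port state file, whose contents look like `4: ACTIVE`; the device and port passed to it are examples.

```python
from pathlib import Path
from typing import Iterable, Optional, Tuple

# Assumed budget, matching the ~7 s effective IB timeout described above.
ASSUMED_DOWN_BUDGET_S = 7.0


def port_is_active(device: str, port: int) -> bool:
    """Read the sysfs port state (for example '4: ACTIVE') for one IB port."""
    state = Path(f"/sys/class/infiniband/{device}/ports/{port}/state").read_text()
    return "ACTIVE" in state


def longest_down_window_s(samples: Iterable[Tuple[float, bool]]) -> float:
    """Longest continuous down stretch in (timestamp_s, is_up) samples."""
    longest = 0.0
    down_since: Optional[float] = None
    for ts, is_up in samples:
        if is_up:
            down_since = None          # port recovered; close the window
        else:
            if down_since is None:
                down_since = ts        # first down sample opens a window
            longest = max(longest, ts - down_since)
    return longest


def should_restart(samples: Iterable[Tuple[float, bool]],
                   budget_s: float = ASSUMED_DOWN_BUDGET_S) -> bool:
    """True once a single down window has outlasted the assumed budget."""
    return longest_down_window_s(samples) > budget_s
```

A supervisor could poll `port_is_active` at a fixed interval, feed the samples to `should_restart`, and trigger a restart from checkpoint as soon as it returns true, instead of waiting minutes for the watchdog to report the same failure.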

Seen in¶

sources/2026-07-01-databricks-gpu-reliability: first detailed public explanation of the two-timeout interaction.

Related¶

systems/nccl: the system this timeout lives in
concepts/cumulative-downtime-vs-flap-count: the metric that determines whether this timeout fires
concepts/gpu-training-failure-modes: the failure class (crashed jobs) this timeout produces
