CONCEPT Cited by 1 source
NCCL_IB_TIMEOUT
Definition
NCCL_IB_TIMEOUT is the InfiniBand transport-layer timeout within NCCL that controls how long a connection waits for a downed RDMA port to recover before the connection is torn down. With retries factored in, its effective default works out to approximately 7 seconds, much shorter than most teams realise.
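The ~7 second figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes the InfiniBand local ACK timeout formula of 4.096 µs × 2^timeout per attempt, an NCCL_IB_TIMEOUT value of 18, and an NCCL_IB_RETRY_CNT value of 7. These values are assumptions for illustration (defaults differ between NCCL releases, so check the documentation for your version), and multiplying by the retry count only approximates the real retry behaviour.

```python
def effective_ib_timeout_seconds(ib_timeout: int, retry_count: int) -> float:
    """Approximate how long an IB connection keeps retrying before giving up.

    Assumes each attempt waits 4.096 microseconds * 2**ib_timeout and that
    the total is roughly per-attempt time multiplied by the retry count.
    """
    per_attempt_seconds = 4.096e-6 * (2 ** ib_timeout)
    return per_attempt_seconds * retry_count


if __name__ == "__main__":
    # Assumed values: NCCL_IB_TIMEOUT=18, NCCL_IB_RETRY_CNT=7
    print(f"{effective_ib_timeout_seconds(18, 7):.1f} s")  # prints "7.5 s"
```

Because the per-attempt wait doubles with each increment of the timeout value, raising it by two (for example from 18 to 20) roughly quadruples the effective window under the same assumptions.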
The two-timeout surprise
Most discussions of NCCL configuration focus on the PyTorch NCCL watchdog timeout (configurable via init_process_group(timeout=...), typically ~10 minutes). But NCCL_IB_TIMEOUT sits lower in the stack and fires long before the watchdog:
- NCCL_IB_TIMEOUT ≈ 7s effective → tears connection → collective is dead
- PyTorch watchdog ≈ 10min → only notices the hang much later
Once a single port-down window exceeds ~7s, the connection is irreversibly gone. By the time the watchdog fires, the crash is already inevitable. The NCCL watchdog timeout message in logs is therefore a symptom of the earlier IB connection teardown, not the root cause.
(Source: sources/2026-07-01-databricks-gpu-reliability)
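To make the two layers concrete, here is a minimal sketch of where each timeout is set in a PyTorch job. The helper names nccl_ib_env and init_distributed are hypothetical, and the value 20 is purely illustrative, not a recommendation or the setting Databricks chose. The sketch assumes the job is launched by something like torchrun, so that the rendezvous environment variables are already present.

```python
import os
from datetime import timedelta


def nccl_ib_env(ib_timeout: int = 20, retry_count: int = 7) -> dict[str, str]:
    """Build the environment variables for the lower (IB transport) timeout.

    Hypothetical helper; the default values here are illustrative only.
    """
    return {
        "NCCL_IB_TIMEOUT": str(ib_timeout),
        "NCCL_IB_RETRY_CNT": str(retry_count),
    }


def init_distributed(watchdog_timeout: timedelta = timedelta(minutes=10)) -> None:
    """Set the IB-level timeout, then the PyTorch watchdog timeout."""
    # NCCL reads its environment when communicators are created, so the
    # variables must be set before the process group is initialised.
    os.environ.update(nccl_ib_env())

    # Imported lazily so the sketch can be loaded without PyTorch installed.
    import torch.distributed as dist

    # The upper (watchdog) timeout is a separate knob on the process group.
    dist.init_process_group(backend="nccl", timeout=watchdog_timeout)
```

The point of the sketch is that the two settings live in different places: one is an environment variable consumed by NCCL, the other is an argument to init_process_group. Changing only the second leaves the first at its default.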
Operational implications
- Tuning the PyTorch watchdog timeout alone does not help if the IB timeout already killed the connection.
- Databricks tuned their NCCL_IB_TIMEOUT defaults to be more resilient to transient port-down events.
- The same port-down signal can be used to proactively crash-and-restart from checkpoint rather than waiting for the slow watchdog path.
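The proactive restart idea in the last point can be sketched as a small state tracker. This is a hypothetical illustration, not the mechanism described in the source: the PortState class, its method names, and the 7-second threshold are all assumptions, and how the port's up/down status is actually observed is left out of scope.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PortState:
    """Tracks how long a single RDMA port has been continuously down."""

    down_since: Optional[float] = None  # timestamp of the current outage, if any

    def observe(self, is_up: bool, now: float) -> None:
        """Record one up/down observation taken at time `now` (seconds)."""
        if is_up:
            self.down_since = None  # port recovered, outage window ends
        elif self.down_since is None:
            self.down_since = now  # first observation of a new outage

    def should_restart(self, now: float, ib_timeout_s: float = 7.0) -> bool:
        """True once a single outage has outlasted the assumed IB timeout."""
        return self.down_since is not None and (now - self.down_since) > ib_timeout_s


if __name__ == "__main__":
    port = PortState()
    port.observe(is_up=False, now=100.0)
    print(port.should_restart(now=103.0))  # False: 3s outage, connection may survive
    print(port.should_restart(now=108.0))  # True: 8s outage, restart from checkpoint
```

A scheduler using such a check could trigger a restart from checkpoint as soon as it returns True, instead of waiting roughly ten minutes for the watchdog to report the hang.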
Seen in
- sources/2026-07-01-databricks-gpu-reliability — first detailed public explanation of the two-timeout interaction.
Related
- systems/nccl — the system this timeout lives in
- concepts/cumulative-downtime-vs-flap-count — the metric that determines whether this timeout fires
- concepts/gpu-training-failure-modes — the failure class (crashed jobs) this timeout produces