NCCL
Definition
NCCL (NVIDIA Collective Communications Library) is NVIDIA's library of collective operations (all-reduce, all-gather, reduce-scatter, broadcast, etc.) optimised for multi-GPU and multi-node topologies. It is the de facto standard communication substrate for distributed deep learning frameworks (PyTorch, TensorFlow, JAX), and it is the layer where failing GPU training jobs typically surface the signature "NCCL watchdog timeout" message.
Two timeout layers
Databricks' 2026-07-01 GPU reliability post highlights a critical operational subtlety. A PyTorch job running over NCCL has two independent timeout mechanisms, which most teams conflate:
- PyTorch NCCL watchdog timeout: configurable via init_process_group(timeout=...), typically set to 10 minutes. Kills a hung collective after the configured duration.
- NCCL_IB_TIMEOUT: the InfiniBand transport-layer timeout, controlling how long a connection waits for a downed port to recover. Its effective default works out to approximately 7 seconds with retries factored in. Once a port-down window exceeds this, the connection tears down irreversibly.
The IB timeout fires long before the PyTorch watchdog does. By the time the watchdog notices the hung collective, the connection is already torn down and the run is certain to crash. The "NCCL watchdog timeout" message in logs is therefore a symptom, not the cause.
(Source: sources/2026-07-01-databricks-gpu-reliability)
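The gap between the two layers can be made concrete with a little arithmetic. The sketch below assumes the formula given in NVIDIA's NCCL documentation for NCCL_IB_TIMEOUT (a per-attempt timeout of 4.096 µs × 2^value) and treats the retry count (NCCL_IB_RETRY_CNT) as a simple multiplier, which is a simplification of how the InfiniBand transport actually behaves. The inputs 18 and 7 are illustrative values chosen because they reproduce the roughly 7-second figure above; check the defaults for your NCCL version. The helper name ib_teardown_window_s is hypothetical.

```python
from datetime import timedelta

# Base unit of the InfiniBand local ACK timeout, per the NCCL docs.
IB_TIMEOUT_UNIT_S = 4.096e-6


def ib_teardown_window_s(ib_timeout: int, retry_cnt: int) -> float:
    """Rough time before the IB transport gives up on a connection.

    Assumption: total window = per-attempt timeout * retry count.
    """
    per_attempt_s = IB_TIMEOUT_UNIT_S * (2 ** ib_timeout)
    return per_attempt_s * retry_cnt


# Illustrative values (assumed, not read from a live NCCL install).
window_s = ib_teardown_window_s(ib_timeout=18, retry_cnt=7)

# The watchdog timeout is what you would pass to PyTorch, e.g.
#   torch.distributed.init_process_group(backend="nccl", timeout=watchdog)
watchdog = timedelta(minutes=10)

print(f"IB teardown window: {window_s:.1f} s")
print(f"PyTorch watchdog:   {watchdog.total_seconds():.0f} s")
print(f"Watchdog fires ~{watchdog.total_seconds() / window_s:.0f}x later")
```

Under these assumptions the transport gives up within seconds while the watchdog waits minutes, which is why the watchdog message appears long after the failure that caused it.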
Algorithm switching by payload size
NCCL selects different protocols and algorithms, and therefore exercises different code paths, depending on message size:
| Payload range | Protocol / algorithm | Dominant metric |
|---|---|---|
| KB range | Low-latency (LL, LL128) | p95 latency |
| MB range | Tree → ring crossover | Transitional |
| GB range | Chunking + pipelining | BusBW (bus bandwidth) |
Hardware issues often manifest in only one code path, so comprehensive health checks must sweep the full range of payload sizes.
(Source: sources/2026-07-01-databricks-gpu-reliability)
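A size sweep of this kind can be sketched as follows. This is not the Databricks implementation: the function names, the 1 KiB to 4 GiB range, and the 1 MiB / 1 GiB regime boundaries are all assumptions made for illustration, since the real protocol and algorithm crossover points depend on topology and NCCL version. The bus-bandwidth correction for all-reduce, busbw = algbw × 2(n−1)/n, is the one documented in NVIDIA's nccl-tests.

```python
def sweep_sizes(min_bytes: int = 1 << 10, max_bytes: int = 1 << 32):
    """Yield payload sizes from min_bytes to max_bytes, doubling each step."""
    size = min_bytes
    while size <= max_bytes:
        yield size
        size *= 2


def regime(size_bytes: int) -> str:
    """Label a payload with the metric that matters most (assumed cutoffs)."""
    if size_bytes < 1 << 20:
        return "KB range: watch p95 latency"
    if size_bytes < 1 << 30:
        return "MB range: transitional"
    return "GB range: watch bus bandwidth"


def allreduce_busbw_gbps(size_bytes: int, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s for an all-reduce of size_bytes taking seconds."""
    algbw = size_bytes / seconds / 1e9  # algorithm bandwidth, GB/s
    return algbw * 2 * (n_ranks - 1) / n_ranks


# List the sizes a health check would need to time on real hardware.
for size in sweep_sizes():
    print(f"{size:>12d} B  {regime(size)}")
```

In a real check, each size would be timed with an actual collective (for example via nccl-tests or a torch.distributed all-reduce), and the per-regime metric would be compared against a known-good baseline for the same hardware.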
Seen in
- sources/2026-07-01-databricks-gpu-reliability — detailed NCCL timeout behaviour and bandwidth validation approach.
Related
- concepts/nccl-ib-timeout — the specific low-level timeout that actually kills runs
- systems/gpu-monitor — Databricks' system that validates NCCL bandwidth across payload sizes
- concepts/gpu-training-failure-modes — the failure modes NCCL timeouts surface