SYSTEM Cited by 1 source

NCCL¶

Definition¶

NCCL (NVIDIA Collective Communications Library) is NVIDIA's library implementing collective operations (all-reduce, all-gather, reduce-scatter, broadcast, etc.) optimised for multi-GPU and multi-node topologies. It is the de facto standard communication substrate for distributed deep learning frameworks (PyTorch, TensorFlow, JAX) and is the layer that crashes when GPU training jobs fail with the signature "NCCL watchdog timeout" message.

Two timeout layers¶

Databricks' 2026-07-01 GPU reliability post reveals a critical operational subtlety: NCCL has two independent timeout mechanisms that most teams conflate:

PyTorch NCCL watchdog timeout — configurable via init_process_group(timeout=...), typically set to 10 minutes. Kills a hung collective after the configured duration.
NCCL_IB_TIMEOUT — the InfiniBand transport-layer timeout, controlling how long a connection waits for a downed port to recover. Its effective default works out to approximately 7 seconds with retries factored in. Once a port-down window exceeds this, the connection tears down irreversibly.

The IB timeout fires long before the PyTorch watchdog. By the time the watchdog notices, the run is already committed to crash. The NCCL watchdog timeout message in logs is a symptom, not the cause.

(Source: sources/2026-07-01-databricks-gpu-reliability)

Algorithm switching by payload size¶

NCCL triggers different code paths depending on message size:

Payload range	Protocol / algorithm	Dominant metric
KB range	Low-latency (LL, LL128)	p95 latency
MB range	Tree → ring crossover	Transitional
GB range	Chunking + pipelining	BusBW (bus bandwidth)

Hardware issues often manifest in only one code path — comprehensive health checks must sweep the full size spectrum.