Skip to content

SYSTEM Cited by 1 source

NCCL

Definition

NCCL (NVIDIA Collective Communications Library) is NVIDIA's library implementing collective operations (all-reduce, all-gather, reduce-scatter, broadcast, etc.) optimised for multi-GPU and multi-node topologies. It is the de facto standard communication substrate for distributed deep learning frameworks (PyTorch, TensorFlow, JAX) and is the layer that crashes when GPU training jobs fail with the signature "NCCL watchdog timeout" message.

Two timeout layers

Databricks' 2026-07-01 GPU reliability post reveals a critical operational subtlety: NCCL has two independent timeout mechanisms that most teams conflate:

  1. PyTorch NCCL watchdog timeout — configurable via init_process_group(timeout=...), typically set to 10 minutes. Kills a hung collective after the configured duration.

  2. NCCL_IB_TIMEOUT — the InfiniBand transport-layer timeout, controlling how long a connection waits for a downed port to recover. Its effective default works out to approximately 7 seconds with retries factored in. Once a port-down window exceeds this, the connection tears down irreversibly.

The IB timeout fires long before the PyTorch watchdog. By the time the watchdog notices, the run is already committed to crash. The NCCL watchdog timeout message in logs is a symptom, not the cause.

(Source: sources/2026-07-01-databricks-gpu-reliability)

Algorithm switching by payload size

NCCL triggers different code paths depending on message size:

Payload range Protocol / algorithm Dominant metric
KB range Low-latency (LL, LL128) p95 latency
MB range Tree → ring crossover Transitional
GB range Chunking + pipelining BusBW (bus bandwidth)

Hardware issues often manifest in only one code path — comprehensive health checks must sweep the full size spectrum.

(Source: sources/2026-07-01-databricks-gpu-reliability)

Seen in

Last updated · 567 distilled / 1,685 read