
NVLink

NVLink is NVIDIA's high-bandwidth, low-latency intra-node GPU-to-GPU interconnect. It is the communication substrate that makes tensor parallelism viable at per-layer all-reduce cadence — at hundreds of GB/s per link per direction, an all-reduce of layer activations is small enough to hide under compute. PCIe (tens of GB/s) cannot. NVSwitch extends NVLink into a full GPU-to-GPU crossbar within a chassis.
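The bandwidth argument can be made concrete with a back-of-envelope calculation. The sketch below uses the standard ring all-reduce cost model (each GPU moves roughly 2(n−1)/n of the buffer over its link); the activation shape and the 400 GB/s vs 30 GB/s effective bandwidths are illustrative assumptions, not specs for any particular part:

```python
# Back-of-envelope: time to all-reduce one layer's activations across 8 GPUs.
# A ring all-reduce moves ~2*(n-1)/n of the buffer per GPU over its link.
# All numbers below are illustrative assumptions, not vendor specs.

def allreduce_time_us(buffer_bytes: int, link_gb_s: float, n_gpus: int = 8) -> float:
    """Idealized ring all-reduce time in microseconds (bandwidth term only,
    ignoring latency and protocol overhead)."""
    bytes_on_wire = 2 * (n_gpus - 1) / n_gpus * buffer_bytes
    return bytes_on_wire / (link_gb_s * 1e9) * 1e6

# Hypothetical fp16 activation tensor: batch 8, seq 2048, hidden 8192 -> 256 MiB
acts = 8 * 2048 * 8192 * 2

nvlink_us = allreduce_time_us(acts, 400)  # assume ~400 GB/s effective (NVLink-class)
pcie_us = allreduce_time_us(acts, 30)     # assume ~30 GB/s effective (PCIe-class)
print(f"NVLink: {nvlink_us:.0f} us, PCIe: {pcie_us:.0f} us")
```

At these assumed rates the same transfer takes roughly an order of magnitude longer over PCIe, which is the gap that decides whether a per-layer collective hides under compute or dominates the step time.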

Contrast with InfiniBand, which is the inter-node fabric: NVLink runs within a single node; InfiniBand runs between nodes. Together they form the substrate for [[concepts/3d-parallelism|3D parallelism]] across a multi-node training cluster.


Why it matters

  • Tensor parallelism lives inside the NVLink domain in practice. Per-layer all-reduce / all-gather at model-training batch sizes is bandwidth-heavy; spanning across nodes (InfiniBand) at the same cadence is usually not viable.
  • Pipeline parallelism tolerates the slower inter-node hop — it sends only point-to-point activations at stage boundaries, at a much lower communication frequency than per-layer collectives.
  • Practical shape of a 3D-parallel training recipe: TP within the NVLink domain (e.g. 8-way TP across the 8 GPUs in a node), PP across the InfiniBand fabric (e.g. across groups of nodes), DP filling the remaining degree to saturate the fleet.
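The degree arithmetic behind that recipe is just a factorization of the fleet size. A minimal sketch, with a hypothetical 128-node × 8-GPU cluster and the TP/PP degrees from the example above:

```python
# Sketch of mapping a 3D-parallel layout onto a fleet of 8-GPU nodes.
# world_size = TP * PP * DP; the degrees here are hypothetical examples.

def data_parallel_degree(world_size: int, tp: int, pp: int) -> int:
    """DP fills whatever degree remains after TP and PP are fixed."""
    assert world_size % (tp * pp) == 0, "TP * PP must divide the fleet size"
    return world_size // (tp * pp)

world = 1024  # e.g. 128 nodes x 8 GPUs each (illustrative)
tp = 8        # tensor parallel: stays inside one node's NVLink domain
pp = 8        # pipeline parallel: stages span nodes over InfiniBand
dp = data_parallel_degree(world, tp, pp)
print(f"TP={tp} x PP={pp} x DP={dp} = {tp * pp * dp} GPUs")  # DP comes out to 16
```

Keeping TP equal to the node's GPU count is what pins every high-cadence collective inside the NVLink domain; PP and DP then only ever cross the slower fabric.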

Stub — expand with specific NVLink generations, bandwidth numbers, NVSwitch topologies, and NVL72/NVL36 rack-scale configurations as more sources cite them.
