InfiniBand

InfiniBand is a low-latency, high-bandwidth inter-node fabric widely used in HPC and large-scale AI training. It provides RDMA (remote direct memory access), kernel bypass, and hardware-offloaded collectives: the properties that make large distributed-training collective operations (all-reduce, all-gather, reduce-scatter) efficient across many nodes.

Contrast with NVLink, the intra-node GPU-to-GPU interconnect (hundreds of GB/s per link): InfiniBand runs between nodes at per-link bandwidths roughly an order of magnitude lower than NVLink, but it spans the whole cluster.
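A rough back-of-envelope comparison makes the gap concrete. The bandwidth figures below are illustrative assumptions (NVLink on the order of 450 GB/s per GPU, one 400 Gb/s NDR InfiniBand adapter at roughly 50 GB/s), not measured values from any particular cluster:

```python
# Illustrative only: assumed per-GPU bandwidths, not specs of a real system.
NVLINK_GBPS = 450.0  # intra-node aggregate per GPU (assumption)
IB_GBPS = 50.0       # inter-node, one 400 Gb/s NDR HCA = 50 GB/s (assumption)

def transfer_ms(size_gb: float, bandwidth_gbps: float) -> float:
    """Time to move `size_gb` gigabytes at `bandwidth_gbps` GB/s, in milliseconds."""
    return size_gb / bandwidth_gbps * 1000.0

size_gb = 1.0  # e.g. a 1 GB activation or gradient shard
print(f"NVLink:     {transfer_ms(size_gb, NVLINK_GBPS):.2f} ms")
print(f"InfiniBand: {transfer_ms(size_gb, IB_GBPS):.2f} ms")
```

Under these assumptions the same 1 GB payload takes about 2 ms intra-node but about 20 ms inter-node, which is why parallelism strategies try to keep the most bandwidth-hungry traffic inside the NVLink domain.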

Why it matters for LLM training

  • Data-parallel gradient all-reduce crosses nodes once per training step. InfiniBand's bandwidth, latency, and collective-offload support determine how much of the step is communication-bound.
  • Pipeline parallelism naturally tolerates the slower inter-node hop: stage boundaries exchange only point-to-point activation handoffs, at a lower communication frequency. This is why PP typically spans InfiniBand while TP stays within the NVLink domain.
  • Topology choice (fat-tree vs dragonfly vs torus; non-blocking vs oversubscribed) shapes collective performance at scale. Not disclosed for the eBay cluster.
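The per-step all-reduce cost in the first bullet can be sketched with the classic ring all-reduce model, in which each rank moves 2·(n−1)/n of the buffer over 2·(n−1) latency-bound steps. The buffer size, rank count, bandwidth, and hop latency below are hypothetical, chosen only to show the shape of the calculation:

```python
def ring_allreduce_seconds(size_bytes: float, n_ranks: int,
                           bandwidth_bps: float, latency_s: float) -> float:
    """Classic ring all-reduce cost model: each rank sends and receives
    2*(n-1)/n of the buffer across 2*(n-1) latency-bound steps."""
    if n_ranks < 2:
        return 0.0  # nothing to reduce across a single rank
    steps = 2 * (n_ranks - 1)
    bytes_on_wire = 2 * (n_ranks - 1) / n_ranks * size_bytes
    return steps * latency_s + bytes_on_wire / bandwidth_bps

# Hypothetical numbers: 10 GB of gradients, 64 ranks,
# 50 GB/s of per-node InfiniBand bandwidth, 5 microseconds per hop.
t = ring_allreduce_seconds(10e9, 64, 50e9, 5e-6)
print(f"all-reduce ≈ {t * 1000:.1f} ms per step")
```

At these assumed numbers the bandwidth term dominates the latency term by orders of magnitude, which is why the bullet above emphasizes bandwidth and collective offload for large gradient buffers; latency matters more for small messages and large rank counts.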