InfiniBand¶
InfiniBand is a low-latency, high-bandwidth inter-node fabric widely used for HPC and large-scale AI training. It provides RDMA (remote direct memory access), kernel bypass, and hardware-offloaded collectives: the properties that make large distributed-training collective operations (all-reduce, all-gather, reduce-scatter) efficient across many nodes.
Contrast with NVLink (intra-node GPU-to-GPU, hundreds of GB/s per link): InfiniBand runs between nodes, at per-link bandwidths roughly an order of magnitude lower than NVLink, but it spans the whole cluster.
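As a rough sense of the gap, a back-of-envelope comparison (a sketch with illustrative figures for H100-class hardware, not numbers from the source: ~900 GB/s aggregate NVLink bandwidth per GPU, 400 Gb/s per NDR InfiniBand port):

```python
# Back-of-envelope intra-node vs inter-node bandwidth comparison.
# Figures below are assumed, illustrative H100-era numbers, not from the source.
nvlink_gb_s = 900       # GB/s, NVLink aggregate per GPU, intra-node
ib_gb_s = 400 / 8       # 400 Gb/s NDR port -> 50 GB/s, inter-node

ratio = nvlink_gb_s / ib_gb_s
print(f"NVLink aggregate is ~{ratio:.0f}x one NDR InfiniBand port")
```

With these assumed figures the gap is about 18x per GPU, consistent with the "order of magnitude" framing above; real deployments often attach multiple NICs per node, which narrows the ratio.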
Seen in (wiki)¶
- eBay e-Llama training cluster. InfiniBand is explicitly named as the inter-node interconnect (with NVLink for intra-node) on the 60-node × 8-GPU × H100 = 480-GPU fleet used for continued pretraining of Llama 3.1 8B + 70B. (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
Why it matters for LLM training¶
- Data-parallel gradient all-reduce crosses nodes once per training step. InfiniBand's bandwidth, latency, and collective offload together determine how much of the step is communication-bound.
- Pipeline parallelism tolerates the slower inter-node hop well: it needs only point-to-point activation handoffs at stage boundaries, and it communicates less frequently than tensor parallelism. This is why PP typically spans InfiniBand while TP stays within the NVLink domain.
- Topology choice (fat-tree vs dragonfly vs torus; non-blocking vs oversubscribed) shapes collective performance at scale. Not disclosed for the eBay cluster.
Related¶
- systems/nvlink — the intra-node counterpart.
- systems/nvidia-h100 — the GPUs whose RDMA-capable NICs terminate the InfiniBand fabric.
- systems/megatron-lm — the training framework whose 3D parallelism exploits the NVLink + InfiniBand split.
- concepts/pipeline-parallelism / concepts/data-parallelism / concepts/3d-parallelism
- concepts/rdma-kv-transfer — adjacent RDMA-over-fabric pattern on the inference side.