InfiniBand
InfiniBand is a low-latency, high-bandwidth inter-node fabric widely used for HPC and large-scale AI training. It provides RDMA (remote direct memory access), kernel bypass, and hardware-offloaded collectives: the properties that make the collective operations of large distributed training (all-reduce, all-gather, reduce-scatter) efficient across many nodes.
Contrast with NVLink, the intra-node GPU-to-GPU interconnect (hundreds of GB/s per GPU): InfiniBand runs between nodes, at roughly an order of magnitude less bandwidth per GPU than NVLink, but it spans the whole cluster.
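To make the bandwidth contrast concrete, here is a minimal sketch of the standard alpha-beta (latency plus bandwidth) estimate for a ring all-reduce. The function name and every number below (message size, link bandwidths, per-step latency) are illustrative assumptions, not figures from the cited sources, and the model ignores compute overlap, hierarchical algorithms, and in-network offload.

```python
def ring_allreduce_seconds(message_bytes, num_ranks, bandwidth_bytes_per_s, latency_s):
    """Alpha-beta estimate of ring all-reduce time.

    A ring all-reduce runs 2 * (n - 1) steps (reduce-scatter, then all-gather).
    In each step every rank sends one chunk of message_bytes / n to its neighbour.
    """
    if num_ranks < 2:
        return 0.0  # nothing to exchange with a single rank
    steps = 2 * (num_ranks - 1)
    chunk_bytes = message_bytes / num_ranks
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


# Assumed, illustrative inputs (hypothetical, not taken from the sources):
GRADIENT_BYTES = 16e9   # e.g. 8e9 parameters at 2 bytes each
FAST_LINK = 450e9       # bytes/s, an NVLink-class figure
SLOW_LINK = 50e9        # bytes/s, a 400 Gb/s InfiniBand-class port
STEP_LATENCY = 5e-6     # seconds per step, assumed equal for both links

fast = ring_allreduce_seconds(GRADIENT_BYTES, 8, FAST_LINK, STEP_LATENCY)
slow = ring_allreduce_seconds(GRADIENT_BYTES, 8, SLOW_LINK, STEP_LATENCY)
print(f"fast link: {fast * 1e3:.1f} ms, slow link: {slow * 1e3:.1f} ms")
```

For large messages the bandwidth term dominates, so under this model the estimated time scales almost inversely with link bandwidth; for small messages the per-step latency term takes over.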
Seen in (wiki)
- eBay e-Llama training cluster. InfiniBand is explicitly named as the inter-node interconnect (with NVLink for intra-node) on the 480-GPU fleet (60 nodes × 8 H100 GPUs) used for continued pretraining of Llama 3.1 8B and 70B. (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
- Meta 24K-GPU GenAI cluster (2024-06). The InfiniBand one of Meta's two 24,000-GPU H100 clusters, optimised for full-bisection bandwidth. It evolved from Meta's 16K-GPU InfiniBand AI Research SuperCluster into a production-integrated GenAI substrate and was used (alongside the sibling RoCE cluster) to train Llama 3. First wiki reference to production-integrated InfiniBand at 24K-GPU scale. (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
Why it matters for LLM training
- Data-parallel gradient all-reduce crosses nodes once per training step. InfiniBand's bandwidth, latency, and collective offload together determine how much of the step is communication-bound.
- Pipeline parallelism naturally tolerates the slower inter-node hop: its traffic consists of point-to-point activation handoffs at stage boundaries, which occur less frequently than tensor-parallel collectives. This is why PP spans InfiniBand while TP stays within the NVLink domain.
- Topology choice (fat-tree vs dragonfly vs torus; non-blocking vs oversubscribed) shapes collective performance at scale. The eBay source does not disclose the topology of its cluster.
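The TP/PP/DP placement described above can be sketched as a rank layout. This is a minimal illustration under stated assumptions: the rank ordering (TP varying fastest, then DP, then PP) is one common convention rather than anything the sources specify, and the parallel degrees (TP=8, PP=4, DP=15) are hypothetical values chosen only so that they multiply to 480 ranks on 8-GPU nodes. The helper names are invented for this example.

```python
GPUS_PER_NODE = 8  # assumed node size


def node_of(rank, gpus_per_node=GPUS_PER_NODE):
    """Node index hosting a global rank, assuming ranks fill nodes in order."""
    return rank // gpus_per_node


def build_groups(tp_size, pp_size, dp_size):
    """Enumerate TP, DP and PP rank groups.

    Assumed ordering: tp varies fastest, then dp, then pp.
    """
    def rank(tp, dp, pp):
        return (pp * dp_size + dp) * tp_size + tp

    tp_groups = [[rank(t, d, p) for t in range(tp_size)]
                 for p in range(pp_size) for d in range(dp_size)]
    dp_groups = [[rank(t, d, p) for d in range(dp_size)]
                 for p in range(pp_size) for t in range(tp_size)]
    pp_groups = [[rank(t, d, p) for p in range(pp_size)]
                 for d in range(dp_size) for t in range(tp_size)]
    return {"tp": tp_groups, "dp": dp_groups, "pp": pp_groups}


def crosses_nodes(group):
    """True if the group's ranks live on more than one node."""
    return len({node_of(r) for r in group}) > 1


# Hypothetical degrees: 8 * 4 * 15 = 480 ranks.
layout = build_groups(tp_size=8, pp_size=4, dp_size=15)
for name, rank_groups in layout.items():
    crossing = sum(crosses_nodes(g) for g in rank_groups)
    print(f"{name}: {crossing} of {len(rank_groups)} groups cross nodes")
```

With TP equal to the node size, every TP group sits inside one node (the NVLink domain), while every DP and PP group spans several nodes and therefore communicates over the inter-node fabric.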
Related
- systems/roce-rdma-over-converged-ethernet: the Ethernet-based alternative, which Meta built alongside its InfiniBand cluster at equal scale. Meta's 2024-08-05 SIGCOMM paper is the RoCE-side deep-dive; InfiniBand is the benchmark it reaches parity with. Meta notes that on RoCE they had to invest in explicit QP-based hashing to reach the fabric-level behaviour IB provides via native adaptive routing.
- systems/nvlink: the intra-node counterpart.
- systems/nvidia-h100: the GPUs in the nodes whose RDMA-capable NICs terminate the InfiniBand fabric.
- systems/megatron-lm: the training framework whose 3D parallelism exploits the NVLink + InfiniBand split.
- systems/meta-genai-cluster-infiniband: Meta's production 24K-GPU InfiniBand cluster.
- systems/llama-3: trained on Meta's InfiniBand + RoCE clusters.
- concepts/pipeline-parallelism / concepts/data-parallelism / concepts/3d-parallelism
- concepts/collective-communication-topology-awareness / concepts/fat-flow-load-balancing: optimisations that apply to InfiniBand fabrics at 24K-GPU scale.
- concepts/rdma-kv-transfer: adjacent RDMA-over-fabric pattern on the inference side.