Meta — How Meta trains large language models at scale¶
Summary¶
Meta Engineering's 2024-06-12 post describes the infrastructure shift from training many small recommender models on many GPUs to training a few enormous GenAI models on very large GPU fleets. The post is a cross-cutting tour of four stressed layers: hardware reliability, checkpointing, inter-GPU connectivity, and data-center deployment. Meta concurrently built two 24,000-GPU H100 clusters — one on RoCE, one on InfiniBand — and trained Llama 3 on both fabrics, with the largest model trained on the RoCE cluster. The post is the canonical Meta statement on its GenAI training substrate circa the Llama 3 era.
Key takeaways¶
- Workload inversion: few huge jobs, not many small ones. "With the advent of generative AI (GenAI), we've seen a shift towards fewer jobs, but incredibly large ones." Traditional Meta training (ranking, recommendations) ran many models on comparatively small GPU pools; GenAI reverses the shape, which stresses reliability, recovery, and fabric differently. (Source text)
- Four simultaneously-stressed properties at scale. Meta names them: hardware reliability, fast recovery on failure, efficient preservation of training state, and optimal GPU connectivity. All four get harder as GPU count grows — more GPUs mean more failures per unit time, more state to checkpoint, and more bisection bandwidth required. Together these constrain the achievable job size. (Source text)
- Built both RoCE and InfiniBand clusters — deliberately. Meta had four years of RoCE operational experience but only up to 4K-GPU clusters; InfiniBand experience existed up to 16K GPUs on the AI Research SuperCluster but was not production-integrated. "We decided to build both: two 24k clusters, one with RoCE and another with InfiniBand." RoCE was optimised for build speed; InfiniBand for full-bisection bandwidth. Both were tuned to equivalent performance on large GenAI workloads, and Llama 3 was trained on both — the largest model on the RoCE cluster. (Source text; see patterns/build-both-fabric-alternatives)
- Adapted Grand Teton to H100 at 700 W. The NVIDIA H100 rollout was made via a modified Grand Teton platform (Meta's open-sourced AI training server chassis). Meta increased GPU TDP to 700 W, moved to HBM3, and — because the cooling infrastructure could not be changed in time — kept air cooling rather than moving to liquid. This forced mechanical and thermal redesign on a tight schedule. "All of these hardware-related changes were challenging because we had to find a solution that fit within the existing resource constraints, with a very small degree of freedom to change and meet a tight schedule." (Source text)
- Data-center power/cooling is a harder constraint than silicon. "Data center power and cooling infrastructure cannot be changed quickly (or easily) and we had to find an optimal layout that allowed maximum compute capability within a data hall." Meta relocated supporting services (readers) out of the data hall to free power/space, and packed GPU racks as densely as possible to maximise compute and bisection-bandwidth within the largest network cluster. Data-hall layout is load-bearing; see patterns/data-center-density-optimization. (Source text)
- Top failure modes at scale (observed most often in early-life hardware, settling as servers age): GPUs "falling off" the PCIe bus (no longer detected by the host), DRAM/SRAM uncorrectable errors (monitored per device, with RMA initiated past vendor thresholds), and network cable failures presenting as unreachable servers. Meta tracks repeat offenders and sometimes takes preventive remediation on early-warning signals. (Source text; see concepts/gpu-training-failure-modes)
- Three network optimisations made the two 24K-GPU fabrics equally fast. (a) Assign communication patterns from different parallelism axes (DP / TP / PP) to different layers of the network topology so topology bandwidth is spent efficiently. (b) Topology-aware collectives — custom algorithms such as recursive doubling/halving replacing the default ring all-reduce — to tolerate higher latency. (c) Load balancing for fat flows. Unlike ranking jobs, GenAI training produces fat flows (large, long-lived tensor transfers), which don't naturally distribute across ECMP paths; Meta invested further in routing and network load balancing to spread them across available paths. (Source text; see concepts/collective-communication-topology-awareness, concepts/fat-flow-load-balancing)
- PyTorch as the training-software surface. Meta exposes PyTorch (and internal algorithms / tools) to researchers to keep research-to-production fast. The post does not disclose Meta's specific parallelism framework (Megatron-LM / FSDP / custom) for Llama 3, only the fabric and hardware substrate. (Source text)
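The reliability takeaway is easy to make concrete with back-of-envelope arithmetic. The per-GPU failure rate below is an illustrative assumption, not a number from the post:

```python
# Back-of-envelope: expected time between hardware interruptions for a
# synchronous job, assuming independent failures (illustrative numbers only).

def mean_hours_between_interruptions(num_gpus: int, per_gpu_mtbf_hours: float) -> float:
    """With independent failures, the fleet-wide failure rate is the sum of
    per-device rates, so the mean interval shrinks as 1/num_gpus."""
    return per_gpu_mtbf_hours / num_gpus

# Assume one failure per GPU every 5 years (~43,800 hours) -- an assumption.
PER_GPU_MTBF_H = 5 * 365 * 24

single_host = mean_hours_between_interruptions(8, PER_GPU_MTBF_H)        # one server
full_cluster = mean_hours_between_interruptions(24_000, PER_GPU_MTBF_H)  # one 24K cluster

print(f"8 GPUs:      one interruption every {single_host:,.0f} hours")
print(f"24,000 GPUs: one interruption every {full_cluster:.2f} hours")
```

Even under this generous assumption, a 24K-GPU synchronous job sees an interruption roughly every couple of hours — which is why fast recovery and efficient checkpointing appear alongside raw hardware reliability in Meta's list.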
Systems / hardware extracted¶
- systems/grand-teton — Meta's open-sourced AI training platform, modified for H100 700 W + HBM3, kept air-cooled due to data-center-infrastructure constraints.
- systems/nvidia-h100 — the GPU underlying both 24K clusters; Meta runs them at 700 W TDP (above stock).
- systems/meta-genai-cluster-roce — the 24K-GPU RoCE cluster; optimised for build speed; hosted the largest Llama 3 training run.
- systems/meta-genai-cluster-infiniband — the 24K-GPU InfiniBand cluster; optimised for full-bisection bandwidth.
- systems/roce-rdma-over-converged-ethernet — the Ethernet-based RDMA fabric Meta matured for production AI; 2024-06 scale step was 4K → 24K GPUs.
- systems/infiniband — the HPC-native RDMA fabric; scaled from the 16K-GPU AI Research SuperCluster to a production-integrated 24K-GPU GenAI cluster for Llama 3.
- systems/llama-3 — trained on both clusters; largest model on RoCE.
Concepts extracted¶
- concepts/training-checkpoint — efficient preservation of training state as a named first-class scaling property.
- concepts/hardware-reliability-at-scale — failure rate scales with cluster size; minimising interruptions requires rigorous testing + automation + spare capacity.
- concepts/gpu-training-failure-modes — GPU-falls-off-PCIe, DRAM/SRAM UCE, network-cable unreachability; early-life-biased distribution.
- concepts/collective-communication-topology-awareness — replacing default (ring) collectives with topology-aware algorithms like recursive doubling/halving.
- concepts/fat-flow-load-balancing — training produces long-lived, high-bandwidth flows that defeat ECMP hashing; explicit routing / load balancing is required.
Existing concepts reinforced:
- concepts/3d-parallelism — Meta explicitly maps DP / TP / PP onto different layers of the network topology; the post is a Meta-side confirmation of the substrate eBay's e-Llama post described in open-source terms.
- concepts/data-parallelism / concepts/pipeline-parallelism — surfaced as the axes whose communication patterns are assigned to different topology layers.
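As a sketch of the topology-aware-collectives concept: recursive doubling completes an all-reduce in log2(N) exchange steps rather than the 2(N−1) steps of a default ring, which is why it tolerates higher per-hop latency. A minimal in-process simulation (not Meta's implementation):

```python
# Recursive-doubling allreduce simulated in-process: log2(N) exchange rounds,
# versus 2*(N-1) steps for a default ring allreduce.

def recursive_doubling_allreduce(buffers: list[list[float]]) -> int:
    """In-place sum-allreduce across `buffers`; rank count must be a power of two.
    Returns the number of exchange rounds taken."""
    n = len(buffers)
    assert n & (n - 1) == 0, "power-of-two rank count assumed"
    steps = 0
    dist = 1
    while dist < n:
        # Pairwise exchange: rank r talks to rank r XOR dist; both sides end
        # up with the pair's sum, so after log2(n) rounds every rank holds
        # the global sum. Copy first to model simultaneous exchange.
        snapshot = [list(b) for b in buffers]
        for r in range(n):
            partner = r ^ dist
            buffers[r] = [a + b for a, b in zip(snapshot[r], snapshot[partner])]
        dist <<= 1
        steps += 1
    return steps

ranks = [[float(r)] * 4 for r in range(8)]  # 8 ranks, 4-element gradient each
steps = recursive_doubling_allreduce(ranks)
print(steps, ranks[0])  # 3 rounds; every rank holds [28.0, 28.0, 28.0, 28.0]
```

The latency win is the point: 8 ranks need 3 rounds here versus 14 ring steps, at the cost of sending full buffers each round — the kind of tradeoff a topology-aware implementation tunes per tier.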
Patterns extracted¶
- patterns/build-both-fabric-alternatives — when tradeoffs between two production fabrics are unclear at scale, build at-scale of both and compare operationally rather than forecasting which will win.
- patterns/data-center-density-optimization — relocate non-compute services out of the GPU data hall, pack GPU racks densely within a single network cluster to maximise compute density and bisection bandwidth under fixed power/cooling.
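The fat-flow concept above can be illustrated directly: ECMP chooses a path by hashing flow headers, which balances well for many small flows but can pin a handful of fat flows onto the same link. The hash function and flow tuples below are stand-ins, not anything a real switch ASIC uses:

```python
# Why fat flows defeat ECMP: path choice is a hash of the flow tuple, so a
# few large flows can collide on one link while other links sit idle.

import hashlib

NUM_PATHS = 4

def ecmp_path(src: str, dst: str, sport: int, dport: int) -> int:
    """Pick an uplink by hashing the flow tuple (stand-in for a switch hash)."""
    key = f"{src}:{sport}->{dst}:{dport}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % NUM_PATHS

# Many small flows: the hash spreads load out statistically.
small = [ecmp_path("10.0.0.1", "10.0.1.1", 40_000 + i, 4791) for i in range(1000)]
per_path = [small.count(p) for p in range(NUM_PATHS)]
print("1000 small flows per path:", per_path)  # roughly 250 each

# Four fat flows: each lands wherever its hash dictates; with so few samples,
# collisions are common, halving effective bandwidth on the shared link.
fat = [ecmp_path("10.0.0.1", "10.0.1.1", sp, 4791) for sp in (50_001, 50_002, 50_003, 50_004)]
print("4 fat flows on paths:", fat, "-> distinct paths used:", len(set(fat)))
```

Ranking workloads resemble the first case; large-tensor GenAI training resembles the second, hence Meta's extra investment in routing and explicit load balancing.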
Operational / architectural numbers¶
| Datum | Value |
|---|---|
| Cluster size (each) | 24,000 GPUs × 2 (RoCE + InfiniBand) |
| GPU | NVIDIA H100, 80 GB, HBM3 |
| GPU TDP (Meta's config) | 700 W (increased from stock) |
| Cooling | Air (could not change infra in time for H100 rollout) |
| Platform | Modified Grand Teton |
| Prior RoCE production scale | 4,000 GPUs |
| Prior InfiniBand scale (research) | 16,000 GPUs (AI Research SuperCluster) |
| Model trained on both | Llama 3 |
| Largest model cluster | RoCE |
Specific per-job utilisation, MFU, checkpoint cadence, and parallelism degrees are not disclosed in this post.
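The parallelism-to-topology mapping named in the takeaways can be sketched as coordinate arithmetic: lay global ranks out with TP innermost (one server), PP next, and DP outermost, and each axis's collective traffic stays within a known network tier. The degrees and layout below are hypothetical illustrations, not Meta's disclosed configuration:

```python
# Hypothetical 3D layout: TP within a server (e.g. NVLink), PP within a pod,
# DP across pods -- matching each axis's traffic to a network tier.

TP, PP, DP = 8, 4, 16  # illustrative degrees; 8 * 4 * 16 = 512 GPUs

def coords(rank: int) -> tuple[int, int, int]:
    """Decompose a global rank into (dp, pp, tp), with TP varying fastest."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp

def tp_group(rank: int) -> list[int]:
    """Ranks sharing this rank's (dp, pp): TP traffic stays inside one server."""
    dp, pp, _ = coords(rank)
    base = (dp * PP + pp) * TP
    return list(range(base, base + TP))

def dp_group(rank: int) -> list[int]:
    """Ranks with the same (pp, tp) in every DP replica: traffic crosses pods."""
    _, pp, tp = coords(rank)
    return [(d * PP + pp) * TP + tp for d in range(DP)]

print(coords(42))    # rank 42 -> (dp=1, pp=1, tp=2)
print(tp_group(42))  # [40..47]: one hypothetical 8-GPU server
```

This is the principle-not-recipe version of the takeaway: the bandwidth-hungry TP groups are contiguous rank blocks on the fastest tier, while DP groups stride across the widest (and slowest) tier.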
Caveats¶
- The post is a tour, not a deep dive on any single layer. Specific numbers for fabric configurations (subscription ratios, switch radix, congestion control), checkpoint implementation (frequency, media, granularity), and parallelism framework (Megatron-LM vs FSDP vs internal) are not disclosed.
- "Equivalent performance" between RoCE and InfiniBand is asserted without per-workload benchmarks. Meta says tuning made them equivalent on large GenAI workloads; results on other workloads (e.g. HPC collectives, non-LLM training) are not claimed.
- Related Meta posts (Building Meta's GenAI Infrastructure, 2024-03-12, and Scaling RoCE Networks for AI Training, Networking @Scale 2023) are referenced but not ingested here.
- The full body reveals PyTorch as the surface but does not disclose the internal parallelism framework used for Llama 3. Treat the parallelism-topology-mapping takeaway as principle rather than recipe.
Source¶
- Original: https://engineering.fb.com/2024/06/12/data-infrastructure/training-large-language-models-at-scale-meta/
- Raw markdown:
raw/meta/2024-06-12-how-meta-trains-large-language-models-at-scale-f3fcaa00.md
- HN discussion: https://news.ycombinator.com/item?id=40664339 (404 points)
Related¶
- companies/meta — Meta company page (distinctive systems: Presto, TAO, Haystack, Tupperware, PTP, Owl, and this post's GenAI clusters).
- systems/llama-3-1 — Llama 3.1, the successor family; succeeds the Llama 3 trained on this substrate.
- systems/nvlink — the intra-node interconnect paired with RoCE/InfiniBand for inter-node in both 24K clusters.
- systems/megatron-lm — the open-source framework most commonly orchestrating H100 clusters at this scale; Meta does not disclose whether they use it for Llama 3.