Meta — How Meta trains large language models at scale¶
Summary¶
Meta Engineering's 2024-06-12 post describes the infrastructure shift from training many small recommender models on many GPUs to training a few enormous GenAI models on very large GPU fleets. The post is a cross-cutting tour of four stressed layers: hardware reliability, checkpointing, inter-GPU connectivity, and data-center deployment. Meta concurrently built two 24,000-GPU H100 clusters — one on RoCE, one on InfiniBand — and trained Llama 3 on both fabrics, with the largest model trained on the RoCE cluster. The post is the canonical Meta statement on its GenAI training substrate circa the Llama 3 era.
Key takeaways¶
- Workload inversion: few huge jobs, not many small ones. "With the advent of generative AI (GenAI), we've seen a shift towards fewer jobs, but incredibly large ones." Traditional Meta training (ranking, recommendations) ran many models on comparatively small GPU pools; GenAI reverses the shape, which stresses reliability, recovery, and fabric differently. (Source text)
- Four simultaneously-stressed properties at scale. Meta names them: hardware reliability, fast recovery on failure, efficient preservation of training state, and optimal GPU connectivity. All four get harder as GPU count grows — more GPUs mean more failures per unit time, more state to checkpoint, and more bisection bandwidth required. Together these constrain the achievable job size. (Source text)
- Built both RoCE and InfiniBand clusters — deliberately. Meta had four years of RoCE operational experience but only up to 4K-GPU clusters; InfiniBand experience existed up to 16K GPUs on the AI Research SuperCluster but was not production-integrated. "We decided to build both: two 24k clusters, one with RoCE and another with InfiniBand." RoCE was optimised for build speed; InfiniBand for full-bisection bandwidth. Both were tuned to equivalent performance on large GenAI workloads, and Llama 3 was trained on both — the largest model on the RoCE cluster. (Source text; see patterns/build-both-fabric-alternatives)
- Adapted Grand Teton to H100 at 700 W. The NVIDIA H100 rollout was made via a modified Grand Teton platform (Meta's open-sourced AI training server chassis). Meta increased GPU TDP to 700 W, moved to HBM3, and — because the cooling infrastructure could not be changed in time — kept air cooling rather than moving to liquid. This forced mechanical and thermal redesign on a tight schedule. "All of these hardware-related changes were challenging because we had to find a solution that fit within the existing resource constraints, with a very small degree of freedom to change and meet a tight schedule." (Source text)
- Data-center power/cooling is a harder constraint than silicon. "Data center power and cooling infrastructure cannot be changed quickly (or easily) and we had to find an optimal layout that allowed maximum compute capability within a data hall." Meta relocated supporting services (readers) out of the data hall to free power/space, and packed GPU racks as densely as possible to maximise compute and bisection-bandwidth within the largest network cluster. Data-hall layout is load-bearing; see patterns/data-center-density-optimization. (Source text)
- Top failure modes at scale (observed most often in early-life hardware, settling as servers age): GPUs "falling off" the PCIe bus (no longer detected by the host), DRAM/SRAM uncorrectable errors (monitored per device, with RMA initiated past vendor thresholds), and network cable failures presenting as unreachable servers. Meta tracks repeat offenders and sometimes takes preventive remediation on early-warning signals. (Source text; see concepts/gpu-training-failure-modes)
- Three network optimisations made the two 24K-GPU fabrics equally fast. (a) Assign communication patterns from different parallelism axes (DP / TP / PP) to different layers of the network topology so topology bandwidth is spent efficiently. (b) Topology-aware collectives — custom algorithms such as recursive doubling/halving replacing the default ring all-reduce — to tolerate higher latency. (c) Load balancing for fat flows. Unlike ranking jobs, GenAI training produces fat flows (large, long-lived tensor transfers), which don't naturally distribute across ECMP paths; Meta invested further in routing and network load balancing to spread them across available paths. (Source text; see concepts/collective-communication-topology-awareness, concepts/fat-flow-load-balancing)
- PyTorch as the training-software surface. Meta exposes PyTorch (and internal algorithms / tools) to researchers to keep research-to-production fast. The post does not disclose Meta's specific parallelism framework (Megatron-LM / FSDP / custom) for Llama 3, only the fabric and hardware substrate. (Source text)
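The reliability takeaway is easy to make concrete with back-of-envelope arithmetic. The per-GPU failure rate below is an illustrative assumption, not a number from the post:

```python
# Back-of-envelope: expected time between hardware interruptions for a
# synchronous job, assuming independent failures (illustrative numbers only).

def mean_hours_between_interruptions(num_gpus: int, per_gpu_mtbf_hours: float) -> float:
    """With independent failures, the fleet-wide failure rate is the sum of
    per-device rates, so the mean interval shrinks as 1/num_gpus."""
    return per_gpu_mtbf_hours / num_gpus

# Assume one failure per GPU every 5 years (~43,800 hours) -- an assumption.
PER_GPU_MTBF_H = 5 * 365 * 24

single_host = mean_hours_between_interruptions(8, PER_GPU_MTBF_H)        # one server
full_cluster = mean_hours_between_interruptions(24_000, PER_GPU_MTBF_H)  # one 24K cluster

print(f"8 GPUs:      one interruption every {single_host:,.0f} hours")
print(f"24,000 GPUs: one interruption every {full_cluster:.2f} hours")
```

Even under this generous assumption, a 24K-GPU synchronous job sees an interruption roughly every couple of hours — which is why fast recovery and efficient checkpointing appear alongside raw hardware reliability in Meta's list.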
Systems / hardware extracted¶
- systems/grand-teton — Meta's open-sourced AI training platform, modified for H100 700 W + HBM3, kept air-cooled due to data-center-infrastructure constraints.
- systems/nvidia-h100 — the GPU underlying both 24K clusters; Meta runs them at 700 W TDP (above stock).
- systems/meta-genai-cluster-roce — the 24K-GPU RoCE cluster; optimised for build speed; hosted the largest Llama 3 training run.
- systems/meta-genai-cluster-infiniband — the 24K-GPU InfiniBand cluster; optimised for full-bisection bandwidth.
- systems/roce-rdma-over-converged-ethernet — the Ethernet-based RDMA fabric Meta matured for production AI; 2024-06 scale step was 4K → 24K GPUs.
- systems/infiniband — the HPC-native RDMA fabric; scaled from the 16K-GPU AI Research SuperCluster to a production-integrated 24K-GPU GenAI cluster for Llama 3.
- systems/llama-3 — trained on both clusters; largest model on RoCE.
Concepts extracted¶
- concepts/training-checkpoint — efficient preservation of training state as a named first-class scaling property.
- concepts/hardware-reliability-at-scale — failure rate scales with cluster size; minimising interruptions requires rigorous testing + automation + spare capacity.
- concepts/gpu-training-failure-modes — GPU-falls-off-PCIe, DRAM/SRAM UCE, network-cable unreachability; early-life-biased distribution.
- concepts/collective-communication-topology-awareness — replacing default (ring) collectives with topology-aware algorithms like recursive doubling/halving.
- concepts/fat-flow-load-balancing — training produces long-lived, high-bandwidth flows that defeat ECMP hashing; explicit routing / load balancing is required.
Existing concepts reinforced:
- concepts/3d-parallelism — Meta explicitly maps DP / TP / PP onto different layers of the network topology; the post is a Meta-side confirmation of the substrate eBay's e-Llama post described in open-source terms.
- concepts/data-parallelism / concepts/pipeline-parallelism — surfaced as the axes whose communication patterns are assigned to different topology layers.
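As a sketch of the topology-aware-collectives concept: recursive doubling completes an all-reduce in log2(N) exchange steps rather than the 2(N−1) steps of a default ring, which is why it tolerates higher per-hop latency. A minimal in-process simulation (not Meta's implementation):

```python
# Recursive-doubling allreduce simulated in-process: log2(N) exchange rounds,
# versus 2*(N-1) steps for a default ring allreduce.

def recursive_doubling_allreduce(buffers: list[list[float]]) -> int:
    """In-place sum-allreduce across `buffers`; rank count must be a power of two.
    Returns the number of exchange rounds taken."""
    n = len(buffers)
    assert n & (n - 1) == 0, "power-of-two rank count assumed"
    steps = 0
    dist = 1
    while dist < n:
        # Pairwise exchange: rank r talks to rank r XOR dist; both sides end
        # up with the pair's sum, so after log2(n) rounds every rank holds
        # the global sum. Copy first to model simultaneous exchange.
        snapshot = [list(b) for b in buffers]
        for r in range(n):
            partner = r ^ dist
            buffers[r] = [a + b for a, b in zip(snapshot[r], snapshot[partner])]
        dist <<= 1
        steps += 1
    return steps

ranks = [[float(r)] * 4 for r in range(8)]  # 8 ranks, 4-element gradient each
steps = recursive_doubling_allreduce(ranks)
print(steps, ranks[0])  # 3 rounds; every rank holds [28.0, 28.0, 28.0, 28.0]
```

The latency win is the point: 8 ranks need 3 rounds here versus 14 ring steps, at the cost of sending full buffers each round — the kind of tradeoff a topology-aware implementation tunes per tier.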
Patterns extracted¶
- patterns/build-both-fabric-alternatives — when tradeoffs between two production fabrics are unclear at scale, build at-scale of both and compare operationally rather than forecasting which will win.
- patterns/data-center-density-optimization — relocate non-compute services out of the GPU data hall, pack GPU racks densely within a single network cluster to maximise compute density and bisection bandwidth under fixed power/cooling.
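The fat-flow concept above can be illustrated directly: ECMP chooses a path by hashing flow headers, which balances well for many small flows but can pin a handful of fat flows onto the same link. The hash function and flow tuples below are stand-ins, not anything a real switch ASIC uses:

```python
# Why fat flows defeat ECMP: path choice is a hash of the flow tuple, so a
# few large flows can collide on one link while other links sit idle.

import hashlib

NUM_PATHS = 4

def ecmp_path(src: str, dst: str, sport: int, dport: int) -> int:
    """Pick an uplink by hashing the flow tuple (stand-in for a switch hash)."""
    key = f"{src}:{sport}->{dst}:{dport}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % NUM_PATHS

# Many small flows: the hash spreads load out statistically.
small = [ecmp_path("10.0.0.1", "10.0.1.1", 40_000 + i, 4791) for i in range(1000)]
per_path = [small.count(p) for p in range(NUM_PATHS)]
print("1000 small flows per path:", per_path)  # roughly 250 each

# Four fat flows: each lands wherever its hash dictates; with so few samples,
# collisions are common, halving effective bandwidth on the shared link.
fat = [ecmp_path("10.0.0.1", "10.0.1.1", sp, 4791) for sp in (50_001, 50_002, 50_003, 50_004)]
print("4 fat flows on paths:", fat, "-> distinct paths used:", len(set(fat)))
```

Ranking workloads resemble the first case; large-tensor GenAI training resembles the second, hence Meta's extra investment in routing and explicit load balancing.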
Operational / architectural numbers¶
| Datum | Value |
|---|---|
| Cluster size (each) | 24,000 GPUs × 2 (RoCE + InfiniBand) |
| GPU | NVIDIA H100, 80 GB, HBM3 |
| GPU TDP (Meta's config) | 700 W (increased from stock) |
| Cooling | Air (could not change infra in time for H100 rollout) |
| Platform | Modified Grand Teton |
| Prior RoCE production scale | 4,000 GPUs |
| Prior InfiniBand scale (research) | 16,000 GPUs (AI Research SuperCluster) |
| Model trained on both | Llama 3 |
| Largest model cluster | RoCE |
Specific per-job utilisation, MFU, checkpoint cadence, and parallelism degrees are not disclosed in this post.
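The parallelism-to-topology mapping named in the takeaways can be sketched as coordinate arithmetic: lay global ranks out with TP innermost (one server), PP next, and DP outermost, and each axis's collective traffic stays within a known network tier. The degrees and layout below are hypothetical illustrations, not Meta's disclosed configuration:

```python
# Hypothetical 3D layout: TP within a server (e.g. NVLink), PP within a pod,
# DP across pods -- matching each axis's traffic to a network tier.

TP, PP, DP = 8, 4, 16  # illustrative degrees; 8 * 4 * 16 = 512 GPUs

def coords(rank: int) -> tuple[int, int, int]:
    """Decompose a global rank into (dp, pp, tp), with TP varying fastest."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp

def tp_group(rank: int) -> list[int]:
    """Ranks sharing this rank's (dp, pp): TP traffic stays inside one server."""
    dp, pp, _ = coords(rank)
    base = (dp * PP + pp) * TP
    return list(range(base, base + TP))

def dp_group(rank: int) -> list[int]:
    """Ranks with the same (pp, tp) in every DP replica: traffic crosses pods."""
    _, pp, tp = coords(rank)
    return [(d * PP + pp) * TP + tp for d in range(DP)]

print(coords(42))    # rank 42 -> (dp=1, pp=1, tp=2)
print(tp_group(42))  # [40..47]: one hypothetical 8-GPU server
```

This is the principle-not-recipe version of the takeaway: the bandwidth-hungry TP groups are contiguous rank blocks on the fastest tier, while DP groups stride across the widest (and slowest) tier.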
Caveats¶
- The post is a tour, not a deep dive on any single layer. Specific numbers for fabric configurations (subscription ratios, switch radix, congestion control), checkpoint implementation (frequency, media, granularity), and parallelism framework (Megatron-LM vs FSDP vs internal) are not disclosed.
- "Equivalent performance" between RoCE and InfiniBand is asserted without per-workload benchmarks. Meta says tuning made them equivalent on large GenAI workloads; results on other workloads (e.g. HPC collectives, non-LLM training) are not claimed.
- Related Meta posts (Building Meta's GenAI Infrastructure, 2024-03-12, and Scaling RoCE Networks for AI Training, Networking @Scale 2023) are referenced but not ingested here.
- The full body reveals PyTorch as the surface but does not disclose the internal parallelism framework used for Llama 3. Treat the parallelism-topology-mapping takeaway as principle rather than recipe.
Source¶
- Original: https://engineering.fb.com/2024/06/12/data-infrastructure/training-large-language-models-at-scale-meta/
- Raw markdown:
raw/meta/2024-06-12-how-meta-trains-large-language-models-at-scale-f3fcaa00.md
- HN discussion: https://news.ycombinator.com/item?id=40664339 (404 points)
Related¶
- companies/meta — Meta company page (distinctive systems: Presto, TAO, Haystack, Tupperware, PTP, Owl, and this post's GenAI clusters).
- systems/llama-3-1 — Llama 3.1, the successor family; succeeds the Llama 3 trained on this substrate.
- systems/nvlink — the intra-node interconnect paired with RoCE/InfiniBand for inter-node in both 24K clusters.
- systems/megatron-lm — the open-source framework most commonly orchestrating H100 clusters at this scale; Meta does not disclose whether they use it for Llama 3.