Megatron-LM¶
Megatron-LM is NVIDIA's open-source, highly optimised framework for training transformer-based LLMs at multi-node scale. Its defining feature is first-class support for 3D parallelism (concepts/3d-parallelism) — composing data parallel (DP), tensor parallel (TP) and pipeline parallel (PP) into a single training job — plus distributed optimizer states, flash-attention-2 integration, and other throughput and memory optimizations.
Seen in (wiki)¶
- eBay e-Llama training. eBay trained the 8B and 70B e-Llama models on 60 nodes × 8 H100 80 GB GPUs = 480 GPUs, with intra-node NVLink and inter-node InfiniBand, using Megatron-LM as the orchestration framework for 3D parallelism + distributed optimizer + flash-attention-2. The 70B model trained on 1 trillion tokens in ~1 month (~340k GPU-hours). (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
What Megatron-LM does¶
At training-cluster scale, a billion-plus-parameter model exceeds the memory of any single GPU, and its training compute budget exceeds what any single node can deliver in reasonable wall-clock. Megatron-LM's job is to partition model + optimizer + activations across GPUs and across nodes while keeping the forward/backward pass coherent and the hardware utilised:
- Data parallelism replicates the model across workers and shards the mini-batch; gradients are all-reduced at each step.
- Tensor parallelism splits individual weight matrices across GPUs, with per-layer all-reduce / all-gather of activations.
- Pipeline parallelism places different layers on different GPUs and streams microbatches through like a conveyor belt.
- Distributed optimizer state (ZeRO-style) shards the Adam-style optimizer state across data-parallel workers so each worker holds 1/N of the state rather than a full copy — the memory-capacity headroom that makes 70B training fit at a given GPU count.
- Flash-attention-2 integration swaps in the IO-aware attention kernel for the standard implementation, cutting attention memory from O(N²) to O(N) in sequence length and speeding up the dominant cost of long-context training.
The sketches below walk through the first four of these in miniature.
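A minimal single-process sketch of how the DP and TP shards recombine to the unsharded computation: plain NumPy stands in for what Megatron-LM does with torch.distributed collectives across real GPU ranks, and the degrees and shapes here are made up for the demo.

```python
import numpy as np

# Single-process simulation of DP x TP sharding. Real Megatron-LM does this
# with collectives across GPU ranks; NumPy slicing plays the role of the
# communication here.
rng = np.random.default_rng(0)

DP, TP = 2, 4                        # hypothetical degrees for the demo
batch, d_in, d_out = 8, 16, 32
x = rng.normal(size=(batch, d_in))   # full mini-batch
W = rng.normal(size=(d_in, d_out))   # one transformer weight matrix

y_ref = x @ W                        # reference: the unsharded forward pass

# Tensor parallelism: split W column-wise, one shard per TP rank.
W_shards = np.split(W, TP, axis=1)
# Data parallelism: split the mini-batch row-wise, one shard per DP replica.
x_shards = np.split(x, DP, axis=0)

# Each (dp, tp) "rank" computes a partial output. Concatenating over TP is
# the all-gather of activations; concatenating over DP reassembles the batch.
y = np.concatenate(
    [np.concatenate([xs @ Ws for Ws in W_shards], axis=1) for xs in x_shards],
    axis=0,
)
assert np.allclose(y, y_ref)         # sharded computation == unsharded
```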
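Pipeline parallelism's efficiency hinges on the bubble, the idle time while the pipe fills and drains. A back-of-envelope for a GPipe-style schedule, with hypothetical degrees (Megatron-LM's interleaved 1F1B schedule shrinks the bubble further):

```python
# Pipeline bubble fraction for a GPipe-style schedule: each stage idles
# (pp - 1) slots out of (microbatches + pp - 1) per optimizer step.
pp_stages = 4          # hypothetical pipeline depth
microbatches = 32      # microbatches streamed per optimizer step

bubble = (pp_stages - 1) / (microbatches + pp_stages - 1)
print(f"bubble fraction ~ {bubble:.1%}")   # ~8.6%: more microbatches, less idle
```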
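And a back-of-envelope for why the distributed optimizer is the capacity unlock at 70B. The TP=8 / PP=4 / DP=15 degrees and the 16-bytes-per-parameter Adam accounting (fp32 master weights plus two fp32 moments) are assumptions for illustration, not figures from the eBay post:

```python
# Assumed degrees: TP=8 inside a node, PP=4 across node groups, DP=15
# fills out 8 * 4 * 15 = 480 GPUs. Not disclosed by eBay.
P = 70e9                             # parameters
TP, PP, DP = 8, 4, 15
shard = P / (TP * PP)                # parameters resident per GPU

weights_grads = shard * (2 + 2)      # bf16 weights + bf16 gradients
opt_full = shard * 16                # fp32 master copy + Adam m and v
opt_sharded = opt_full / DP          # ZeRO-style: each DP rank keeps 1/DP

print(weights_grads / 1e9)           # ~8.75 GB per GPU
print(opt_full / 1e9)                # ~35 GB per GPU if every replica keeps it
print(opt_sharded / 1e9)             # ~2.3 GB per GPU with the distributed optimizer
```

Under these assumptions, sharding frees roughly 30 GB per 80 GB GPU, headroom that goes to activations, longer sequences and larger microbatches.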
Why Megatron-LM for continued pretraining¶
The e-Llama post frames the framework choice as a scale requirement:
"Model training at this scale requires an efficient distribution of model and optimizer states across several GPUs and sometimes even across nodes. We use Megatron-LM as a highly-optimized training framework that allows us to use 3D parallelism in training (data parallel (DP), tensor parallel (TP), pipeline parallel (PP) as well as distributed optimizer states, flash-attention-2, among other optimizations."
In other words: at 480 GPUs / 70B parameters / 1T tokens, framework choice is load-bearing for training wall-clock and GPU utilisation. (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
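The disclosed numbers are internally consistent, as a quick sanity check shows; the per-GPU token throughput at the end is a derived figure, not something the post states:

```python
# Sanity check on the disclosed e-Llama 70B figures.
gpus, hours = 480, 30 * 24          # "~1 month" of wall-clock on 480 GPUs
print(gpus * hours)                 # 345,600 GPU-hours, consistent with ~340k

tokens = 1e12                       # 1T tokens of continued pretraining
print(tokens / (345_600 * 3600))    # ~800 tokens/GPU/s implied (derived, not disclosed)
```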
What's not disclosed (from the eBay source)¶
- The actual DP × TP × PP degrees used across the 480 GPUs (a common choice at this scale is TP=8 within a node and PP=4-8 across node groups, with DP filling the rest; the sketch after this list enumerates the candidates).
- Checkpoint cadence + resume-from-failure mechanics.
- Communication algorithm / collective choice (NCCL version, ring vs tree all-reduce, overlap-with-compute settings).
- MFU (Model FLOPs Utilisation) or tokens-per-GPU-per-second numbers (a back-of-envelope estimate is sketched below).
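Since the degrees aren't given, a sketch like the following enumerates the plausible DP × TP × PP factorizations of 480 GPUs under the usual constraints; which one eBay actually ran is unknown:

```python
# Enumerate DP x TP x PP factorizations of 480 GPUs under folk constraints:
# TP confined to the 8-GPU NVLink island, modest pipeline depth.
WORLD = 480
for tp in (2, 4, 8):                 # TP stays inside a node
    for pp in (2, 4, 6, 8):          # pipeline depth across node groups
        if WORLD % (tp * pp) == 0:
            dp = WORLD // (tp * pp)  # data parallelism fills the remainder
            print(f"TP={tp} PP={pp} DP={dp}")
```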
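Likewise, the disclosed figures are enough to back out a rough MFU estimate. Both the 6·N·D training-FLOPs approximation and the assumed H100 BF16 peak are layered on top of the post, not numbers it reports:

```python
# Rough MFU from disclosed figures only. Assumptions (not from the post):
# training FLOPs ~= 6 * params * tokens; H100 dense BF16 peak ~989 TFLOP/s.
params, tokens = 70e9, 1e12
gpu_seconds = 340_000 * 3600         # the ~340k GPU-hours figure
peak = 989e12                        # assumed per-GPU peak, FLOP/s

mfu = (6 * params * tokens) / (gpu_seconds * peak)
print(f"~{mfu:.0%} MFU")             # roughly mid-30s percent under these assumptions
```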
Related¶
- systems/flash-attention-2 — the attention kernel Megatron-LM integrates.
- systems/e-llama — the eBay model trained on Megatron-LM.
- systems/nvidia-h100 / systems/nvlink / systems/infiniband — the hardware substrate Megatron-LM orchestrates.
- concepts/3d-parallelism / concepts/data-parallelism / concepts/tensor-parallelism / concepts/pipeline-parallelism — the parallelism axes Megatron-LM composes.
- concepts/continued-pretraining — the training technique Megatron-LM enables at scale.