
Megatron-LM

Megatron-LM is NVIDIA's open-source, highly optimised framework for training transformer-based LLMs at multi-node scale. Its defining feature is first-class support for [[concepts/3d-parallelism|3D parallelism]], composing data parallelism (DP), tensor parallelism (TP) and pipeline parallelism (PP) into a single training job, plus distributed optimizer states, flash-attention-2 integration, and other throughput and memory optimisations.


What Megatron-LM does

At training-cluster scale, a model with billions of parameters exceeds the memory of any single GPU and the aggregate FLOPs of any single node. Megatron-LM's job is to partition the model weights, optimizer state, and activations across GPUs and across nodes while keeping the forward/backward pass coherent and the hardware well utilised:

  • Data parallelism replicates the model across workers and shards the mini-batch. All-reduce gradients at each step.
  • Tensor parallelism splits individual weight matrices across GPUs. All-reduce / all-gather activations per layer.
  • Pipeline parallelism places different layers on different GPUs and streams microbatches through like a conveyor belt.
  • Distributed optimizer state (ZeRO-style) shards the Adam-style optimizer state across data-parallel workers so each worker holds 1/N of it rather than a full copy; this is the memory-capacity headroom that lets 70B-scale training fit at a given GPU count.
  • Flash-attention-2 integration swaps the standard attention implementation for the IO-aware flash-attention kernel, cutting attention memory from O(N²) to O(N) and speeding up the dominant cost of long-context training.
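The optimizer-state bullet above is ultimately memory arithmetic. The sketch below works it through for a 70B-parameter model under the common mixed-precision Adam recipe (bf16 params and grads, fp32 master weights and moments); the byte counts and the DP degree of 15 are illustrative assumptions, not figures from the eBay post, and activation memory is ignored.

```python
# Per-model memory for mixed-precision Adam training, showing the effect
# of ZeRO-style sharding of the fp32 optimizer state across DP workers.
# (TP/PP further divide the bf16 working copies per GPU; omitted here.)

def optimizer_state_gib(n_params: int, dp_degree: int) -> dict:
    bf16_params = 2 * n_params          # working copy of the weights
    bf16_grads  = 2 * n_params          # gradients
    fp32_master = 4 * n_params          # fp32 master weights
    adam_m      = 4 * n_params          # Adam first moment
    adam_v      = 4 * n_params          # Adam second moment
    sharded = fp32_master + adam_m + adam_v   # what the distributed optimizer shards
    gib = 1024 ** 3
    return {
        "replicated_gib": (bf16_params + bf16_grads) / gib,
        "full_opt_state_gib": sharded / gib,                   # ~782 GiB at 70B
        "sharded_opt_state_gib": sharded / dp_degree / gib,    # ~52 GiB at DP=15
    }

mem = optimizer_state_gib(70_000_000_000, dp_degree=15)
```

At 12 bytes of optimizer state per parameter, a 70B model carries roughly 782 GiB of fp32 state; sharding it 15 ways brings the per-worker share down to about 52 GiB, which is the difference between fitting and not fitting alongside the model shards.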

Why Megatron-LM for continued pretraining

The e-Llama post frames the framework choice as a scale requirement:

"Model training at this scale requires an efficient distribution of model and optimizer states across several GPUs and sometimes even across nodes. We use Megatron-LM as a highly-optimized training framework that allows us to use 3D parallelism in training (data parallel (DP), tensor parallel (TP), pipeline parallel (PP) as well as distributed optimizer states, flash-attention-2, among other optimizations."

In other words: at 480 GPUs / 70B parameters / 1T tokens, framework choice is load-bearing for training wall-clock time and GPU utilisation. (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
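To see why the framework choice is load-bearing, it helps to put rough numbers on the run. The standard C ≈ 6·N·D approximation gives the total training compute from the disclosed parameter and token counts; the GPU peak rate and MFU below are assumptions for illustration, since the source discloses neither the GPU model nor the achieved utilisation.

```python
# Back-of-envelope training compute for 70B parameters over 1T tokens,
# using the common C ≈ 6·N·D estimate for transformer training FLOPs.

n_params = 70e9
tokens   = 1e12
total_flops = 6 * n_params * tokens        # ≈ 4.2e23 FLOPs

gpus         = 480
peak_per_gpu = 989e12   # assumed: H100-class dense BF16 peak, FLOP/s
mfu          = 0.40     # assumed Model FLOPs Utilisation

seconds = total_flops / (gpus * peak_per_gpu * mfu)
days = seconds / 86400  # roughly 3-4 weeks under these assumptions
```

A few points of MFU either way moves the wall-clock by days, which is exactly the margin an optimised framework buys.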

What's not disclosed (from the eBay source)

  • The actual DP × TP × PP degrees used across the 480 GPUs (a common choice at this scale is TP=8 within a node, PP=4-8 across node groups, DP fills the rest).
  • Checkpoint cadence + resume-from-failure mechanics.
  • Communication algorithm / collective choice (NCCL version, ring vs tree all-reduce, overlap-with-compute settings).
  • MFU (Model FLOPs Utilisation) or tokens-per-GPU-per-second numbers.
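Although the actual degrees are undisclosed, they are constrained: the three parallelism degrees must multiply to the GPU count, DP × TP × PP = 480. This sketch enumerates the layouts consistent with the "common choice" noted above (TP = 8 inside a node, PP between 4 and 8); which layout eBay actually used is not stated in the source.

```python
# Enumerate (DP, TP, PP) layouts whose product is the GPU count,
# holding TP fixed at the assumed within-node degree of 8.

def layouts(n_gpus: int, tp: int = 8, pp_range=(4, 8)) -> list[tuple[int, int, int]]:
    out = []
    for pp in range(pp_range[0], pp_range[1] + 1):
        dp, rem = divmod(n_gpus, tp * pp)
        if rem == 0:                  # degrees must divide the GPU count exactly
            out.append((dp, tp, pp))  # (DP, TP, PP)
    return out

print(layouts(480))
# → [(15, 8, 4), (12, 8, 5), (10, 8, 6)]
```

Note that PP = 7 and PP = 8 do not divide 480 with TP = 8, so only three layouts survive even before considering pipeline-bubble and communication trade-offs.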