3D parallelism¶
Definition¶
3D parallelism is the standard shape for training billion-plus-parameter language models at multi-node GPU-cluster scale: compose data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP) into a single training job. Each axis solves a different scaling bottleneck; composing them lets the training frontier move along all three.
Together with distributed optimizer states (ZeRO-style sharding of Adam moments across DP workers), activation checkpointing, and flash-attention-2, 3D parallelism is what makes frontier LLM training feasible on realistic hardware.
The three axes¶
| Axis | What's split across GPUs | Communication per step | Typical placement |
|---|---|---|---|
| Data parallel (DP) | the data (model is replicated) | all-reduce of gradients once per step | across the whole cluster |
| Tensor parallel (TP) | individual weight matrices (sliced) | all-reduce / all-gather per layer | within a node, NVLink domain |
| Pipeline parallel (PP) | transformer layers (different GPUs hold different layers) | point-to-point between stage boundaries | across nodes, InfiniBand domain |
MoE models add expert parallelism (EP), with different experts on different GPUs, bringing the axis count to four in practice.
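The tensor-parallel row of the table can be made concrete with a toy sketch, in plain Python with lists standing in for GPUs (all names here are illustrative, not Megatron-LM APIs): a weight matrix is sliced column-wise, each worker computes its slice of the output, and an all-gather reconstructs the full result.

```python
def matmul(x, w):
    # x: [rows][k], w: [k][cols] -> [rows][cols]
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def split_columns(w, tp):
    # Shard a [k][cols] weight across tp workers along the column axis.
    cols = len(w[0]) // tp
    return [[row[r * cols:(r + 1) * cols] for row in w] for r in range(tp)]

x = [[1.0, 2.0]]                    # one input row, hidden size 2
w = [[1.0, 2.0, 3.0, 4.0],          # [2][4] weight
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, tp=2)     # each "GPU" holds a [2][2] slice
partials = [matmul(x, s) for s in shards]

# "All-gather": concatenate the per-worker output slices.
y = [sum((p[i] for p in partials), []) for i in range(len(x))]
assert y == matmul(x, w)            # matches the unsharded computation
```

The same pattern underlies real TP: the per-layer all-gather/all-reduce in the table is exactly this reconstruction step, which is why TP wants NVLink bandwidth.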
Why each axis is needed¶
- DP alone is insufficient at frontier scale: pure data parallelism requires the full model (and its optimizer state) to fit on one GPU, and a 70B+ model does not.
- TP alone is insufficient: TP's per-layer all-reduce is bandwidth-heavy and only works efficiently within the NVLink domain (typically 8 GPUs). You can't TP-8 a 70B model across 60 nodes directly.
- PP alone is insufficient: PP's pipeline bubble eats utilisation unless there are enough micro-batches to fill it; at continued-pretraining batch sizes this is workable, but it cannot be the whole answer on its own.
- Combined: TP=8 fills a node's NVLink domain, PP spans groups of nodes on InfiniBand, DP fills whatever remains to saturate the cluster. The product TP × PP × DP = total GPUs.
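The composition rule above can be sketched as a small validity check; the degrees used here are illustrative, not taken from the source:

```python
# Minimal sketch: check that a (TP, PP, DP) shape tiles a cluster and
# keeps TP inside the NVLink domain.
def layout(total_gpus, gpus_per_node, tp, pp, dp):
    assert tp * pp * dp == total_gpus, "axes must factor the cluster"
    assert tp <= gpus_per_node, "TP should stay inside the NVLink domain"
    return {"tp": tp, "pp": pp, "dp": dp}

# 60 nodes x 8 GPUs, as in the eBay run; the PP/DP degrees are a guess.
shape = layout(total_gpus=480, gpus_per_node=8, tp=8, pp=6, dp=10)
```

The two assertions encode the two hard constraints: the product rule (TP × PP × DP = total GPUs) and the hardware rule (TP bounded by the node size).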
Canonical wiki reference¶
eBay's e-Llama continued pretraining (2025-01-17) names 3D parallelism explicitly as the training shape:
"Model training at this scale requires an efficient distribution of model and optimizer states across several GPUs and sometimes even across nodes. We use Megatron-LM as a highly-optimized training framework that allows us to use 3D parallelism in training (data parallel (DP), tensor parallel (TP), pipeline parallel (PP) as well as distributed optimizer states, flash-attention-2, among other optimizations." (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
Running on 60 nodes × 8 × H100 = 480 GPUs with intra-node NVLink + inter-node InfiniBand, orchestrated by Megatron-LM. The specific DP × TP × PP degrees are not disclosed, but with TP=8 filling each node, PP × DP must factor the remaining 60 nodes; TP=8 / PP=6 / DP=10 is one plausible shape for 70B on 480 GPUs.
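Since the degrees are undisclosed, the most that can be said is the space of legal shapes. With TP fixed at 8 (one NVLink domain), PP × DP must equal 60, so the candidates are just the divisors of 60:

```python
# Enumerate every (TP, PP, DP) shape for a 480-GPU cluster with TP=8.
total, tp = 480, 8
nodes = total // tp                       # 60 node-sized groups remain
shapes = [(tp, pp, nodes // pp) for pp in range(1, nodes + 1)
          if nodes % pp == 0]
assert all(t * p * d == total for t, p, d in shapes)
# (8, 6, 10) and (8, 10, 6) are among the valid shapes
```

Which divisor pair is best depends on the bubble/memory trade-off discussed under design considerations; the enumeration only rules shapes in or out.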
How the axes map to the hardware¶
- NVLink domain (intra-node, hundreds of GB/s) → tensor parallelism, because per-layer all-reduce bandwidth is the bottleneck.
- InfiniBand domain (inter-node, tens of GB/s) → pipeline parallelism, because stage-boundary point-to-point is less bandwidth-sensitive than per-layer all-reduce.
- Whole cluster → data parallelism, because once-per-step all-reduce of gradients is the least frequent collective.
This hardware-to-axis mapping is the architectural lens for understanding a training deployment: "What's the NVLink domain? What's the InfiniBand topology? How do the parallelism degrees factor over that substrate?"
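One common way this mapping shows up in code is the rank decomposition: TP varies fastest so a TP group is contiguous ranks inside one node, PP next, DP slowest. This ordering convention differs by framework; the sketch below shows one common layout, not Megatron-LM's exact API:

```python
def coords(rank, tp, pp):
    # Decompose a global rank into (dp, pp, tp) coordinates.
    t = rank % tp                 # varies fastest: intra-node, NVLink
    p = (rank // tp) % pp         # next: across node groups, InfiniBand
    d = rank // (tp * pp)         # slowest: whole-cluster data parallel
    return d, p, t

# ranks 0..7 share one TP group (one node); rank 8 starts the next PP stage
assert coords(0, tp=8, pp=6) == (0, 0, 0)
assert coords(7, tp=8, pp=6) == (0, 0, 7)
assert coords(8, tp=8, pp=6) == (0, 1, 0)
assert coords(479, tp=8, pp=6) == (9, 5, 7)
```

The design choice is exactly the hardware-to-axis lens: the fastest-varying axis gets the fastest interconnect.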
Design considerations¶
- TP degree ≤ intra-node GPU count for NVLink-bound efficiency.
- Micro-batches per step must be large relative to the PP stage count to amortise the pipeline bubble.
- DP degree × per-GPU micro-batch = global batch size; scale DP conservatively to avoid losing convergence at very large global batch.
- Distributed optimizer (ZeRO-style) shards optimizer state across DP workers — essential for fitting large models at realistic DP; trade-off is an extra reduce-scatter of gradients.
- Sequence parallelism can be added to TP to further reduce memory per GPU at the cost of more communication.
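The bubble-amortisation consideration above has a standard closed form. For a GPipe-style schedule with p pipeline stages and m micro-batches per step, the idle fraction is (p - 1) / (m + p - 1):

```python
def bubble_fraction(p, m):
    # Fraction of pipeline time spent idle: (p-1) / (m + p - 1).
    return (p - 1) / (m + p - 1)

assert bubble_fraction(p=6, m=1) == 5 / 6           # no micro-batching: ~83% idle
assert bubble_fraction(p=6, m=60) == 5 / 65         # 60 micro-batches: ~7.7% idle
```

This is why the design rule says m must be large relative to p: the bubble shrinks roughly as p/m once m dominates.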
Relationship to inference¶
3D parallelism is fundamentally a training-side shape. Inference-side parallelism is a related but different problem:
- Inference has smaller effective batch sizes → pipeline bubble is worse → PP is less attractive.
- Inference is latency-sensitive → TP remains valuable but is more often paired with continuous batching than with PP.
- See concepts/multi-gpu-serving for the inference-side discussion.
Caveats¶
- Axes compose but costs compose too. More parallelism axes = more collective types = more failure surface. Checkpointing and restart logic at 480-GPU scale must handle all three.
- Framework choice is load-bearing. Megatron-LM / DeepSpeed / FSDP each have different opinions on how to compose DP / TP / PP; picking one locks in behaviour.
- Partition-degree space is combinatorial. Tuning TP × PP × DP for a new model / hardware pair involves a small-scale sweep similar to LR tuning.
Seen in¶
- sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development — eBay e-Llama 8B + 70B on 480 H100s, Megatron-LM orchestrating DP + TP + PP + distributed optimizer + flash-attention-2; canonical wiki reference for the full 3D-parallelism training shape.
Related¶
- concepts/data-parallelism / concepts/tensor-parallelism / concepts/pipeline-parallelism / concepts/expert-parallelism
- concepts/multi-gpu-serving — inference-side counterpart.
- concepts/continued-pretraining — the training regime in which eBay's 3D-parallel recipe is applied.
- systems/megatron-lm — the framework orchestrating the eBay 3D-parallel run.
- systems/nvidia-h100 / systems/nvlink / systems/infiniband — the hardware substrate.
- systems/e-llama — the model trained with 3D parallelism in the canonical wiki reference.