
CONCEPT Cited by 1 source

3D parallelism

Definition

3D parallelism is the standard shape for training billion-plus-parameter language models at multi-node GPU-cluster scale: compose data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP) into a single training job. Each axis solves a different scaling bottleneck; composing them lets the training frontier move along all three.

Together with distributed optimizer states (ZeRO-style sharding of Adam moments across DP workers), activation checkpointing, flash-attention-2, and similar optimizations, 3D parallelism is what makes frontier LLM training feasible on realistic hardware.

The three axes

| Axis | What's split across GPUs | Communication per step | Typical placement |
|---|---|---|---|
| Data parallel (DP) | the data (model is replicated) | all-reduce of gradients once per step | across the whole cluster |
| Tensor parallel (TP) | individual weight matrices (sliced) | all-reduce / all-gather per layer | within a node, NVLink domain |
| Pipeline parallel (PP) | transformer layers (different GPUs hold different layers) | point-to-point between stage boundaries | across nodes, InfiniBand domain |
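The table's communication column can be made concrete with a back-of-envelope estimate. The sketch below is illustrative only: model size, layer count, hidden size, and sequence length are assumed round numbers for a 70B-class model, not disclosed figures.

```python
# Back-of-envelope per-step communication volume for each parallelism axis.
# All numbers are illustrative assumptions for a 70B-class model in bf16.

PARAMS = 70e9       # total parameters (assumed)
BYTES = 2           # bf16
LAYERS = 80         # transformer layers (assumed)
HIDDEN = 8192       # hidden size (assumed)
SEQ = 4096          # sequence length (assumed)
MICRO_BATCH = 1     # sequences per micro-batch (assumed)

# DP: one ring all-reduce over all gradients per step (~2x payload on the wire)
dp_bytes = 2 * PARAMS * BYTES

# TP: Megatron-style blocks do ~2 all-reduces of [seq, hidden] activations per
# layer in forward and ~2 in backward -> 4 per layer
tp_bytes = LAYERS * 4 * (SEQ * HIDDEN * MICRO_BATCH * BYTES) * 2

# PP: point-to-point activation transfer at a stage boundary, per micro-batch
pp_bytes = SEQ * HIDDEN * MICRO_BATCH * BYTES

print(f"DP all-reduce : {dp_bytes / 1e9:8.1f} GB/step, once per step")
print(f"TP all-reduce : {tp_bytes / 1e9:8.1f} GB/step, every layer")
print(f"PP p2p        : {pp_bytes / 1e9:8.3f} GB per boundary per micro-batch")
```

The ordering (DP moves the most bytes but least often, TP moves large volumes every layer, PP moves comparatively little) is what motivates the placement column: TP's per-layer traffic must sit on NVLink, while DP's once-per-step collective tolerates the slower inter-node fabric.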

Add Expert parallelism (EP) for MoE models — different experts on different GPUs — bringing the axis count to four in practice.

Why each axis is needed

  • DP alone is insufficient at frontier scale: pure data parallelism requires the full model to fit on one GPU, and a 70B+ model does not fit.
  • TP alone is insufficient: TP's per-layer all-reduce is bandwidth-heavy and only works efficiently within the NVLink domain (typically 8 GPUs). You can't TP-8 a 70B model across 60 nodes directly.
  • PP alone is insufficient: PP's pipeline bubble eats utilisation unless there are enough micro-batches to fill it; at continued-pretraining batch sizes this is workable, but it stops PP alone from being the whole answer.
  • Combined: TP=8 fills a node's NVLink domain, PP spans groups of nodes on InfiniBand, DP fills whatever remains to saturate the cluster. The product TP × PP × DP = total GPUs.
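The closing constraint (TP × PP × DP = total GPUs) can be turned into a small search. The sketch below enumerates valid shapes for a hypothetical cluster; the function name and the 8-GPU NVLink-domain assumption are illustrative, not any framework's API.

```python
# Sketch: enumerate valid (TP, PP, DP) shapes for a cluster, under the usual
# constraints that the degrees factor the GPU count exactly and TP stays
# inside the NVLink domain. Illustrative only.

def parallel_shapes(total_gpus: int, gpus_per_node: int = 8):
    """Return all (tp, pp, dp) triples with tp * pp * dp == total_gpus
    and tp <= gpus_per_node."""
    shapes = []
    for tp in range(1, gpus_per_node + 1):
        if total_gpus % tp:
            continue
        rest = total_gpus // tp
        for pp in range(1, rest + 1):
            if rest % pp == 0:
                shapes.append((tp, pp, rest // pp))
    return shapes

# 60 nodes x 8 GPUs = 480, as in the eBay deployment described below
for tp, pp, dp in parallel_shapes(480):
    if tp == 8 and pp <= 8:  # fill the NVLink domain, keep stages shallow
        print(f"TP={tp} PP={pp} DP={dp}")
```

Note that with TP=8 fixed, PP and DP must factor the remaining 60 ranks exactly, which is why not every round-number shape is achievable.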

Canonical wiki reference

eBay's e-Llama continued pretraining (2025-01-17) names 3D parallelism explicitly as the training shape:

"Model training at this scale requires an efficient distribution of model and optimizer states across several GPUs and sometimes even across nodes. We use Megatron-LM as a highly-optimized training framework that allows us to use 3D parallelism in training (data parallel (DP), tensor parallel (TP), pipeline parallel (PP) as well as distributed optimizer states, flash-attention-2, among other optimizations." (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)

Running on 60 nodes × 8 × H100 = 480 GPUs with intra-node NVLink + inter-node InfiniBand, orchestrated by Megatron-LM. The specific DP × TP × PP degrees are not disclosed, but a plausible shape for 70B on 480 GPUs is TP=8 / PP=6 / DP=10 (8 × 6 × 10 = 480).

How the axes map to the hardware

  • NVLink domain (intra-node, hundreds of GB/s) → tensor parallelism, because per-layer all-reduce bandwidth is the bottleneck.
  • InfiniBand domain (inter-node, tens of GB/s) → pipeline parallelism, because stage-boundary point-to-point is less bandwidth-sensitive than per-layer all-reduce.
  • Whole cluster → data parallelism, because once-per-step all-reduce of gradients is the least frequent collective.
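The mapping above also determines which global ranks end up in which communication group. The sketch below assumes a Megatron-style ordering where TP varies fastest (so TP peers are adjacent ranks on the same node) and PP varies slowest; real frameworks build the actual process groups, and the function here is purely illustrative.

```python
# Sketch: map a global rank to its (TP, DP, PP) coordinates, assuming
# TP varies fastest and PP slowest (a Megatron-style ordering). Illustrative.

def axis_coords(rank: int, tp: int, pp: int, dp: int):
    """Return (tp_idx, dp_idx, pp_idx) for a global rank in a
    tp * dp * pp grid with TP fastest-varying."""
    tp_idx = rank % tp
    dp_idx = (rank // tp) % dp
    pp_idx = rank // (tp * dp)
    return tp_idx, dp_idx, pp_idx

TP, PP, DP = 2, 2, 2  # toy 8-GPU job
for r in range(TP * PP * DP):
    print(r, axis_coords(r, TP, PP, DP))
```

Because TP varies fastest, ranks that differ only in `tp_idx` are physically adjacent and land on the same NVLink domain, while ranks that differ in `pp_idx` are far apart in the grid and can sit on different nodes, matching the hardware-to-axis mapping above.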

This hardware-to-axis mapping is the architectural lens for understanding a training deployment: "What's the NVLink domain? What's the InfiniBand topology? How do the parallelism degrees factor over that substrate?"

Design considerations

  • TP degree ≤ intra-node GPU count for NVLink-bound efficiency.
  • Micro-batch count per step must be large relative to PP stage count to amortise the pipeline bubble.
  • DP degree × per-GPU micro-batch size × gradient-accumulation steps = global batch size; scale DP conservatively to avoid losing convergence at very large global batch.
  • Distributed optimizer (ZeRO-style) shards optimizer state across DP workers — essential for fitting large models at realistic DP; trade-off is an extra reduce-scatter of gradients.
  • Sequence parallelism can be added to TP to further reduce memory per GPU at the cost of more communication.

Relationship to inference

3D parallelism is fundamentally a training-side shape. Inference-side parallelism is a related but different problem:

  • Inference has smaller effective batch sizes → pipeline bubble is worse → PP is less attractive.
  • Inference is latency-sensitive → TP remains valuable but is more often paired with continuous batching than with PP.
  • See concepts/multi-gpu-serving for the inference-side discussion.

Caveats

  • Axes compose but costs compose too. More parallelism axes = more collective types = more failure surface. Checkpointing and restart logic at 480-GPU scale must handle all three.
  • Framework choice is load-bearing. Megatron-LM / DeepSpeed / FSDP each have different opinions on how to compose DP / TP / PP; picking one locks in behaviour.
  • Partition-degree space is combinatorial. Tuning TP × PP × DP for a new model / hardware pair involves a small-scale sweep similar to LR tuning.
