
CONCEPT Cited by 1 source

3D parallelism

Definition

3D parallelism is the standard shape for training billion-plus-parameter language models at multi-node GPU-cluster scale: compose data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP) into a single training job. Each axis solves a different scaling bottleneck; composing them lets the training frontier move along all three.

Together with distributed optimizer states (ZeRO-style sharding of Adam moments across DP workers), activation checkpointing, flash-attention-2, and similar optimizations, 3D parallelism is what makes frontier LLM training feasible on realistic hardware.

The three axes

| Axis | What's split across GPUs | Communication per step | Typical placement |
|---|---|---|---|
| Data parallel (DP) | the data (model is replicated) | all-reduce of gradients once per step | across the whole cluster |
| Tensor parallel (TP) | individual weight matrices (sliced) | all-reduce / all-gather per layer | within a node, NVLink domain |
| Pipeline parallel (PP) | transformer layers (different GPUs hold different layers) | point-to-point between stage boundaries | across nodes, InfiniBand domain |
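The table's communication column can be made concrete with a back-of-envelope estimate. The sketch below is illustrative only: model size, layer count, hidden size, and sequence length are assumed round numbers for a 70B-class model, not disclosed figures.

```python
# Back-of-envelope per-step communication volume for each parallelism axis.
# All numbers are illustrative assumptions for a 70B-class model in bf16.

PARAMS = 70e9       # total parameters (assumed)
BYTES = 2           # bf16
LAYERS = 80         # transformer layers (assumed)
HIDDEN = 8192       # hidden size (assumed)
SEQ = 4096          # sequence length (assumed)
MICRO_BATCH = 1     # sequences per micro-batch (assumed)

# DP: one ring all-reduce over all gradients per step (~2x payload on the wire)
dp_bytes = 2 * PARAMS * BYTES

# TP: Megatron-style blocks do ~2 all-reduces of [seq, hidden] activations per
# layer in forward and ~2 in backward -> 4 per layer
tp_bytes = LAYERS * 4 * (SEQ * HIDDEN * MICRO_BATCH * BYTES) * 2

# PP: point-to-point activation transfer at a stage boundary, per micro-batch
pp_bytes = SEQ * HIDDEN * MICRO_BATCH * BYTES

print(f"DP all-reduce : {dp_bytes / 1e9:8.1f} GB/step, once per step")
print(f"TP all-reduce : {tp_bytes / 1e9:8.1f} GB/step, every layer")
print(f"PP p2p        : {pp_bytes / 1e9:8.3f} GB per boundary per micro-batch")
```

The ordering (DP moves the most bytes but least often, TP moves large volumes every layer, PP moves comparatively little) is what motivates the placement column: TP's per-layer traffic must sit on NVLink, while DP's once-per-step collective tolerates the slower inter-node fabric.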

Add Expert parallelism (EP) for MoE models — different experts on different GPUs — bringing the axis count to four in practice.

Why each axis is needed

  • DP alone is insufficient at frontier scale: pure data parallelism requires the full model to fit on one GPU, and a 70B+ model does not fit.
  • TP alone is insufficient: TP's per-layer all-reduce is bandwidth-heavy and only works efficiently within the NVLink domain (typically 8 GPUs). You can't TP-8 a 70B model across 60 nodes directly.
  • PP alone is insufficient: PP's pipeline bubble eats utilisation unless there are enough micro-batches to fill it; at continued-pretraining batch sizes this is workable, but it stops PP alone from being the whole answer.
  • Combined: TP=8 fills a node's NVLink domain, PP spans groups of nodes on InfiniBand, DP fills whatever remains to saturate the cluster. The product TP × PP × DP = total GPUs.
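The closing constraint (TP × PP × DP = total GPUs) can be turned into a small search. The sketch below enumerates valid shapes for a hypothetical cluster; the function name and the 8-GPU NVLink-domain assumption are illustrative, not any framework's API.

```python
# Sketch: enumerate valid (TP, PP, DP) shapes for a cluster, under the usual
# constraints that the degrees factor the GPU count exactly and TP stays
# inside the NVLink domain. Illustrative only.

def parallel_shapes(total_gpus: int, gpus_per_node: int = 8):
    """Return all (tp, pp, dp) triples with tp * pp * dp == total_gpus
    and tp <= gpus_per_node."""
    shapes = []
    for tp in range(1, gpus_per_node + 1):
        if total_gpus % tp:
            continue
        rest = total_gpus // tp
        for pp in range(1, rest + 1):
            if rest % pp == 0:
                shapes.append((tp, pp, rest // pp))
    return shapes

# 60 nodes x 8 GPUs = 480, as in the eBay deployment described below
for tp, pp, dp in parallel_shapes(480):
    if tp == 8 and pp <= 8:  # fill the NVLink domain, keep stages shallow
        print(f"TP={tp} PP={pp} DP={dp}")
```

Note that with TP=8 fixed, PP and DP must factor the remaining 60 ranks exactly, which is why not every round-number shape is achievable.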

Canonical wiki reference

eBay's e-Llama continued pretraining (2025-01-17) names 3D parallelism explicitly as the training shape:

"Model training at this scale requires an efficient distribution of model and optimizer states across several GPUs and sometimes even across nodes. We use Megatron-LM as a highly-optimized training framework that allows us to use 3D parallelism in training (data parallel (DP), tensor parallel (TP), pipeline parallel (PP) as well as distributed optimizer states, flash-attention-2, among other optimizations." (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)

Running on 60 nodes × 8 × H100 = 480 GPUs with intra-node NVLink + inter-node InfiniBand, orchestrated by Megatron-LM. The specific DP × TP × PP degrees are not disclosed, but a plausible shape for 70B on 480 GPUs is TP=8 / PP=6 / DP=10 (8 × 6 × 10 = 480).

How the axes map to the hardware

  • NVLink domain (intra-node, hundreds of GB/s) → tensor parallelism, because per-layer all-reduce bandwidth is the bottleneck.
  • InfiniBand domain (inter-node, tens of GB/s) → pipeline parallelism, because stage-boundary point-to-point is less bandwidth-sensitive than per-layer all-reduce.
  • Whole cluster → data parallelism, because once-per-step all-reduce of gradients is the least frequent collective.
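The mapping above also determines which global ranks end up in which communication group. The sketch below assumes a Megatron-style ordering where TP varies fastest (so TP peers are adjacent ranks on the same node) and PP varies slowest; real frameworks build the actual process groups, and the function here is purely illustrative.

```python
# Sketch: map a global rank to its (TP, DP, PP) coordinates, assuming
# TP varies fastest and PP slowest (a Megatron-style ordering). Illustrative.

def axis_coords(rank: int, tp: int, pp: int, dp: int):
    """Return (tp_idx, dp_idx, pp_idx) for a global rank in a
    tp * dp * pp grid with TP fastest-varying."""
    tp_idx = rank % tp
    dp_idx = (rank // tp) % dp
    pp_idx = rank // (tp * dp)
    return tp_idx, dp_idx, pp_idx

TP, PP, DP = 2, 2, 2  # toy 8-GPU job
for r in range(TP * PP * DP):
    print(r, axis_coords(r, TP, PP, DP))
```

Because TP varies fastest, ranks that differ only in `tp_idx` are physically adjacent and land on the same NVLink domain, while ranks that differ in `pp_idx` are far apart in the grid and can sit on different nodes, matching the hardware-to-axis mapping above.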

This hardware-to-axis mapping is the architectural lens for understanding a training deployment: "What's the NVLink domain? What's the InfiniBand topology? How do the parallelism degrees factor over that substrate?"

Design considerations

  • TP degree ≤ intra-node GPU count for NVLink-bound efficiency.
  • Micro-batch count per step must be large relative to PP stage count to amortise the pipeline bubble.
  • DP degree × per-GPU micro-batch size × gradient-accumulation steps = global batch size; scale DP conservatively to avoid losing convergence at very large global batch.
  • Distributed optimizer (ZeRO-style) shards optimizer state across DP workers — essential for fitting large models at realistic DP; trade-off is an extra reduce-scatter of gradients.
  • Sequence parallelism can be added to TP to further reduce memory per GPU at the cost of more communication.

Relationship to inference

3D parallelism is fundamentally a training-side shape. Inference-side parallelism is a related but different problem:

  • Inference has smaller effective batch sizes → pipeline bubble is worse → PP is less attractive.
  • Inference is latency-sensitive → TP remains valuable but is more often paired with continuous batching than with PP.
  • See concepts/multi-gpu-serving for the inference-side discussion.

Caveats

  • Axes compose but costs compose too. More parallelism axes = more collective types = more failure surface. Checkpointing and restart logic at 480-GPU scale must handle all three.
  • Framework choice is load-bearing. Megatron-LM / DeepSpeed / FSDP each have different opinions on how to compose DP / TP / PP; picking one locks in behaviour.
  • Partition-degree space is combinatorial. Tuning TP × PP × DP for a new model / hardware pair involves a small-scale sweep similar to LR tuning.
