Data parallelism¶
Definition¶
Data parallelism (DP) is the simplest axis of distributed training: replicate the model across N workers, shard the mini-batch into N pieces, each worker computes its piece of the gradient, then all-reduce the gradients across workers so all replicas stay in sync. Each worker holds a full copy of the model; the communication per step is one gradient all-reduce.
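The mechanics above can be simulated in a few lines. This is a toy sketch, not a real distributed setup: the "workers" are list slices, `all_reduce_mean` stands in for an NCCL all-reduce, and the model is a hypothetical scalar linear regression.

```python
# Minimal simulation of data parallelism: N workers each compute the
# gradient on their shard of the batch; an all-reduce (here, a plain
# mean) brings every replica back in sync.

def local_grad(w, xs, ys):
    # Mean-squared-error gradient on one worker's shard: d/dw mean((w*x - y)^2)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(grads):
    # Stand-in for the gradient all-reduce: every worker ends up with the average.
    return sum(grads) / len(grads)

def dp_step(w, batch_x, batch_y, n_workers, lr=0.1):
    shard = len(batch_x) // n_workers  # assume the batch divides evenly
    grads = [
        local_grad(w, batch_x[i*shard:(i+1)*shard], batch_y[i*shard:(i+1)*shard])
        for i in range(n_workers)
    ]
    g = all_reduce_mean(grads)  # with equal shards this equals the full-batch gradient
    return w - lr * g

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
# Same update whether 4 workers split the batch or 1 worker sees it all:
assert dp_step(1.0, xs, ys, n_workers=4) == dp_step(1.0, xs, ys, n_workers=1)
```

The final assertion is the point of DP: with equal-sized shards, averaging per-worker gradients reproduces the single-worker full-batch gradient exactly, so DP changes throughput, not the math.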
It's the baseline axis; TP and PP exist because pure DP breaks down when the model stops fitting on one GPU.
Canonical wiki reference¶
eBay's e-Llama training uses DP as one axis of 3D parallelism (concepts/3d-parallelism) on 480 H100 GPUs, orchestrated by Megatron-LM with distributed optimizer states (ZeRO-style sharding of Adam moments across DP workers, trading the gradient all-reduce for a reduce-scatter plus all-gather in exchange for a ~N_DP× reduction in per-GPU optimizer-state memory). (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
How DP composes with TP + PP¶
In a 3D-parallel job, total GPUs = DP × TP × PP. Each DP replica is itself a (TP × PP) model-sharded unit. The gradient all-reduce runs across DP replicas, not across TP or PP partitions. Illustrative shape at 70B / 480 GPUs:
- TP = 8 (fills the NVLink intra-node domain).
- PP = 8 (spans groups of nodes across InfiniBand).
- DP = 480 / (8 × 8) = 7.5 — doesn't factor evenly, so this exact shape can't be what eBay ran; the actual DP degree is not disclosed.
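The bookkeeping above reduces to one divisibility check. A hypothetical helper (real frameworks such as Megatron-LM enforce the same constraint at startup):

```python
# World-size bookkeeping for 3D parallelism: total GPUs = DP * TP * PP,
# so the DP degree is whatever is left after TP and PP are fixed.
def dp_degree(world_size, tp, pp):
    model_shard = tp * pp  # GPUs consumed by one model replica
    if world_size % model_shard:
        raise ValueError(
            f"{world_size} GPUs not divisible by TP*PP = {model_shard}"
        )
    return world_size // model_shard

print(dp_degree(512, 8, 8))     # -> 8
# dp_degree(480, 8, 8) raises: 480 / 64 = 7.5, not an integer
```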
Distributed optimizer (ZeRO-style)¶
Pure DP replicates everything — model weights, gradients, optimizer state — on every DP worker. Optimizer state alone (Adam's m + v buffers, each in fp32) is 8 bytes per parameter; for a 70B model that's ~560GB of optimizer state per replica. ZeRO (and Megatron-LM's distributed optimizer) shards this across DP workers so each holds 1/N_DP of the optimizer state. The communication cost is a reduce-scatter of gradients plus an all-gather of updated parameters each step, in place of the basic all-reduce; at large DP the memory savings are typically worth it.
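The back-of-envelope arithmetic, counting only Adam's fp32 m and v buffers at 8 bytes per parameter (illustrative figures, not eBay's measured numbers):

```python
# Per-GPU optimizer-state memory for Adam: fp32 m + v = 8 bytes/param,
# either fully replicated (dp_degree=1) or ZeRO-sharded across DP workers.
def adam_state_gb(n_params, dp_degree=1, bytes_per_param=8):
    return n_params * bytes_per_param / dp_degree / 1e9

print(adam_state_gb(70e9))               # 560.0 GB per GPU, replicated
print(adam_state_gb(70e9, dp_degree=8))  # 70.0 GB per GPU, sharded 8 ways
```

Counting the fp32 master weights that mixed-precision training also keeps would add another 4 bytes per parameter to the sharded total.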
eBay's recipe names "distributed optimizer states" as part of the stack — the specific ZeRO stage is not disclosed.
Caveats¶
- Global batch size grows with DP degree; at very large DP you may exceed the critical batch size, above which larger batches stop improving convergence per sample. This has to be balanced against the available hardware.
- DP is the cheapest axis to scale when the model fits; becomes the bottleneck when the model doesn't.
- Communication-bound at large scale. The bandwidth and latency of the gradient all-reduce determine how much of each step is spent communicating; InfiniBand topology and NCCL tuning both matter.
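The communication caveat can be made concrete with the standard ring all-reduce cost model, in which each worker moves 2·(N−1)/N of the gradient bytes. Everything here is illustrative: the bandwidth figure is assumed, and the latency term is ignored.

```python
# Bandwidth-only estimate of ring all-reduce time per step.
def allreduce_seconds(grad_bytes, n_workers, bw_bytes_per_s):
    # Each worker sends/receives 2*(N-1)/N of the buffer in a ring all-reduce.
    return 2 * (n_workers - 1) / n_workers * grad_bytes / bw_bytes_per_s

# 70B fp16 gradients (~140 GB) at an assumed 50 GB/s effective per-GPU
# bandwidth, DP = 8:
print(allreduce_seconds(140e9, 8, 50e9))  # ~4.9 s spent on gradient sync
```

Estimates like this are why overlapping the all-reduce with the backward pass, and sharding it into a reduce-scatter, matter at scale.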
Seen in¶
- sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development — DP as one axis of the 3D-parallel eBay e-Llama training run, with distributed optimizer states for memory efficiency.
Related¶
- concepts/tensor-parallelism / concepts/pipeline-parallelism / concepts/3d-parallelism
- concepts/multi-gpu-serving — inference-side counterpart.
- systems/megatron-lm — framework composing DP with TP/PP.
- systems/e-llama