CONCEPT Cited by 1 source

Data parallelism

Definition

Data parallelism (DP) is the simplest axis of distributed training: replicate the model across N workers, shard the mini-batch into N pieces, each worker computes its piece of the gradient, then all-reduce the gradients across workers so all replicas stay in sync. Each worker holds a full copy of the model; the communication per step is one gradient all-reduce.
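The invariant in that definition — averaging per-shard gradients reproduces the full-batch gradient, so every replica makes the identical update — can be checked with a toy NumPy simulation (no real `torch.distributed` setup; the "all-reduce" is just a mean over simulated workers):

```python
import numpy as np

def grad(w, X, y):
    # Mean-squared-error gradient for a linear model y_hat = X @ w.
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w = rng.normal(size=4)

# Data parallelism: shard the mini-batch across N "workers",
# each computes the gradient on its own shard only.
N = 4
shard_grads = [grad(w, Xs, ys)
               for Xs, ys in zip(np.split(X, N), np.split(y, N))]

# The all-reduce (here: a plain mean over equal-sized shards)
# leaves every replica holding the same averaged gradient.
dp_grad = np.mean(shard_grads, axis=0)

assert np.allclose(dp_grad, grad(w, X, y))  # matches the full-batch gradient
```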

It's the baseline axis; tensor parallelism (TP) and pipeline parallelism (PP) exist because pure DP breaks down when the model stops fitting on one GPU.

Canonical wiki reference

eBay's e-Llama training uses DP as one axis of concepts/3d-parallelism|3D parallelism on 480 H100 GPUs, orchestrated by Megatron-LM with distributed optimizer states (ZeRO-style sharding of Adam moments across DP workers, trading a reduce-scatter + all-gather of gradients for ~N× reduction in per-GPU optimizer-state memory). (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)

How DP composes with TP + PP

In a 3D-parallel job, total GPUs = DP × TP × PP. Each DP replica is itself a (TP × PP) model-sharded unit. The gradient all-reduce runs across DP replicas, not across TP or PP partitions. Practical shape at 70B / 480 GPUs:

  • TP = 8 (fills the NVLink intra-node domain).
  • PP = 8 (spans groups of nodes across InfiniBand).
  • DP = 480 / (8 × 8) = 7.5 — doesn't factor evenly, so this exact shape is impossible; eBay's actual parallelism degrees are not disclosed.
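The factorization constraint above is just integer arithmetic; a small helper makes the point (the TP = 8, PP = 6 shape below is an illustrative alternative that does factor, not eBay's disclosed configuration):

```python
def dp_degree(total_gpus, tp, pp):
    # In 3D parallelism, total GPUs = DP * TP * PP,
    # so the DP degree must divide the GPU count evenly.
    dp, rem = divmod(total_gpus, tp * pp)
    return dp if rem == 0 else None

print(dp_degree(480, 8, 8))  # None: 480 / 64 = 7.5, doesn't factor
print(dp_degree(480, 8, 6))  # 10: one shape that does factor
```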

Distributed optimizer (ZeRO-style)

Pure DP replicates everything — model weights, gradients, optimizer state — on every DP worker. Optimizer state (Adam's m + v buffers) alone is ~2× the parameter count in fp32; for a 70B model that's ~560GB of optimizer state per replica. ZeRO (and Megatron-LM's distributed optimizer) shard this across DP workers so each holds 1/N_DP of the optimizer state. The communication cost is a reduce-scatter + all-gather per step in place of the basic all-reduce; the memory savings are typically worth it at large DP.
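The memory arithmetic is worth making explicit: two fp32 Adam buffers are 8 bytes per parameter, and ZeRO-style sharding divides that evenly across the DP group (the DP = 8 case below is illustrative, not a disclosed configuration):

```python
def adam_state_gb(params, dp_degree=1, bytes_per_value=4):
    # Adam keeps two fp32 buffers (m and v): 2 * 4 = 8 bytes per parameter.
    # ZeRO-style sharding splits that evenly across the DP workers.
    return 2 * bytes_per_value * params / dp_degree / 1e9

print(adam_state_gb(70e9))     # 560.0 GB replicated on every worker
print(adam_state_gb(70e9, 8))  # 70.0 GB per worker at DP = 8
```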

eBay's recipe names "distributed optimizer states" as part of the stack — the specific ZeRO stage is not disclosed.
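The reduce-scatter + all-gather exchange that the distributed optimizer trades for the plain all-reduce can be sanity-checked numerically — after the gather, every worker ends up with the same averaged result it would have gotten from a direct all-reduce:

```python
import numpy as np

rng = np.random.default_rng(1)
grads = rng.normal(size=(4, 8))  # 4 DP workers, 8 gradient values each

# Plain DP: all-reduce averages the full gradient on every worker.
all_reduced = grads.mean(axis=0)

# ZeRO-style: reduce-scatter leaves each worker with the mean of
# only its own shard (worker i "owns" shard i)...
shards = np.split(grads, 4, axis=1)
owned = [s.mean(axis=0) for s in shards]

# ...each worker applies the optimizer to its shard, then an all-gather
# reassembles the full result on every worker.
gathered = np.concatenate(owned)

assert np.allclose(gathered, all_reduced)
```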

Caveats

  • Global batch size grows with DP degree; at very large DP you may exceed the "critical batch size", beyond which larger batches yield diminishing returns per sample and can hurt final quality. This has to be balanced against available hardware.
  • DP is the cheapest axis to scale when the model fits; becomes the bottleneck when the model doesn't.
  • Communication-bound at large scale. All-reduce time scales with gradient size divided by link bandwidth, plus per-hop latency; together these determine how much of each step is spent communicating. InfiniBand topology and NCCL tuning both matter.
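A rough feel for the communication caveat comes from the standard ring all-reduce cost model: each link carries ~2(N−1)/N of the gradient volume, plus 2(N−1) latency hops. The numbers below (70B fp16 gradients, 50 GB/s effective per-link bandwidth, 5 µs hop latency) are illustrative assumptions, not measured values:

```python
def allreduce_seconds(grad_bytes, n, bandwidth_bytes_per_s, latency_s=5e-6):
    # Ring all-reduce cost model: 2*(N-1)/N of the payload crosses
    # each link, plus 2*(N-1) latency terms for the hops.
    volume = 2 * (n - 1) / n * grad_bytes
    return volume / bandwidth_bytes_per_s + 2 * (n - 1) * latency_s

# Illustrative only: 70B params in fp16 -> ~140 GB of gradients,
# 50 GB/s effective link bandwidth, DP group of 8.
print(allreduce_seconds(140e9, 8, 50e9))
```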

Seen in
