CONCEPT Cited by 1 source

Data parallelism

Definition

Data parallelism (DP) is the simplest axis of distributed training: replicate the model across N workers, shard the mini-batch into N pieces, each worker computes its piece of the gradient, then all-reduce the gradients across workers so all replicas stay in sync. Each worker holds a full copy of the model; the communication per step is one gradient all-reduce.
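The invariant in that definition — averaging per-shard gradients reproduces the full-batch gradient, so every replica makes the identical update — can be checked with a toy NumPy simulation (no real `torch.distributed` setup; the "all-reduce" is just a mean over simulated workers):

```python
import numpy as np

def grad(w, X, y):
    # Mean-squared-error gradient for a linear model y_hat = X @ w.
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w = rng.normal(size=4)

# Data parallelism: shard the mini-batch across N "workers",
# each computes the gradient on its own shard only.
N = 4
shard_grads = [grad(w, Xs, ys)
               for Xs, ys in zip(np.split(X, N), np.split(y, N))]

# The all-reduce (here: a plain mean over equal-sized shards)
# leaves every replica holding the same averaged gradient.
dp_grad = np.mean(shard_grads, axis=0)

assert np.allclose(dp_grad, grad(w, X, y))  # matches the full-batch gradient
```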

It's the baseline axis; tensor parallelism (TP) and pipeline parallelism (PP) exist because pure DP breaks down when the model stops fitting on one GPU.

Canonical wiki reference

eBay's e-Llama training uses DP as one axis of concepts/3d-parallelism|3D parallelism on 480 H100 GPUs, orchestrated by Megatron-LM with distributed optimizer states (ZeRO-style sharding of Adam moments across DP workers, trading a reduce-scatter + all-gather of gradients for ~N× reduction in per-GPU optimizer-state memory). (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)

How DP composes with TP + PP

In a 3D-parallel job, total GPUs = DP × TP × PP. Each DP replica is itself a (TP × PP) model-sharded unit. The gradient all-reduce runs across DP replicas, not across TP or PP partitions. Practical shape at 70B / 480 GPUs:

  • TP = 8 (fills the NVLink intra-node domain).
  • PP = 8 (spans groups of nodes across InfiniBand).
  • DP = 480 / (8 × 8) = 7.5 — doesn't factor evenly, so this exact shape is impossible; eBay's actual parallelism degrees are not disclosed.
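The factorization constraint above is just integer arithmetic; a small helper makes the point (the TP = 8, PP = 6 shape below is an illustrative alternative that does factor, not eBay's disclosed configuration):

```python
def dp_degree(total_gpus, tp, pp):
    # In 3D parallelism, total GPUs = DP * TP * PP,
    # so the DP degree must divide the GPU count evenly.
    dp, rem = divmod(total_gpus, tp * pp)
    return dp if rem == 0 else None

print(dp_degree(480, 8, 8))  # None: 480 / 64 = 7.5, doesn't factor
print(dp_degree(480, 8, 6))  # 10: one shape that does factor
```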

Distributed optimizer (ZeRO-style)

Pure DP replicates everything — model weights, gradients, optimizer state — on every DP worker. Optimizer state (Adam's m + v buffers) alone is ~2× the parameter count in fp32; for a 70B model that's ~560GB of optimizer state per replica. ZeRO (and Megatron-LM's distributed optimizer) shard this across DP workers so each holds 1/N_DP of the optimizer state. The communication cost is a reduce-scatter + all-gather per step in place of the basic all-reduce; the memory savings are typically worth it at large DP.
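The memory arithmetic is worth making explicit: two fp32 Adam buffers are 8 bytes per parameter, and ZeRO-style sharding divides that evenly across the DP group (the DP = 8 case below is illustrative, not a disclosed configuration):

```python
def adam_state_gb(params, dp_degree=1, bytes_per_value=4):
    # Adam keeps two fp32 buffers (m and v): 2 * 4 = 8 bytes per parameter.
    # ZeRO-style sharding splits that evenly across the DP workers.
    return 2 * bytes_per_value * params / dp_degree / 1e9

print(adam_state_gb(70e9))     # 560.0 GB replicated on every worker
print(adam_state_gb(70e9, 8))  # 70.0 GB per worker at DP = 8
```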

eBay's recipe names "distributed optimizer states" as part of the stack — the specific ZeRO stage is not disclosed.
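The reduce-scatter + all-gather exchange that the distributed optimizer trades for the plain all-reduce can be sanity-checked numerically — after the gather, every worker ends up with the same averaged result it would have gotten from a direct all-reduce:

```python
import numpy as np

rng = np.random.default_rng(1)
grads = rng.normal(size=(4, 8))  # 4 DP workers, 8 gradient values each

# Plain DP: all-reduce averages the full gradient on every worker.
all_reduced = grads.mean(axis=0)

# ZeRO-style: reduce-scatter leaves each worker with the mean of
# only its own shard (worker i "owns" shard i)...
shards = np.split(grads, 4, axis=1)
owned = [s.mean(axis=0) for s in shards]

# ...each worker applies the optimizer to its shard, then an all-gather
# reassembles the full result on every worker.
gathered = np.concatenate(owned)

assert np.allclose(gathered, all_reduced)
```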

Caveats

  • Global batch size grows with DP degree; at very large DP you may exceed the "critical batch size", beyond which larger batches yield diminishing returns per sample and can hurt final quality. This has to be balanced against available hardware.
  • DP is the cheapest axis to scale when the model fits; becomes the bottleneck when the model doesn't.
  • Communication-bound at large scale. All-reduce time scales with gradient size divided by link bandwidth, plus per-hop latency; together these determine how much of each step is spent communicating. InfiniBand topology and NCCL tuning both matter.
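A rough feel for the communication caveat comes from the standard ring all-reduce cost model: each link carries ~2(N−1)/N of the gradient volume, plus 2(N−1) latency hops. The numbers below (70B fp16 gradients, 50 GB/s effective per-link bandwidth, 5 µs hop latency) are illustrative assumptions, not measured values:

```python
def allreduce_seconds(grad_bytes, n, bandwidth_bytes_per_s, latency_s=5e-6):
    # Ring all-reduce cost model: 2*(N-1)/N of the payload crosses
    # each link, plus 2*(N-1) latency terms for the hops.
    volume = 2 * (n - 1) / n * grad_bytes
    return volume / bandwidth_bytes_per_s + 2 * (n - 1) * latency_s

# Illustrative only: 70B params in fp16 -> ~140 GB of gradients,
# 50 GB/s effective link bandwidth, DP group of 8.
print(allreduce_seconds(140e9, 8, 50e9))
```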

Seen in
