FSDP (Fully Sharded Data Parallel)¶
Definition¶
FSDP (Fully Sharded Data Parallel) is a PyTorch-native distributed-training strategy that shards parameters, gradients, and optimizer state across all data-parallel workers, materialising full tensors only when needed for computation. It is the mainstream approach to training LLMs that don't fit in a single GPU's memory, sitting alongside tensor parallelism and pipeline parallelism in the 3D-parallelism design space.
First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix — where Netflix discusses long-tail sequence-length stragglers under FSDP and the on-the-fly sequence packing mitigation.
What FSDP shards¶
Unlike plain data parallelism (DP), which replicates the full model on every worker, FSDP shards:
- Parameters: each worker holds a shard; all-gather when needed for forward/backward.
- Gradients: reduce-scatter so each worker holds gradient shards for its parameter shards.
- Optimizer state: sharded alongside the parameter shards.
Result: peak GPU memory for parameters, gradients, and optimizer state is roughly (param_bytes + grad_bytes + optimizer_bytes) / world_size, rather than the full sum on every worker (activations and transient all-gather buffers come on top).
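A minimal sketch of what this looks like in PyTorch (standard FSDP usage, not code from the source; MyTransformer and dataloader are hypothetical placeholders):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")                        # one process per GPU, e.g. launched via torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = MyTransformer().cuda()                         # hypothetical model class
model = FSDP(model)                                    # each rank now holds only a shard of the parameters

# The optimizer only ever sees this rank's parameter shards, so its state is sharded too.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:                               # hypothetical dataloader
    loss = model(**batch).loss                         # forward: params are all-gathered layer by layer, then freed
    loss.backward()                                    # backward: gradients are reduce-scattered back to shards
    optimizer.step()
    optimizer.zero_grad()
```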
Why FSDP creates straggler problems for variable-length sequences¶
Per the Netflix post:
"In FSDP-style training, long-tail sequences create stragglers: faster workers end up waiting at synchronization points for the slowest batch, lowering utilization." (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)
The mechanism:
- FSDP follows an SPMD (single program, multiple data) model: every worker runs the same step function.
- Each step has well-defined synchronisation points (all-gather of params, reduce-scatter of grads).
- If one worker's batch contains an unusually long sequence, that worker takes longer to finish forward/backward.
- All faster workers block at the sync point waiting for the slowest one.
Even if average sequence length is moderate, the tail of the length distribution dictates step latency — a classic straggler pattern.
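A toy simulation (assumed numbers, not from the post) makes the effect concrete: because every rank waits at the sync point, step time is the maximum of per-worker compute times, so utilization drops as the length distribution's tail grows heavier.

```python
import random

def simulate(world_size=64, steps=200, batch_per_worker=8):
    """Compare synchronized step time (max across ranks) with perfectly balanced work."""
    total, ideal = 0.0, 0.0
    for _ in range(steps):
        # Assumed lognormal sequence lengths as a stand-in for a skewed real distribution.
        per_worker = [
            sum(random.lognormvariate(6, 1) for _ in range(batch_per_worker))
            for _ in range(world_size)
        ]
        total += max(per_worker)               # every rank waits for the slowest one
        ideal += sum(per_worker) / world_size  # work spread perfectly evenly
    return ideal / total                        # effective utilization

print(f"utilization ~= {simulate():.2f}")
```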
Mitigations¶
- Offline bin-packing: pre-compute which samples to pack into fixed-length sequences before training starts. Good for small datasets; at Netflix's data scale it "adds substantial preprocessing latency and makes it harder to keep datasets fresh."
- On-the-fly async sequence packing (Netflix's choice): stream samples from storage, dynamically pack them in memory with document masks, run packing asynchronously to overlap CPU work with GPU compute. Up to 4.7× effective token throughput on the most sequence-length-skewed internal dataset (Figure 5, A100 + H200).
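A simplified sketch of the on-the-fly idea (this is not Netflix's implementation; pack_stream, max_len, and pad_id are illustrative): greedily pack streamed token lists into fixed-length sequences and emit per-token document ids, from which a block-diagonal attention mask can be built so packed documents do not attend to each other.

```python
from typing import Iterable, Iterator

def pack_stream(token_streams: Iterable[list[int]],
                max_len: int = 4096,
                pad_id: int = 0) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (tokens, doc_ids) pairs of length max_len; doc_id == -1 marks padding."""
    buf, doc_ids, next_doc = [], [], 0
    for tokens in token_streams:
        if buf and len(buf) + len(tokens) > max_len:
            yield buf + [pad_id] * (max_len - len(buf)), doc_ids + [-1] * (max_len - len(doc_ids))
            buf, doc_ids = [], []
        tokens = tokens[:max_len]                      # truncate pathological outliers
        buf.extend(tokens)
        doc_ids.extend([next_doc] * len(tokens))
        next_doc += 1
    if buf:
        yield buf + [pad_id] * (max_len - len(buf)), doc_ids + [-1] * (max_len - len(doc_ids))
```

In the post this packing runs asynchronously on CPU so it overlaps with GPU compute; the sketch shows only the packing logic itself.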
Interaction with other parallelism dimensions¶
FSDP is typically combined with concepts/tensor-parallelism (for very large layers) and pipeline parallelism (for very deep models) in 3D-parallelism configurations. Netflix's framework offers high-level sharding APIs so developers can distribute large models across device meshes without writing low-level distributed code.
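As a rough sketch of how such a 2D (FSDP + tensor-parallel) configuration is commonly expressed in stock PyTorch (this assumes a recent PyTorch with DeviceMesh support and is not Netflix's internal API; MyTransformerBlock and the module names in the plan are hypothetical):

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.parallel import (
    ColwiseParallel, RowwiseParallel, parallelize_module,
)

dist.init_process_group("nccl")
mesh = init_device_mesh("cuda", (8, 4), mesh_dim_names=("dp", "tp"))  # 8 x 4 = 32 GPUs

model = MyTransformerBlock().cuda()          # hypothetical module with mlp.up / mlp.down linears
model = parallelize_module(                  # shard the big linears across the "tp" mesh dimension
    model,
    mesh["tp"],
    {"mlp.up": ColwiseParallel(), "mlp.down": RowwiseParallel()},
)
model = FSDP(model, device_mesh=mesh["dp"], use_orig_params=True)  # data-parallel sharding across "dp"
```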