CONCEPT

FSDP (Fully Sharded Data Parallel)

Definition

FSDP (Fully Sharded Data Parallel) is a PyTorch-native distributed-training strategy that shards parameters, gradients, and optimizer state across all data-parallel workers, materialising full tensors only when needed for computation. It is a mainstream approach to training LLMs that do not fit in a single GPU's memory, sitting alongside tensor parallelism and pipeline parallelism in the 3D-parallelism design space.

First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix — where Netflix discusses long-tail sequence-length stragglers under FSDP and the on-the-fly sequence packing mitigation.

What FSDP shards

Unlike plain data parallelism (DP), which replicates the full model on every worker, FSDP shards:

  • Parameters: each worker holds a shard; all-gather when needed for forward/backward.
  • Gradients: reduce-scatter so each worker holds gradient shards for its parameter shards.
  • Optimizer state: sharded alongside the parameter shards.

Result: per-worker GPU memory for model state is ~(param_bytes + grad_bytes + optimizer_bytes) / world_size rather than the full total on every worker. (Activation memory is not sharded and must be managed separately.)
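A rough back-of-the-envelope sketch of that memory arithmetic (hypothetical byte counts, assuming bf16 params/grads and fp32 Adam state; real peak memory also includes activations and communication buffers):

```python
def fsdp_model_state_bytes(n_params: float, world_size: int,
                           param_bytes: int = 2,    # bf16 parameters
                           grad_bytes: int = 2,     # bf16 gradients
                           optim_bytes: int = 12):  # fp32 master copy + 2 Adam moments
    """Approximate per-worker model-state memory under full sharding."""
    per_param = param_bytes + grad_bytes + optim_bytes
    return n_params * per_param / world_size

# 7B-parameter model: fully sharded across 64 workers vs. replicated (plain DP)
sharded = fsdp_model_state_bytes(7e9, world_size=64)   # ~1.75 GB per worker
replicated = fsdp_model_state_bytes(7e9, world_size=1)  # ~112 GB per worker
```

The 64× reduction in model-state memory is what lets a model that would never fit on one GPU train at all; the trade is the all-gather/reduce-scatter communication at every step.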

Why FSDP creates straggler problems for variable-length sequences

Per the Netflix post:

"In FSDP-style training, long-tail sequences create stragglers: faster workers end up waiting at synchronization points for the slowest batch, lowering utilization." (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)

The mechanism:

  1. FSDP is an SPMD model — every worker runs the same step function.
  2. Each step has well-defined synchronisation points (all-gather of params, reduce-scatter of grads).
  3. If one worker's batch contains an unusually long sequence, that worker takes longer to finish forward/backward.
  4. All faster workers block at the sync point waiting for the slowest one.

Even if average sequence length is moderate, the tail of the length distribution dictates step latency — a classic straggler pattern.
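A toy simulation of the mechanism above (hypothetical timing model in which a worker's step time is driven by the longest sequence in its batch) makes the max-over-workers behaviour concrete:

```python
def step_time(worker_batches):
    """Synchronous SPMD step: every worker blocks at the sync point
    until the slowest worker finishes, so step latency is the max."""
    # Assume per-worker compute time scales with its longest sequence.
    per_worker = [max(batch) for batch in worker_batches]
    return max(per_worker)

# 8 workers, mostly short sequences; one worker drew a long-tail sample.
batches = [[256, 256, 256, 256]] * 7 + [[256, 256, 8192, 256]]
# Step latency is dictated by the single 8192-token outlier,
# even though the mean sequence length is ~504 tokens.
latency = step_time(batches)
```

All seven fast workers idle at the sync point for the duration of the 8192-token forward/backward, which is exactly the utilization loss the Netflix post describes.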

Mitigations

  • Offline bin-packing: pre-compute which samples to pack into fixed-length sequences before training starts. Good for small datasets; at Netflix's data scale it "adds substantial preprocessing latency and makes it harder to keep datasets fresh."
  • On-the-fly async sequence packing (Netflix's choice): stream samples from storage, dynamically pack them in memory with document masks, run packing asynchronously to overlap CPU work with GPU compute. Up to 4.7× effective token throughput on the most sequence-length-skewed internal dataset (Figure 5, A100 + H200).
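A minimal, synchronous sketch of the packing idea (hypothetical first-fit greedy packing into a fixed context length; the Netflix implementation additionally streams samples and runs packing asynchronously, and builds document masks so packed documents don't attend to each other):

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit packing of document lengths into fixed-size packs.

    Each returned pack is a list of document lengths summing to <= max_len;
    the per-pack boundaries are what a block-diagonal document mask is built
    from, so attention never crosses document boundaries within a pack.
    """
    packs = []
    for n in lengths:
        n = min(n, max_len)  # truncate anything longer than the context
        for pack in packs:
            if sum(pack) + n <= max_len:
                pack.append(n)
                break
        else:
            packs.append([n])
    return packs

docs = [300, 1200, 150, 4000, 220, 90]
packs = pack_sequences(docs, max_len=4096)
# -> [[300, 1200, 150, 220, 90], [4000]]: 6 documents in 2 packs
```

Packing turns many short, padded batches into a few dense ones, which is where the "effective token throughput" gain comes from: fewer pad tokens per step and less variance in per-worker work.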

Interaction with other parallelism dimensions

FSDP is typically combined with concepts/tensor-parallelism (for very large layers) and pipeline parallelism (for very deep models) in 3D-parallelism configurations. Netflix's framework offers high-level sharding APIs so developers can distribute large models across device meshes without writing low-level distributed code.
