SPMD execution model¶
Definition¶
SPMD (Single Program, Multiple Data) is a distributed-execution model in which every worker runs the same step function on a different shard of the data, synchronising through collective primitives (all-reduce, all-gather, reduce-scatter) at well-defined points in each step. It is the dominant execution model for neural-network pre-training and supervised fine-tuning because the learning signal is dense, immediate, and differentiable at each worker independently.
First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix — where Netflix uses the SPMD/single-controller contrast to explain why their internal Post-Training Framework had to evolve from pure SPMD to a hybrid execution model when adding RL support.
Shape of the computation¶
In a canonical SPMD SFT step, for each GPU worker g in a cluster of G:
shard_g = data_loader.next()      # different data on each worker
logits_g = model(shard_g)         # same forward pass on every worker
loss_g = loss_fn(logits_g, ...)   # same loss fn on every worker
loss_g.backward()                 # same backward pass; gradients accumulate locally
all_reduce(grads_g)               # collective: average gradients across all G workers
optimizer.step()                  # identical parameter update on every worker
Every GPU runs the same program; the differences are isolated to:
- The shard of data each one sees.
- The shard of parameters / gradients / optimizer state each one holds (under FSDP or tensor parallelism).
Coordination is implicit through synchronous collectives. There is no "control node" deciding the schedule — every worker independently executes the next step.
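The lockstep structure above can be simulated in a few lines of plain Python. This toy model uses a single scalar weight and a hand-rolled `all_reduce_mean`; it stands in for the real distributed collectives only to show the key invariant: because the all-reduce delivers the same averaged gradient to every worker, every worker ends each step with identical parameters.

```python
# Toy single-process simulation of one SPMD data-parallel step.
# Illustrative only: real systems use torch.distributed collectives.
# The "model" is a single scalar weight w with per-example loss (w*x - y)^2.

def local_grad(w, shard):
    # Gradient of mean squared error over this worker's local shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    # Synchronous collective: every worker receives the same mean.
    return sum(values) / len(values)

def spmd_step(w, shards, lr=0.1):
    grads = [local_grad(w, s) for s in shards]  # same program, different data
    g = all_reduce_mean(grads)                  # all-reduce: identical result everywhere
    return w - lr * g                           # so every worker applies the same update

# Three simulated workers, one (x, y) pair each; the data is consistent
# with w = 2, so SPMD gradient descent should converge there.
shards = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)]]
w = 0.0
for _ in range(50):
    w = spmd_step(w, shards)
```

Note that `spmd_step` never asks "which worker am I?" for control flow: the program is identical everywhere, and only the data shard differs, which is exactly the property that makes scaling "launch more identical workers."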
Why SPMD fits SFT naturally¶
SFT inherits its signal shape from pre-training (concepts/on-policy-rl-vs-sft-signal-shape):
- Dense per-token loss — every position contributes, so each batch of shard-local data produces a complete gradient contribution.
- Differentiable end-to-end — loss → backward → grads in one straight chain.
- No cross-phase coordination — each step is independent of what other workers do next.
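The "dense per-token loss" point can be made concrete with a toy per-token negative log-likelihood: every position contributes a term, so a shard of data alone yields a complete loss (and hence a complete gradient contribution) with no external signal. The helper names here are hypothetical, not from any particular framework.

```python
# Toy illustration of a dense per-token loss: every token position
# contributes, so one local shard produces a full learning signal.
import math

def token_nll(p_correct):
    # Negative log-likelihood of the correct token at one position.
    return -math.log(p_correct)

def sequence_loss(token_probs):
    # Mean of per-token losses: dense, immediate, and differentiable,
    # with no dependence on any other worker or phase.
    return sum(token_nll(p) for p in token_probs) / len(token_probs)

# Probabilities the model assigned to the correct token at each position.
loss = sequence_loss([0.9, 0.8, 0.95])
```

Contrast this with on-policy RL, where the scalar reward arrives only after a whole rollout has been generated and scored by a separate component.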
Under these conditions, the SPMD mental model (plus PyTorch distributed primitives) is simple, scalable, and fault-tolerant: launch more identical Ray actors / workers and the system scales.
Why SPMD breaks down for on-policy RL¶
Per Netflix:
"In our original SFT architecture, the driver node was intentionally 'thin': it launched N identical Ray actors, each encapsulating the full training loop, and scaling meant launching more identical workers. That model breaks down for RL." (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)
On-policy RL decomposes into distinct roles (Policy, Rollout Workers, Reward Model, Reference Model) whose end-to-end coordination is not expressible as a single step function every worker runs in lockstep. The system needs an active controller (see concepts/single-controller-rl-orchestration) to decide when to generate rollouts, how to batch and score them, when to trigger optimisation, and how to reallocate GPU resources across phases.
Netflix's framework resolves this with a hybrid execution model (patterns/hybrid-single-controller-plus-spmd-rl): a single-controller orchestration layer at the top (handled by Verl's Ray-actor lifecycle backend) with SPMD sub-stages (rollout, reward, reference, policy update) running underneath.
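A minimal sketch of that hybrid shape, with all names (`spmd`, `generate`, `score`, `controller_step`) purely illustrative rather than Verl's or Netflix's actual API: the controller, not the workers, sequences the phases, while each phase is internally SPMD across its own worker group.

```python
# Toy single-process sketch of the hybrid execution model:
# a single controller sequences the RL phases; each phase is
# internally SPMD (identical function, different data per worker).
# All names here are illustrative, not any framework's real API.

def spmd(fn, shards):
    # One SPMD sub-stage: every worker runs the same fn on its shard.
    return [fn(s) for s in shards]

def generate(prompt):
    # Stand-in rollout worker: produce a completion for a prompt.
    return prompt + " -> completion"

def score(rollout):
    # Stand-in reward-model worker: assign a scalar reward.
    return float(len(rollout))

def controller_step(prompts):
    # The controller decides the schedule: generate first, then score,
    # then hand the scored batch to the policy-update phase (omitted).
    rollouts = spmd(generate, prompts)
    rewards = spmd(score, rollouts)
    return list(zip(rollouts, rewards))

batch = controller_step(["p1", "p2"])
```

The key structural difference from pure SPMD is that `controller_step` contains cross-phase control flow that no individual worker could execute in lockstep: only the controller knows when one phase ends and the next begins.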
When to use SPMD¶
- ✅ Pre-training, SFT, DPO, and knowledge distillation — any workload whose step function is identical across workers and whose signal is dense.
- ❌ On-policy RL with separate Policy / Rollout / Reward / Reference roles — use single-controller over SPMD.
Related¶
- sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix
- concepts/single-controller-rl-orchestration
- concepts/on-policy-rl-vs-sft-signal-shape
- concepts/fsdp-fully-sharded-data-parallel
- systems/pytorch
- systems/ray
- systems/netflix-post-training-framework
- patterns/hybrid-single-controller-plus-spmd-rl