
SPMD execution model

Definition

SPMD (Single Program, Multiple Data) is a distributed-execution model in which every worker runs the same step function on a different shard of the data, synchronising through collective primitives (all-reduce, all-gather, reduce-scatter) at well-defined points in each step. It is the dominant execution model for neural-network pre-training and supervised fine-tuning because the learning signal is dense, immediate, and differentiable at each worker independently.
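For orientation, the collective primitives named above map directly onto torch.distributed calls. The sketch below is illustrative only; it assumes a process group has already been initialised and that each worker owns one GPU.

import torch
import torch.distributed as dist

# assumes dist.init_process_group(...) has already run on every worker,
# and each worker process is pinned to its own GPU
world = dist.get_world_size()

t = torch.ones(4, device="cuda") * dist.get_rank()

dist.all_reduce(t, op=dist.ReduceOp.SUM)       # every worker ends up with the summed tensor
gathered = [torch.empty_like(t) for _ in range(world)]
dist.all_gather(gathered, t)                   # every worker ends up with every worker's tensor
out = torch.empty_like(t)
dist.reduce_scatter(out, gathered)             # each worker keeps only its reduced shard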

First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix — where Netflix uses the SPMD/single-controller contrast to explain why their internal Post-Training Framework had to evolve from pure SPMD to a hybrid execution model when adding RL support.

Shape of the computation

In a canonical SPMD SFT step, for each GPU worker g in a cluster of G:

shard_g  = data_loader.next()        # different data shard on each worker
logits_g = model(shard_g)            # same forward pass on every worker
loss_g   = loss_fn(logits_g, ...)    # same loss fn on every worker
loss_g.backward()                    # same backward pass; populates grads_g locally
all_reduce(grads_g)                  # collective: average grads across workers 0..G-1
optimizer.step()                     # identical update on every worker

Every GPU runs the same program; the differences are isolated to:

  1. The shard of data each one sees.
  2. The shard of parameters / gradients / optimizer state each one holds (under FSDP or tensor parallelism).

Coordination is implicit through synchronous collectives. There is no "control node" deciding the schedule — every worker independently executes the next step.
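Concretely, in PyTorch the same step looks roughly like the sketch below. This is a hedged illustration, not Netflix's code: model, data_loader, and loss_fn are placeholders, and the gradient sync is written out by hand (rather than via DistributedDataParallel) so the all-reduce stays visible.

import torch
import torch.distributed as dist

# every worker runs this identical loop; only its data shard differs
# (assumes dist.init_process_group(...) ran at startup; model / data_loader /
#  loss_fn / optimizer are placeholders supplied by the surrounding script)
world_size = dist.get_world_size()

for batch in data_loader:                       # shard_g: each worker's sampler yields different data
    logits = model(batch["input_ids"])          # same forward pass on every worker
    loss = loss_fn(logits, batch["labels"])     # same loss fn on every worker
    loss.backward()                             # same backward pass; grads land in p.grad
    for p in model.parameters():                # explicit all-reduce (DDP would do this for you)
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size                    # average, so every replica applies the same update
    optimizer.step()
    optimizer.zero_grad()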

Why SPMD fits SFT naturally

SFT inherits its signal shape from pre-training (concepts/on-policy-rl-vs-sft-signal-shape):

  • Dense per-token loss — every position contributes, so each batch of shard-local data produces a complete gradient contribution.
  • Differentiable end-to-end — loss → backward → grads in one straight chain.
  • No cross-phase coordination — each step is independent of what other workers' next-step decisions will be.

Under these conditions, the SPMD mental model (plus PyTorch distributed primitives) is simple, scalable, and fault-tolerant: to scale, launch more identical Ray actors / workers.
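A sketch of that "thin driver" pattern, assuming Ray; the actor class and helper names are placeholders, not the framework's actual API:

import ray

ray.init()

@ray.remote(num_gpus=1)
class SFTWorker:
    """One of N identical actors; the whole SPMD step loop lives inside."""
    def run(self, rank: int, world_size: int) -> None:
        # init_process_group(...), build model/optimizer, then run the step loop sketched above
        ...

world_size = 8  # placeholder cluster size
workers = [SFTWorker.remote() for _ in range(world_size)]
ray.get([w.run.remote(rank, world_size) for rank, w in enumerate(workers)])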

Why SPMD breaks down for on-policy RL

Per Netflix:

"In our original SFT architecture, the driver node was intentionally 'thin': it launched N identical Ray actors, each encapsulating the full training loop, and scaling meant launching more identical workers. That model breaks down for RL." (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)

On-policy RL decomposes into distinct roles (Policy, Rollout Workers, Reward Model, Reference Model) whose end-to-end coordination is not expressible as a single step function every worker runs in lockstep. The system needs an active controller (see concepts/single-controller-rl-orchestration) to decide when to generate rollouts, how to batch and score them, when to trigger optimisation, and how to reallocate GPU resources across phases.

Netflix's framework resolves this with a hybrid execution model (patterns/hybrid-single-controller-plus-spmd-rl): a single-controller orchestration layer at the top (handled by Verl's Ray-actor lifecycle backend) with SPMD sub-stages (rollout, reward, reference, policy update) running underneath.
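The shape of that hybrid loop, sketched in pseudocode; the worker-group names and methods are illustrative placeholders, not Verl's actual API:

# one RL step, driven by the controller; each *_group call is itself an SPMD sub-stage
def rl_step(controller):
    prompts  = controller.sample_prompts()                     # controller decides what to generate
    rollouts = controller.rollout_group.generate(prompts)      # SPMD stage: inference workers
    rewards  = controller.reward_group.score(rollouts)         # SPMD stage: reward-model workers
    ref_logp = controller.reference_group.logprobs(rollouts)   # SPMD stage: reference-model workers
    batch    = controller.make_batch(rollouts, rewards, ref_logp)
    controller.policy_group.update(batch)                      # SPMD stage: policy optimisation
    controller.maybe_reallocate_gpus()                         # controller moves resources between phases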

When to use SPMD

  • ✅ Pre-training, SFT, DPO, and knowledge distillation — any workload whose step function is identical across workers and whose signal is dense.
  • ❌ On-policy RL with separate Policy / Rollout / Reward / Reference roles — use single-controller over SPMD.