PATTERN
# Hybrid single-controller + SPMD RL execution
## Intent
Support both SFT and on-policy RL (GRPO, PPO) workflows in a single LLM post-training framework without forcing SFT users to pay for RL's orchestration overhead: run SFT under a pure SPMD model, and run RL sub-stages (rollout, reward, reference, policy update) as SPMD workloads underneath a single-controller orchestration layer.
First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix — Netflix integrates Verl's backend (Ray actor lifecycle plus GPU resource allocation) as the single-controller layer, keeps its existing SPMD SFT path, and exposes a unified user API.
## Problem
An LLM post-training platform built around a pure SPMD execution model (as Netflix's was initially) fits SFT cleanly — every GPU runs the same step function on different data, synchronised by collectives. But on-policy RL in 2025 (post-DeepSeek-R1, post-GRPO adoption) broke that assumption:
- Distinct worker roles (Policy, Rollout Workers, Reward Model, Reference Model) — no single step function covers them.
- End-to-end coordination — handing off prompts → trajectories → rewards → advantages → gradients (see the sketch after this list).
- Phase-aware resource allocation — the cluster partition for rollout throughput differs from the one for policy updates.
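A toy shape contrast makes the mismatch concrete. This is stub Python only, nothing framework-specific: SFT collapses to one step function every rank can run, while one on-policy iteration is a chain of handoffs across four roles.

```python
# Toy stubs, not real training code: contrasts the two workload shapes.

def sft_step(batch, params):
    # Pure SPMD: every rank runs exactly this on its own data shard; an
    # allreduce over gradients is the only coordination required.
    grads = [0.0 for _ in params]                           # stand-in for backward()
    return [p - 0.1 * g for p, g in zip(params, grads)]

def rl_iteration(prompts, params):
    # Four distinct roles chained by handoffs; no single per-rank step
    # function covers them, and each phase wants different resources.
    trajectories = [p + " <completion>" for p in prompts]   # rollout
    rewards = [float(len(t)) for t in trajectories]         # reward model
    ref_logps = [0.0 for _ in trajectories]                 # reference model
    advantages = [r - l for r, l in zip(rewards, ref_logps)]
    grads = [sum(advantages) for _ in params]               # policy update
    return [p - 0.1 * g for p, g in zip(params, grads)]
```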
Two anti-patterns to avoid:
- Force RL into pure SPMD: the workload is the wrong shape for a single replicated step function, and it won't scale.
- Replace SPMD entirely with single-controller: SFT workloads then pay RL's orchestration cost they don't need, and the existing SFT path breaks.
## Solution
Layer single-controller orchestration over SPMD sub-stages:
┌──────────────────────────────────────────┐
│ Framework user API (unified SFT + RL) │
├──────────────────────────────────────────┤
│ Single-controller layer (Verl / Ray) │ ← active for RL
│ - actor lifecycle │
│ - GPU resource allocation per phase │
│ - control-plane handoffs │
├──────────────────────────────────────────┤
│ SPMD sub-stages │
│ - rollout workers (SPMD) │
│ - reward model scoring (SPMD) │
│ - reference model inference (SPMD) │
│ - policy update (SPMD) │
└──────────────────────────────────────────┘
┌──────────────────────────────────────────┐
│ SFT path (unchanged pure SPMD) │
│ - every worker runs same step fn │
│ - collectives at sync points │
└──────────────────────────────────────────┘
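The layering above maps naturally onto Ray actors, which is what Verl's backend builds on. Below is a minimal sketch of the RL path, assuming Ray is installed and one GPU per role is available; the class and method names are illustrative stand-ins, not Verl's API, and each role is a single actor here where a real deployment runs a group of SPMD workers per role.

```python
# Sketch: single-controller driver over SPMD sub-stage actors via plain
# Ray. One actor per role for brevity; in practice each role is a group
# of SPMD workers. Names are illustrative, not Verl's API.
import ray

ray.init()

@ray.remote(num_gpus=1)   # assumes one GPU is available per role
class RolloutWorkers:
    def generate(self, prompts):
        # SPMD generation over this group's prompt shard would go here.
        return [p + " <completion>" for p in prompts]

@ray.remote(num_gpus=1)
class RewardModel:
    def score(self, trajectories):
        return [float(len(t)) for t in trajectories]   # toy scores

@ray.remote(num_gpus=1)
class ReferenceModel:
    def logprobs(self, trajectories):
        return [0.0 for _ in trajectories]             # toy log-probs

@ray.remote(num_gpus=1)
class PolicyWorkers:
    def update(self, trajectories, rewards, ref_logps):
        return {"kl": 0.0, "loss": 0.0}                # toy update stats

def rl_step(rollout, reward, reference, policy, prompts):
    """One on-policy iteration. The controller owns lifecycle and
    handoffs; each sub-stage runs as an SPMD workload in its actor."""
    trajs = ray.get(rollout.generate.remote(prompts))
    score_ref = reward.score.remote(trajs)       # reward and reference
    logp_ref = reference.logprobs.remote(trajs)  # scoring can overlap
    scores, logps = ray.get([score_ref, logp_ref])
    return ray.get(policy.update.remote(trajs, scores, logps))

rollout, reward = RolloutWorkers.remote(), RewardModel.remote()
reference, policy = ReferenceModel.remote(), PolicyWorkers.remote()
print(rl_step(rollout, reward, reference, policy, ["2+2="]))
```

Note the shape: `rl_step` is ordinary driver code on one process, while everything inside the actors stays SPMD.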
Key design commitments:
- Unified user API. Developers move between SFT and RL without adopting an entirely different mental model or API set.
- Integrate OSS for the orchestration layer. Don't reinvent Ray actor lifecycle management or GPU resource allocation; use Verl (or an equivalent). Your value-add lives in the modelling surface area.
- Keep SFT fast. Don't route SFT through the single-controller layer; it would pay orchestration overhead for nothing.
- Reallocate resources phase-by-phase. The controller manages GPU resources across RL phases — a cluster isn't statically partitioned into "Policy GPUs" and "Rollout GPUs"; those roles shift as phases change (see the placement-group sketch after this list).
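One concrete way to realise the last commitment on Ray is to reserve a placement group per phase and release it when the phase ends. This is a hedged sketch: Verl manages this internally, and the GPU counts and `launch_*` callables below are invented for illustration.

```python
# Sketch: per-phase GPU reservation with Ray placement groups. Phase
# sizes and the phase_fn callables are made-up examples; Verl's internal
# resource management is more involved.
import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init()

def run_phase(num_gpus, phase_fn):
    # Reserve GPUs for this phase only and release them when it ends,
    # so the same cluster can be rollout-heavy in one phase and
    # update-heavy in the next instead of statically partitioned.
    pg = placement_group([{"GPU": 1}] * num_gpus, strategy="PACK")
    ray.get(pg.ready())            # block until the reservation holds
    try:
        return phase_fn(pg)        # phase launches its actors in the pg
    finally:
        remove_placement_group(pg)

# e.g. 12 of 16 GPUs for rollout throughput, then all 16 for the update:
# trajectories = run_phase(12, launch_rollout)
# stats        = run_phase(16, launch_policy_update)
```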
## Applicability
- ✅ LLM post-training platforms that need to support both SFT and on-policy RL.
- ✅ Any framework where the OSS ecosystem already provides a mature orchestration layer (Verl for RL; comparable libraries may emerge for other multi-stage workloads).
- ❌ Platforms that will only ever do SFT / DPO / KD — pure SPMD is simpler.
- ❌ Platforms that will only ever do on-policy RL — you still want the SPMD sub-stages, but you don't need the hybrid user API.
## Trade-offs
| Benefit | Cost |
|---|---|
| Single framework for SFT + RL | Two execution paths to maintain |
| SFT keeps its existing perf profile | Single-controller layer introduces orchestration concepts (actors, roles, phases) developers must understand when they touch RL |
| Reuses OSS orchestration (Verl) | Tight coupling to the OSS project's API stability |
| Resources can reallocate per RL phase | Cluster scheduler complexity |
## Consequences
- The framework team focuses its modelling work on Data/Model/Compute abstractions; orchestration concerns live in the OSS layer.
- SFT → DPO → KD are all dense, immediate-signal workloads (see concepts/on-policy-rl-vs-sft-signal-shape) and run on the SPMD path.
- On-policy RL (GRPO, PPO) runs under the single-controller layer (routing sketched below).
- Future multi-stage training methods (iterative distillation loops, multi-teacher distillation, agent-RL) fit under the same single-controller path without further architectural change.
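A sketch of how the unified entry point can route these workload families, per the consequences above. The `Job` fields, trainer names, and algorithm sets are hypothetical; only the routing split (dense-signal workloads to pure SPMD, on-policy RL to the controller path) comes from the pattern.

```python
# Hedged sketch of the unified-API routing commitment. Names here are
# invented for illustration; the actual API surface is not documented.
from dataclasses import dataclass

SPMD_ALGOS = {"sft", "dpo", "kd"}       # dense, immediate signal
CONTROLLER_ALGOS = {"grpo", "ppo"}      # on-policy RL

@dataclass
class Job:
    algo: str
    model: str
    dataset: str

def launch(job: Job):
    if job.algo in SPMD_ALGOS:
        return run_spmd(job)                 # unchanged pure-SPMD path
    if job.algo in CONTROLLER_ALGOS:
        return run_single_controller(job)    # Verl/Ray orchestration
    raise ValueError(f"unknown algo: {job.algo}")

def run_spmd(job):               # placeholder for the existing SFT trainer
    ...

def run_single_controller(job): # placeholder for the Verl-backed path
    ...

# launch(Job(algo="sft", model="llama", dataset="chat-sft"))
# launch(Job(algo="grpo", model="llama", dataset="math-prompts"))
```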
## Known uses
- Netflix Post-Training Framework (2026-02) — canonical instance. Integrates Verl's Ray-actor backend; keeps SFT on its original SPMD path.