
PATTERN

Hybrid single-controller + SPMD RL execution

Intent

Support both SFT and on-policy RL (GRPO, PPO) workflows in a single LLM post-training framework without forcing SFT users to pay RL's orchestration overhead: run SFT under a pure SPMD model, and run each RL sub-stage (rollout, reward scoring, reference inference, policy update) as an SPMD workload underneath a single-controller orchestration layer.

First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix — Netflix integrates Verl's Ray-based actor-lifecycle and GPU-resource-allocation backend as the single-controller layer, keeps its existing SPMD SFT path, and exposes a unified user API.

Problem

An LLM post-training platform built around a pure SPMD execution model (as Netflix's was initially) fits SFT cleanly — every GPU runs the same step function on different data, synchronised by collectives. But on-policy RL in 2025 (post-DeepSeek-R1, post-GRPO adoption) broke that assumption:

  • Distinct worker roles (Policy, Rollout Workers, Reward Model, Reference Model) — no single step function covers them.
  • End-to-end coordination — handing off prompts → trajectories → rewards → advantages → gradients.
  • Phase-aware resource allocation — the cluster partition for rollout throughput differs from the one for policy updates.
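The control-plane handoffs in the second bullet can be made concrete with a minimal sketch. The function names below (`rollout`, `score`, `group_relative_advantages`, `one_rl_step`) are illustrative stand-ins, not Verl's API; in a real system each stage is an SPMD workload spanning many GPUs, and the reference model's log-probs would additionally feed a KL term in the policy update (omitted here for brevity).

```python
from statistics import mean, pstdev

def rollout(prompt: str, group_size: int) -> list[str]:
    # Rollout workers: sample `group_size` completions per prompt
    # (GRPO-style grouping). Fake completions stand in for generation.
    return [f"{prompt}::sample{i}" for i in range(group_size)]

def score(trajectories: list[str]) -> list[float]:
    # Reward model scoring. Fake reward = completion length.
    return [float(len(t)) for t in trajectories]

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # GRPO: advantage = (reward - group mean) / group std.
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid divide-by-zero on identical rewards
    return [(r - mu) / sigma for r in rewards]

def one_rl_step(prompt: str, group_size: int = 4) -> list[float]:
    # prompts -> trajectories -> rewards -> advantages (-> gradients)
    trajectories = rollout(prompt, group_size)
    rewards = score(trajectories)
    return group_relative_advantages(rewards)  # would feed the policy update
```

No single step function covers these stages, which is exactly why pure SPMD cannot express the loop: the controller must hand data across role boundaries.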

Two anti-patterns to avoid:

  1. Force RL into pure SPMD: the workload is fundamentally the wrong shape (heterogeneous roles, multi-phase control flow) and won't scale.
  2. Replace SPMD entirely with single-controller: pays RL's orchestration cost on SFT workloads, which don't need it, and breaks the existing SFT path.

Solution

Layer single-controller orchestration over SPMD sub-stages:

┌──────────────────────────────────────────┐
│  Framework user API (unified SFT + RL)   │
├──────────────────────────────────────────┤
│  Single-controller layer (Verl / Ray)    │  ← active for RL
│    - actor lifecycle                     │
│    - GPU resource allocation per phase   │
│    - control-plane handoffs              │
├──────────────────────────────────────────┤
│  SPMD sub-stages                         │
│    - rollout workers    (SPMD)           │
│    - reward model scoring   (SPMD)       │
│    - reference model inference  (SPMD)   │
│    - policy update  (SPMD)               │
└──────────────────────────────────────────┘

┌──────────────────────────────────────────┐
│  SFT path (unchanged pure SPMD)          │
│    - every worker runs same step fn      │
│    - collectives at sync points          │
└──────────────────────────────────────────┘
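The layering above can be sketched in a few lines of plain Python, with ordinary classes standing in for Ray actors. All names here (`SpmdStage`, `SingleController`, `run_sft_step`) are illustrative assumptions, not Verl's or Ray's API.

```python
class SpmdStage:
    """Stand-in for one SPMD sub-stage (many GPUs, one program)."""
    def __init__(self, name: str):
        self.name = name

    def run(self, batch: str) -> str:
        # Real stage: launch the SPMD program on its GPU allocation.
        return f"{self.name}({batch})"

class SingleController:
    """Orchestrates the RL sub-stages; the SFT path never touches this."""
    def __init__(self):
        self.stages = [SpmdStage(n) for n in
                       ("rollout", "reward", "reference", "policy_update")]

    def run_rl_step(self, batch: str) -> str:
        # Control-plane handoffs between SPMD sub-stages.
        for stage in self.stages:
            batch = stage.run(batch)
        return batch

def run_sft_step(batch: str) -> str:
    # Pure SPMD: every worker runs the same step fn, synced by collectives.
    return f"sft_step({batch})"
```

The point of the sketch is the asymmetry: `run_sft_step` is a single function with no controller in the loop, while the RL step is a sequence of handoffs the controller owns.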

Key design commitments:

  1. Unified user API. Developers move between SFT and RL without adopting an entirely different mental model or API set.
  2. Integrate OSS for the orchestration layer. Don't reinvent Ray-actor-lifecycle / GPU resource allocation — use Verl (or equivalent). Your value-add lives in the modelling surface area.
  3. Keep SFT fast. Don't route SFT through the single-controller layer; it pays orchestration overhead for nothing.
  4. Reallocate resources phase-by-phase. The controller manages GPU resources across RL phases — a cluster isn't statically partitioned into "Policy GPUs" and "Rollout GPUs"; those roles shift.
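Commitment 4 can be illustrated with a hypothetical allocator that repartitions one GPU pool per phase. The 75/25 rollout split and the phase names are assumptions for the sketch, not values from Netflix's or Verl's implementation.

```python
def allocate(gpus: list[int], phase: str) -> dict[str, list[int]]:
    """Repartition the same GPU pool for the current RL phase."""
    if phase == "rollout":
        # Rollout is generation-throughput-bound: most GPUs go to
        # rollout workers, the rest to reward-model scoring.
        split = int(len(gpus) * 0.75)
        return {"rollout": gpus[:split], "reward": gpus[split:]}
    if phase == "update":
        # The policy update wants the whole pool for its training collective.
        return {"policy": list(gpus)}
    raise ValueError(f"unknown phase: {phase}")
```

Note that the same physical GPUs appear under different roles in different phases; there is no static "Rollout GPU" set.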

Applicability

  • ✅ LLM post-training platforms that need to support both SFT and on-policy RL.
  • ✅ Any framework where the OSS ecosystem already provides a mature orchestration layer (Verl for RL; comparable libraries may emerge for other multi-stage workloads).
  • ❌ Platforms that will only ever do SFT / DPO / KD — pure SPMD is simpler.
  • ❌ Platforms that will only ever do on-policy RL — you still want the SPMD sub-stages, but you don't need the hybrid user API.

Trade-offs

| Benefit | Cost |
| --- | --- |
| Single framework for SFT + RL | Two execution paths to maintain |
| SFT keeps its existing perf profile | Single-controller layer introduces orchestration concepts (actors, roles, phases) that developers must understand when they touch RL |
| Reuses OSS orchestration (Verl) | Tight coupling to the OSS project's API stability |
| Resources can reallocate per RL phase | Cluster scheduler complexity |

Consequences

  • Framework team focuses modelling work on Data/Model/Compute abstractions; orchestration concerns live in the OSS layer.
  • SFT → DPO → KD are all dense+immediate signal workloads (concepts/on-policy-rl-vs-sft-signal-shape) and run on the SPMD path.
  • On-policy RL (GRPO, PPO) runs under the single-controller layer.
  • Future multi-stage training methods (iterative distillation loops, multi-teacher distillation, agent-RL) fit under the same single-controller path without further architectural change.
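The routing rule implied by these consequences can be written down directly: dense, immediate-signal workloads take the SPMD path, multi-stage on-policy methods take the single-controller path. The workload names and path constants below are assumptions for illustration.

```python
SPMD_PATH = "spmd"
CONTROLLER_PATH = "single_controller"

def route(workload: str) -> str:
    """Pick the execution path by the workload's signal shape."""
    dense_immediate = {"sft", "dpo", "kd"}       # dense + immediate signal
    multi_stage = {"grpo", "ppo", "agent_rl"}    # multi-role, phased
    if workload in dense_immediate:
        return SPMD_PATH
    if workload in multi_stage:
        return CONTROLLER_PATH
    raise ValueError(f"unknown workload: {workload}")
```

New multi-stage methods extend the `multi_stage` set; the SPMD path is untouched, which is the "no further architectural change" claim above.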

Known uses
