CONCEPT Cited by 1 source
Single-controller RL orchestration¶
Definition¶
Single-controller RL orchestration is an execution model for on-policy RL post-training in which the driver node stops being a passive launcher of identical workers and becomes an active controller that encodes the control plane — when to generate rollouts, how to batch and score them, when to trigger optimization steps, and how to reallocate cluster resources across RL phases. It contrasts with the pure-SPMD execution model that works for SFT.
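For contrast, here is a minimal sketch of the pure-SPMD shape that suffices for SFT; the DDP wrapping and the HF-style `.loss` attribute are illustrative assumptions, not details from the Netflix post. Every rank runs the identical script, and the launcher (e.g. torchrun) only spawns processes.

```python
# Pure SPMD: the launcher spawns N identical copies of this script and holds no
# further logic; every rank runs the same step function in lockstep.
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def sft_main(model, dataloader, optimizer):
    dist.init_process_group("nccl")     # identical call on every rank (env set by torchrun)
    model = DDP(model.cuda())
    for batch in dataloader:            # same loop, same step, on every rank
        loss = model(**batch).loss      # assumes an HF-style model that returns .loss
        loss.backward()                 # DDP all-reduces gradients during backward
        optimizer.step()
        optimizer.zero_grad()
```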
First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix, where Netflix documents its shift from SPMD to single-controller (backed by Verl's Ray-actor orchestration) once RL support became a first-class framework requirement.
Why RL needs a single controller¶
Netflix's articulation of the problem:
"On-policy RL changes the shape of the system. The learning signal is typically sparse and delayed (e.g., a scalar reward at the end of an episode), and the training step depends on data generated by the current policy. Individual sub-stages — policy updates, rollout generation, reference model inference, reward model scoring — can each be implemented as SPMD workloads, but the end-to-end algorithm needs explicit coordination: you're constantly handing off artifacts (prompts, sampled trajectories, rewards, advantages) across stages and synchronizing their lifecycle." (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)
The invariants that break SPMD:
- No single step function covers the whole algorithm. Rollout generation, reward scoring, reference inference, and policy updates are different computations running on different subsets of workers at different times.
- Data dependency on the current policy's outputs. You can't pipeline or lockstep-synchronize, because what the Rollout Workers produce depends on what the last Policy update changed, and Reward Model scoring consumes those rollouts before optimization can begin.
- GPU resource allocation shifts across phases. A cluster configured for rollout throughput is not configured for policy-update throughput; the controller has to reallocate (see the sketch just below).
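A toy sketch of that third point (the GPU counts and the `reallocate` hook are illustrative assumptions, not Verl's or Ray's API): the controller keeps a per-phase resource plan and reshapes the cluster before each phase instead of fixing one layout for the whole job.

```python
# Phase-aware allocation: a layout tuned for rollout throughput (inference-heavy)
# is a poor layout for the policy update (optimizer/sharding-heavy), so the
# controller re-splits GPUs when the RL phase changes.
from dataclasses import dataclass

@dataclass
class GpuSplit:
    rollout: int   # GPUs serving trajectory generation
    trainer: int   # GPUs running the policy optimizer

# Illustrative numbers for a 64-GPU job; not taken from the post.
PHASE_PLAN = {
    "generate_rollouts": GpuSplit(rollout=48, trainer=16),
    "policy_update":     GpuSplit(rollout=8,  trainer=56),
}

class Cluster:
    """Stub standing in for whatever actually migrates actors between GPUs."""
    def reallocate(self, split: GpuSplit) -> None:
        print(f"rollout={split.rollout} trainer={split.trainer}")

def enter_phase(phase: str, cluster: Cluster) -> None:
    cluster.reallocate(PHASE_PLAN[phase])   # controller-side hook, assumed

if __name__ == "__main__":
    cluster = Cluster()
    enter_phase("generate_rollouts", cluster)
    enter_phase("policy_update", cluster)
```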
Anatomy of the controller¶
Under the single-controller pattern:
- Driver becomes active. Not a launcher — a scheduler.
- Workers decompose into roles. Policy / Rollout Workers / Reward Model / Reference Model — each an SPMD workload underneath, but a distinct role from the controller's perspective.
- Control-plane handoffs are explicit. Prompts → trajectories → rewards → advantages → policy-update gradients. The controller threads these handoffs end to end (see the sketch after this list).
- Resource allocation is phase-aware. Which actors get which GPUs changes with the RL phase.
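Putting those four bullets together, here is a condensed sketch of the controller with roles as Ray actors; the class names, method signatures, and the decision to fold advantage computation into `update` are assumptions in the spirit of Verl's design, not its actual API.

```python
# The driver below *is* the controller: it decides when each role runs and
# threads the artifacts (prompts -> trajectories -> rewards -> update) between them.
import ray

@ray.remote
class PolicyTrainer:
    def update(self, trajectories, rewards, ref_logprobs): ...  # advantages + optimizer step
    def get_weights(self): ...                                  # exported for rollout sync

@ray.remote
class RolloutWorker:
    def load_weights(self, weights): ...   # keeps generation on-policy
    def generate(self, prompts): ...       # samples trajectories with the current policy

@ray.remote
class RewardModel:
    def score(self, trajectories): ...     # scalar reward per trajectory

@ray.remote
class ReferenceModel:
    def logprobs(self, trajectories): ...  # e.g. for a KL penalty against the reference

def controller_loop(prompt_batches):
    policy, rollout = PolicyTrainer.remote(), RolloutWorker.remote()
    reward, reference = RewardModel.remote(), ReferenceModel.remote()
    for prompts in prompt_batches:
        # The training step depends on data from the *current* policy,
        # so fresh weights are pushed to the rollout workers first.
        ray.get(rollout.load_weights.remote(policy.get_weights.remote()))
        trajs = rollout.generate.remote(prompts)              # prompts -> trajectories
        rews = reward.score.remote(trajs)                     # trajectories -> rewards
        refs = reference.logprobs.remote(trajs)               # trajectories -> ref logprobs
        ray.get(policy.update.remote(trajs, rews, refs))      # -> policy update
```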
Relationship to SPMD (it's hybrid, not a replacement)¶
Single-controller is layered over SPMD, not instead of it. The Netflix framework integrates Verl's Ray-actor lifecycle + GPU-resource-allocation backend as the controller layer, and sub-stages (rollout, reward, reference, policy update) remain SPMD. SFT continues using the original pure-SPMD path. The user-facing API is unified, so developers "move between SFT and RL workflows without adopting an entirely different mental model or API set." See patterns/hybrid-single-controller-plus-spmd-rl.
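A minimal sketch of that layering, assuming a hypothetical `PolicyRole`/`PolicyRank` split rather than Verl's real worker-group API: the controller holds one handle per role, and a role-level call fans out to the SPMD ranks underneath.

```python
# Hybrid layering: the controller sees a single role object; the role itself is
# a group of SPMD ranks and fans each call out to all of them. In a real system
# every rank would hold a model shard and join collectives; this is a placeholder.
import ray

@ray.remote
class PolicyRank:
    def update(self, shard):
        return len(shard)                      # stands in for one rank's training step

class PolicyRole:
    """Controller-facing handle: looks like one worker, is an SPMD group."""
    def __init__(self, world_size: int):
        self.ranks = [PolicyRank.remote() for _ in range(world_size)]

    def update(self, batch):
        shards = [batch[i::len(self.ranks)] for i in range(len(self.ranks))]
        return ray.get([r.update.remote(s) for r, s in zip(self.ranks, shards)])

if __name__ == "__main__":
    ray.init()
    policy = PolicyRole(world_size=4)          # the controller only ever sees `policy`
    print(policy.update(list(range(32))))      # -> [8, 8, 8, 8]
```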
Why this is an industry-wide shift¶
The post situates the trend:
"With DeepSeek-R1 and the broader adoption of efficient on-policy RL methods like GRPO, SFT became table stakes rather than the finish line. Staying close to the frontier required infrastructure that could move from 'offline training loop' to 'multi-stage, on-policy orchestration.'" (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)
Any LLM post-training platform built before ~2025 against an SFT-only SPMD assumption faces the same refactor.