CONCEPT Cited by 1 source
Single-controller RL orchestration¶
Definition¶
Single-controller RL orchestration is an execution model for on-policy RL post-training in which the driver node stops being a passive launcher of identical workers and becomes an active controller that encodes the control plane — when to generate rollouts, how to batch and score them, when to trigger optimization steps, and how to reallocate cluster resources across RL phases. It contrasts with the pure-SPMD execution model that works for SFT.
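For contrast, here is a minimal sketch of the pure-SPMD shape that suffices for SFT; the DDP wrapping and the HF-style `.loss` attribute are illustrative assumptions, not details from the Netflix post. Every rank runs the identical script, and the launcher (e.g. torchrun) only spawns processes.

```python
# Pure SPMD: the launcher spawns N identical copies of this script and holds no
# further logic; every rank runs the same step function in lockstep.
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def sft_main(model, dataloader, optimizer):
    dist.init_process_group("nccl")     # identical call on every rank (env set by torchrun)
    model = DDP(model.cuda())
    for batch in dataloader:            # same loop, same step, on every rank
        loss = model(**batch).loss      # assumes an HF-style model that returns .loss
        loss.backward()                 # DDP all-reduces gradients during backward
        optimizer.step()
        optimizer.zero_grad()
```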
First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix, where Netflix documents its shift from SPMD to single-controller (backed by Verl's Ray-actor orchestration) once RL support became a first-class framework requirement.
Why RL needs a single controller¶
Netflix's articulation of the problem:
"On-policy RL changes the shape of the system. The learning signal is typically sparse and delayed (e.g., a scalar reward at the end of an episode), and the training step depends on data generated by the current policy. Individual sub-stages — policy updates, rollout generation, reference model inference, reward model scoring — can each be implemented as SPMD workloads, but the end-to-end algorithm needs explicit coordination: you're constantly handing off artifacts (prompts, sampled trajectories, rewards, advantages) across stages and synchronizing their lifecycle." (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)
The invariants that break SPMD:
- No single step function covers the whole algorithm. Rollout generation, reward scoring, reference inference, and policy updates are different computations running on different subsets of workers at different times.
- Data dependency on the current policy's outputs. You can't pipeline or lockstep-synchronize, because what the Rollout Workers produce depends on what the last Policy update changed, and Reward Model scoring consumes those rollouts before optimization can begin.
- GPU resource allocation shifts across phases. A cluster configured for rollout throughput is not configured for policy-update throughput; the controller has to reallocate (see the sketch just below).
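A toy sketch of that third point (the GPU counts and the `reallocate` hook are illustrative assumptions, not Verl's or Ray's API): the controller keeps a per-phase resource plan and reshapes the cluster before each phase instead of fixing one layout for the whole job.

```python
# Phase-aware allocation: a layout tuned for rollout throughput (inference-heavy)
# is a poor layout for the policy update (optimizer/sharding-heavy), so the
# controller re-splits GPUs when the RL phase changes.
from dataclasses import dataclass

@dataclass
class GpuSplit:
    rollout: int   # GPUs serving trajectory generation
    trainer: int   # GPUs running the policy optimizer

# Illustrative numbers for a 64-GPU job; not taken from the post.
PHASE_PLAN = {
    "generate_rollouts": GpuSplit(rollout=48, trainer=16),
    "policy_update":     GpuSplit(rollout=8,  trainer=56),
}

class Cluster:
    """Stub standing in for whatever actually migrates actors between GPUs."""
    def reallocate(self, split: GpuSplit) -> None:
        print(f"rollout={split.rollout} trainer={split.trainer}")

def enter_phase(phase: str, cluster: Cluster) -> None:
    cluster.reallocate(PHASE_PLAN[phase])   # controller-side hook, assumed

if __name__ == "__main__":
    cluster = Cluster()
    enter_phase("generate_rollouts", cluster)
    enter_phase("policy_update", cluster)
```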
Anatomy of the controller¶
Under the single-controller pattern:
- Driver becomes active. Not a launcher — a scheduler.
- Workers decompose into roles. Policy / Rollout Workers / Reward Model / Reference Model — each an SPMD workload underneath, but a distinct role from the controller's perspective.
- Control-plane handoffs are explicit. Prompts → trajectories → rewards → advantages → policy-update gradients. The controller threads these handoffs end to end (see the sketch after this list).
- Resource allocation is phase-aware. Which actors get which GPUs changes with the RL phase.
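Putting those four bullets together, here is a condensed sketch of the controller with roles as Ray actors; the class names, method signatures, and the decision to fold advantage computation into `update` are assumptions in the spirit of Verl's design, not its actual API.

```python
# The driver below *is* the controller: it decides when each role runs and
# threads the artifacts (prompts -> trajectories -> rewards -> update) between them.
import ray

@ray.remote
class PolicyTrainer:
    def update(self, trajectories, rewards, ref_logprobs): ...  # advantages + optimizer step
    def get_weights(self): ...                                  # exported for rollout sync

@ray.remote
class RolloutWorker:
    def load_weights(self, weights): ...   # keeps generation on-policy
    def generate(self, prompts): ...       # samples trajectories with the current policy

@ray.remote
class RewardModel:
    def score(self, trajectories): ...     # scalar reward per trajectory

@ray.remote
class ReferenceModel:
    def logprobs(self, trajectories): ...  # e.g. for a KL penalty against the reference

def controller_loop(prompt_batches):
    policy, rollout = PolicyTrainer.remote(), RolloutWorker.remote()
    reward, reference = RewardModel.remote(), ReferenceModel.remote()
    for prompts in prompt_batches:
        # The training step depends on data from the *current* policy,
        # so fresh weights are pushed to the rollout workers first.
        ray.get(rollout.load_weights.remote(policy.get_weights.remote()))
        trajs = rollout.generate.remote(prompts)              # prompts -> trajectories
        rews = reward.score.remote(trajs)                     # trajectories -> rewards
        refs = reference.logprobs.remote(trajs)               # trajectories -> ref logprobs
        ray.get(policy.update.remote(trajs, rews, refs))      # -> policy update
```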
Relationship to SPMD (it's hybrid, not a replacement)¶
Single-controller is layered over SPMD, not instead of it. The Netflix framework integrates Verl's Ray-actor lifecycle + GPU-resource-allocation backend as the controller layer, and sub-stages (rollout, reward, reference, policy update) remain SPMD. SFT continues using the original pure-SPMD path. The user-facing API is unified, so developers "move between SFT and RL workflows without adopting an entirely different mental model or API set." See patterns/hybrid-single-controller-plus-spmd-rl.
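A minimal sketch of that layering, assuming a hypothetical `PolicyRole`/`PolicyRank` split rather than Verl's real worker-group API: the controller holds one handle per role, and a role-level call fans out to the SPMD ranks underneath.

```python
# Hybrid layering: the controller sees a single role object; the role itself is
# a group of SPMD ranks and fans each call out to all of them. In a real system
# every rank would hold a model shard and join collectives; this is a placeholder.
import ray

@ray.remote
class PolicyRank:
    def update(self, shard):
        return len(shard)                      # stands in for one rank's training step

class PolicyRole:
    """Controller-facing handle: looks like one worker, is an SPMD group."""
    def __init__(self, world_size: int):
        self.ranks = [PolicyRank.remote() for _ in range(world_size)]

    def update(self, batch):
        shards = [batch[i::len(self.ranks)] for i in range(len(self.ranks))]
        return ray.get([r.update.remote(s) for r, s in zip(self.ranks, shards)])

if __name__ == "__main__":
    ray.init()
    policy = PolicyRole(world_size=4)          # the controller only ever sees `policy`
    print(policy.update(list(range(32))))      # -> [8, 8, 8, 8]
```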
Why this is an industry-wide shift¶
The post situates the trend:
"With DeepSeek-R1 and the broader adoption of efficient on-policy RL methods like GRPO, SFT became table stakes rather than the finish line. Staying close to the frontier required infrastructure that could move from 'offline training loop' to 'multi-stage, on-policy orchestration.'" (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)
Any LLM post-training platform built before ~2025 against an SFT-only SPMD assumption faces the same refactor.