

Single-controller RL orchestration

Definition

Single-controller RL orchestration is an execution model for on-policy RL post-training in which the driver node stops being a passive launcher of identical workers and becomes an active controller that encodes the control plane — when to generate rollouts, how to batch and score them, when to trigger optimization steps, and how to reallocate cluster resources across RL phases. It contrasts with the pure-SPMD execution model that works for SFT.

First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix — where Netflix documents their shift from SPMD to single-controller (backed by Verl's Ray-actor orchestration) when RL support became a first-class framework requirement.

Why RL needs a single controller

Netflix's articulation of the problem:

"On-policy RL changes the shape of the system. The learning signal is typically sparse and delayed (e.g., a scalar reward at the end of an episode), and the training step depends on data generated by the current policy. Individual sub-stages — policy updates, rollout generation, reference model inference, reward model scoring — can each be implemented as SPMD workloads, but the end-to-end algorithm needs explicit coordination: you're constantly handing off artifacts (prompts, sampled trajectories, rewards, advantages) across stages and synchronizing their lifecycle." (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)

The invariant that breaks SPMD:

  • No single step function covers the whole algorithm. Rollout generation, reward scoring, reference inference, and policy updates are different computations running on different subsets of workers at different times.
  • Data dependency on the current policy's outputs. You can't pipeline or lockstep-synchronize the phases, because what the Rollout Workers produce depends on what the last Policy update changed, and Reward Model scoring must consume those rollouts before optimization can begin.
  • GPU resource allocation shifts across phases. A cluster configured for rollout throughput is not configured for policy-update throughput; a controller has to reallocate.
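The coordination these invariants force can be sketched from the driver's perspective. This is a minimal illustration, not Verl's or Netflix's actual API: the role objects, their method names, and the simplified GRPO-style group-mean baseline are all hypothetical stand-ins.

```python
# Minimal single-controller driver loop: the controller explicitly threads
# artifacts (prompts -> trajectories -> rewards -> advantages) across phases
# instead of running one replicated SPMD step function.

def compute_advantages(rewards):
    # GRPO-style group baseline: reward minus the group mean
    # (per-group std normalization omitted for brevity).
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def rl_step(prompts, policy, rollout_workers, reward_model, reference):
    # Phase 1: rollouts depend on the *current* policy weights, so the
    # controller pushes the latest weights before generation begins.
    rollout_workers.sync_weights(policy.get_weights())
    trajectories = rollout_workers.generate(prompts)

    # Phase 2: scoring consumes the fresh trajectories; reference-model
    # logprobs are gathered for the KL term used by the policy update.
    rewards = reward_model.score(trajectories)
    ref_logprobs = reference.logprobs(trajectories)

    # Phase 3: the controller assembles advantages and triggers the update.
    advantages = compute_advantages(rewards)
    return policy.update(trajectories, advantages, ref_logprobs)
```

Note that no single role executes this function; it lives on the driver, which is exactly what makes the pattern "single-controller" rather than SPMD.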

Anatomy of the controller

Under the single-controller pattern:

  1. Driver becomes active. Not a launcher — a scheduler.
  2. Workers decompose into roles. Policy / Rollout Workers / Reward Model / Reference Model — each an SPMD workload underneath, but a distinct role from the controller's perspective.
  3. Control-plane handoffs are explicit. Prompts → trajectories → rewards → advantages → policy-update gradients. The controller threads these.
  4. Resource allocation is phase-aware. Which actors get which GPUs changes with the RL phase.
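Point 4 can be made concrete with a toy allocation table. The phase names, role names, and GPU counts below are illustrative assumptions; in Verl-backed deployments this kind of reallocation is handled by the Ray resource layer, not a hand-written dict.

```python
# Sketch of phase-aware GPU reallocation as seen from the controller.
# A cluster tuned for rollout throughput gives most GPUs to inference;
# the update phase flips that ratio toward the policy trainer.
PHASE_ALLOCATION = {
    "rollout": {"rollout_workers": 12, "reward_model": 2, "reference": 2, "policy": 0},
    "scoring": {"rollout_workers": 0,  "reward_model": 8, "reference": 8, "policy": 0},
    "update":  {"rollout_workers": 0,  "reward_model": 0, "reference": 0, "policy": 16},
}

def allocate(phase, total_gpus=16):
    # The controller looks up the plan for the current RL phase and
    # checks it fits the cluster before handing actors their GPUs.
    plan = PHASE_ALLOCATION[phase]
    assert sum(plan.values()) <= total_gpus, "over-subscribed cluster"
    return plan
```

The point of the sketch is that the mapping from roles to GPUs is a function of the phase, which no static SPMD launch configuration can express.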

Relationship to SPMD (it's hybrid, not a replacement)

Single-controller is layered over SPMD, not instead of it. The Netflix framework integrates Verl's Ray-actor lifecycle + GPU-resource-allocation backend as the controller layer, and sub-stages (rollout, reward, reference, policy update) remain SPMD. SFT continues using the original pure-SPMD path. The user-facing API is unified, so developers "move between SFT and RL workflows without adopting an entirely different mental model or API set." See patterns/hybrid-single-controller-plus-spmd-rl.
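One way to picture the unified user-facing API is a single entry point that dispatches to either execution model. The class and method names here are invented for illustration and do not reflect the Netflix framework's actual interface.

```python
# Illustrative unified entry point: SFT keeps the original pure-SPMD path,
# while RL routes through the single-controller layer underneath the same API.
class PostTrainer:
    def __init__(self, kind):
        if kind not in ("sft", "rl"):
            raise ValueError(f"unknown job kind: {kind}")
        self.kind = kind

    def execution_model(self):
        # Same developer-facing surface either way; only the backend differs.
        if self.kind == "sft":
            return "pure-spmd"          # identical workers, one step function
        return "single-controller"      # driver schedules role-specific SPMD stages
```

This is the "hybrid, not a replacement" point in miniature: the RL branch still runs SPMD sub-stages, but behind a controller.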

Why this is an industry-wide shift

The post situates the trend:

"With DeepSeek-R1 and the broader adoption of efficient on-policy RL methods like GRPO, SFT became table stakes rather than the finish line. Staying close to the frontier required infrastructure that could move from 'offline training loop' to 'multi-stage, on-policy orchestration.'" (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)

Any LLM post-training platform built before ~2025 against an SFT-only SPMD assumption faces the same refactor.
