Verl¶
Verl (github.com/verl-project/verl) is an open-source library for orchestrating distributed RL post-training workloads on top of Ray. It provides Ray-actor lifecycle management and GPU resource allocation for the distinct roles in on-policy RL — Policy, Rollout Workers, Reward Model, Reference Model — so that frameworks using it can focus on the modelling surface area rather than the control plane.
First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix, where Netflix describes integrating Verl's core infrastructure into its Post-Training Framework to support RL without reinventing distributed orchestration.
Role¶
On-policy RL breaks the pure-SPMD execution model that SFT inherits from pre-training: individual sub-stages (policy updates, rollout generation, reference model inference, reward model scoring) can each be SPMD, but the end-to-end algorithm requires explicit coordination — passing prompts/trajectories/rewards/advantages across stages and synchronising their lifecycle.
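The stage structure above can be made concrete with a toy sketch of one on-policy iteration. All class and method names here are hypothetical stand-ins (not Verl's API); the point is that the driver must explicitly hand trajectories, log-probabilities, and rewards between sub-stages that could each be SPMD internally.

```python
def rl_step(prompts, policy, rollout, reward_model, reference, beta=0.1):
    """One on-policy iteration, sequenced by a single driver (illustrative)."""
    # Stage 1: rollout generation (internally SPMD in a real system).
    trajectories = rollout.generate(prompts)
    # Stage 2: reference- and policy-model inference for a KL penalty.
    ref_logprobs = reference.logprobs(trajectories)
    pol_logprobs = policy.logprobs(trajectories)
    # Stage 3: reward scoring, KL-penalised against the reference model.
    rewards = [r - beta * (p - q)
               for r, p, q in zip(reward_model.score(trajectories),
                                  pol_logprobs, ref_logprobs)]
    # Stage 4: policy optimisation on the scored trajectories.
    policy.update(trajectories, rewards)
    return rewards

# Toy stand-ins so the sketch runs end-to-end; a real system would back
# each of these with a distributed worker group.
class Stub:
    def __init__(self, bias=0.0):
        self.bias = bias
    def generate(self, prompts):
        return [p + " <completion>" for p in prompts]
    def logprobs(self, trajectories):
        return [self.bias for _ in trajectories]
    def score(self, trajectories):
        return [1.0 for _ in trajectories]
    def update(self, trajectories, rewards):
        self.last_rewards = rewards
```

Even in this toy form, the driver owns the inter-stage data flow: nothing in any one stage knows when the others run, which is exactly the coordination a single controller supplies.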
Verl is the single-controller layer (see concepts/single-controller-rl-orchestration) that sits above Ray and handles:
- Ray actor lifecycle for Policy / Rollout / Reward / Reference workers.
- GPU resource allocation across those roles per phase.
- The control-plane logic: when to run rollouts, how to batch and score them, when to trigger optimisation, and how to reallocate cluster resources across phases.
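The last two responsibilities can be sketched with plain objects standing in for Ray actors. In Verl each role would be a Ray actor (created with `@ray.remote`) and the controller a driver process; the classes and GPU counts below are purely illustrative, showing how a single controller moves a fixed GPU budget between roles as phases change.

```python
class GpuPool:
    """Tracks which GPUs are held by which role in the current phase."""
    def __init__(self, n_gpus):
        self.free = set(range(n_gpus))
        self.held = {}  # role name -> set of GPU ids

    def allocate(self, role, n):
        gpus = {self.free.pop() for _ in range(n)}
        self.held[role] = gpus
        return gpus

    def release(self, role):
        self.free |= self.held.pop(role, set())

class Controller:
    """Single controller: sequences phases and reallocates GPUs between roles."""
    def __init__(self, n_gpus):
        self.pool = GpuPool(n_gpus)

    def run_iteration(self):
        phases = []
        # Rollout phase: most GPUs go to generation, some to reward scoring.
        self.pool.allocate("rollout", 6)
        self.pool.allocate("reward", 2)
        phases.append(dict(self.pool.held))
        # Reallocation: tear down rollout/reward, give everything to the
        # policy workers for the optimisation phase.
        self.pool.release("rollout")
        self.pool.release("reward")
        self.pool.allocate("policy", 8)
        phases.append(dict(self.pool.held))
        self.pool.release("policy")
        return phases
```

The key property is that only the controller touches the pool: worker roles never negotiate resources with each other, which keeps the control plane in one place.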
Integration surface in Netflix's framework¶
Per the Netflix post:
"To add RL support without reinventing distributed orchestration from scratch, we integrated the core infrastructure from the open-source Verl library to manage Ray actor lifecycle and GPU resource allocation. Leveraging Verl's backend let us focus on the 'modeling surface area' — our Data/Model/Compute abstractions and internal optimizations — while keeping orchestration concerns decoupled. The result is a hybrid design: a unified user interface where developers can move between SFT and RL workflows without adopting an entirely different mental model or API set."
This is a textbook case of the patterns/hybrid-single-controller-plus-spmd-rl pattern — Netflix's framework layers its modelling API on Verl's orchestration, while SFT continues using the original SPMD path.
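A minimal sketch of what such a hybrid, unified interface could look like, assuming a single trainer facade whose SFT path stays pure SPMD while the RL path delegates to a Verl-like controller. All names are hypothetical; Netflix's actual API is not described in the source beyond the quote above.

```python
class Trainer:
    """One entry point over shared Data/Model abstractions (illustrative)."""
    def __init__(self, model, data):
        self.model, self.data = model, data

    def fit(self, algorithm="sft", **cfg):
        # Same mental model and call site for both workflows.
        if algorithm == "sft":
            return self._fit_spmd(**cfg)   # original pure-SPMD path
        if algorithm == "rl":
            return self._fit_rl(**cfg)     # Verl-orchestrated path
        raise ValueError(f"unknown algorithm: {algorithm}")

    def _fit_spmd(self, epochs=1):
        # Placeholder for the SPMD training loop inherited from SFT.
        return {"path": "spmd", "epochs": epochs}

    def _fit_rl(self, iterations=1):
        # Placeholder for handing control to the single controller that
        # manages Ray actors for the Policy/Rollout/Reward/Reference roles.
        return {"path": "single-controller", "iterations": iterations}
```

The orchestration difference is hidden behind the dispatch, which is the decoupling the quote describes: modelling code sees one API, and only the RL branch pulls in the control plane.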