CONCEPT Cited by 1 source

On-policy RL vs SFT signal shape

Definition

The signal shape of an LLM post-training method — whether the learning signal is dense and immediate (per-token, per-batch, differentiable end-to-end) or sparse and delayed (a scalar reward per episode, computed after rollout, non-differentiable through the reward source) — determines which distributed-execution model a training framework can support. Netflix uses this contrast to explain why its Post-Training Framework had to evolve from pure-SPMD to a hybrid execution model with a single-controller layer when adding RL.

First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix.

The two shapes

SFT (and pre-training, DPO, knowledge distillation): dense + immediate

  • For each token position, compute logits over the full vocabulary.
  • Compute a differentiable loss at each position.
  • Backpropagate end-to-end through the model in a single step.

Because the signal is dense and immediate, an SFT step looks identical across every worker in an SPMD cluster — every GPU runs the same forward/backward/optimize function on a different data shard, synchronizing through collectives. This maps cleanly onto pre-training infrastructure and scales by launching more identical workers.
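The dense per-token signal can be made concrete with a minimal NumPy sketch (names and shapes are illustrative, not from the Netflix framework): every position in the sequence contributes its own cross-entropy term, so one backward pass through the model covers the entire batch.

```python
import numpy as np

def per_token_loss(logits, targets):
    """Dense SFT signal: one cross-entropy term per token position.

    logits:  (seq_len, vocab_size) array of unnormalized scores
    targets: (seq_len,) array of target token ids
    Returns one loss per position; every position contributes a
    differentiable term, so a single backward pass updates the model.
    """
    # numerically stable log-softmax over the vocabulary at each position
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # negative log-probability of each target token
    return -log_probs[np.arange(len(targets)), targets]

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0, 0.2]])   # seq_len=2, vocab=3
targets = np.array([0, 1])
losses = per_token_loss(logits, targets)
assert losses.shape == (2,)            # one dense loss term per token
```

Because this function is the same on every worker, scaling it is just data parallelism: shard the dataset, run the identical step everywhere, and all-reduce gradients.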

On-policy RL (GRPO and similar methods): sparse + delayed

  • A scalar reward at the end of an episode (or at the end of a generated trajectory).
  • The training step depends on data generated by the current policy, not a fixed dataset — so rollout generation is part of every step.
  • Individual sub-stages — policy update, rollout generation, reference model inference, reward model scoring — can each be implemented as SPMD workloads, but the end-to-end algorithm needs explicit coordination across stages.

Netflix's framing:

"You're constantly handing off artifacts (prompts, sampled trajectories, rewards, advantages) across stages and synchronizing their lifecycle." (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)
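The handoff the quote describes can be sketched as a single-controller loop. The stage functions below are hypothetical stand-ins for what would be SPMD workloads in a real framework; the point is that the controller's only job is passing artifacts (prompts, trajectories, rewards, advantages) between stages and keeping their lifecycles in sync.

```python
def generate_rollouts(policy, prompts):
    # inference stage: sample one trajectory per prompt from the *current* policy
    return [f"{p} -> trajectory-v{policy}" for p in prompts]

def score(trajectories):
    # reward stage: one scalar per episode; non-differentiable from here on
    return [float(len(t)) for t in trajectories]  # stand-in reward

def compute_advantages(rewards):
    # e.g. a GRPO-style baseline: center each reward against the group mean
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def update_policy(policy, trajectories, advantages):
    # training stage: a gradient step on the policy (stubbed as a version bump)
    return policy + 1

def rl_step(policy, prompts):
    trajectories = generate_rollouts(policy, prompts)  # artifact: trajectories
    rewards = score(trajectories)                      # artifact: rewards
    advantages = compute_advantages(rewards)           # artifact: advantages
    return update_policy(policy, trajectories, advantages)

policy = 0
policy = rl_step(policy, ["prompt-a", "prompt-b"])
```

Each stage can individually be SPMD, but the loop itself cannot be: rollouts must finish before scoring starts, and the next step's rollouts depend on the just-updated policy.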

Why signal shape drives infrastructure shape

Property          | SFT (dense + immediate)              | On-policy RL (sparse + delayed)
------------------|--------------------------------------|-------------------------------------------------
Loss per step     | Per token, every position            | One scalar per episode
Differentiability | End-to-end                           | Reward source typically non-differentiable
Data              | Fixed dataset, shardable in advance  | Generated by current policy at every step
Worker roles      | One kind (run the step function)     | Policy / Rollout / Reward / Reference
Execution model   | SPMD                                 | SPMD sub-stages under a single controller
Scaling pattern   | Launch more identical workers        | Reallocate resources across phase-specific roles

The 2025 turning point

Per the Netflix post:

"We initially designed the library around Supervised Fine-Tuning (SFT): relatively static data flow, a single training loop, and a Single Program, Multiple Data (SPMD) execution model. That assumption stopped holding in 2025. With DeepSeek-R1 and the broader adoption of efficient on-policy RL methods like GRPO, SFT became table stakes rather than the finish line."

The industry-wide post-training frontier shifted such that any framework built only around the dense+immediate signal shape needed a refactor to support sparse+delayed signals — not just by adding new loss functions, but by changing execution orchestration.

Practical implication

If you're building an LLM post-training platform today, design the execution model to be hybrid from day one: SFT stays SPMD, RL runs SPMD sub-stages under a single controller, and the user-facing API is unified so developers can move between them without switching mental models. See patterns/hybrid-single-controller-plus-spmd-rl.
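One way to picture the unified user-facing API is a shared trainer interface: the names below (`Trainer`, `SFTTrainer`, `RLTrainer`) are hypothetical, not Netflix's actual API, but they show how one `.step()` contract can front both a pure-SPMD backend and a controller-orchestrated one.

```python
from typing import Protocol

class Trainer(Protocol):
    """Unified interface: the same .step() call whether the backend
    is pure SPMD (SFT) or single-controller-orchestrated (RL)."""
    def step(self) -> float: ...

class SFTTrainer:
    # pure SPMD: every worker would run this same step on its own shard
    def __init__(self, batches):
        self.batches = iter(batches)   # fixed dataset, sharded in advance
    def step(self) -> float:
        batch = next(self.batches)
        return sum(batch) / len(batch)  # stand-in for forward/backward/optimize

class RLTrainer:
    # single controller: each step regenerates data from the current policy
    def __init__(self, prompts):
        self.prompts = prompts
        self.version = 0
    def step(self) -> float:
        rollouts = [len(p) for p in self.prompts]  # stand-in rollout stage
        rewards = [r % 3 for r in rollouts]        # stand-in reward stage
        self.version += 1                          # stand-in policy update
        return sum(rewards) / len(rewards)

def train(trainer: Trainer, steps: int) -> list[float]:
    # developers drive SFT and RL the same way, without switching mental models
    return [trainer.step() for _ in range(steps)]
```

The design choice this illustrates: keep the execution-model split (SPMD vs. single controller) behind the interface, so it is an infrastructure concern rather than something every post-training user has to reason about.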
