On-policy RL vs SFT signal shape¶
Definition¶
The signal shape of an LLM post-training method — whether the learning signal is dense and immediate (per-token, per-batch, differentiable end-to-end) or sparse and delayed (a scalar reward per episode, computed after rollout, non-differentiable through the reward source) — determines which distributed-execution model a training framework can support. Netflix uses this contrast to explain why its Post-Training Framework had to evolve from pure-SPMD to a hybrid execution model with a single-controller layer when adding RL.
First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix.
The two shapes¶
SFT (and pre-training, DPO, knowledge distillation): dense + immediate¶
- For each token position, compute logits over the full vocabulary.
- Compute a differentiable loss at each position.
- Backpropagate end-to-end through the model in a single step.
Because the signal is dense and immediate, an SFT step looks identical on every worker in an SPMD cluster: every GPU runs the same forward/backward/optimize function on a different data shard, synchronizing through collectives. This maps cleanly to pre-training infrastructure and scales by launching more identical workers, as the sketch below illustrates.
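A minimal sketch of one such step in plain PyTorch (not Netflix's code; the model, optimizer, and batch objects are assumed to exist, and labels are assumed to be pre-shifted for next-token prediction):

```python
import torch
import torch.nn.functional as F

def sft_step(model, optimizer, batch):
    """One dense + immediate step: a differentiable loss at every token position."""
    input_ids = batch["input_ids"]            # (batch, seq_len)
    labels = batch["labels"]                  # (batch, seq_len), -100 on masked positions
    logits = model(input_ids)                 # (batch, seq_len, vocab): full vocabulary per position
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # one logit vector per token position
        labels.reshape(-1),
        ignore_index=-100,
    )
    loss.backward()                           # end-to-end backprop in the same step
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Under SPMD, every rank runs exactly this function on its own shard; a wrapper such as DistributedDataParallel all-reduces the gradients, so scaling is simply more identical ranks.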
On-policy RL (GRPO, PPO, and related): sparse + delayed¶
- A scalar reward at the end of an episode (or at the end of a generated trajectory); the sketch after this list shows how that single scalar becomes a per-token training signal.
- The training step depends on data generated by the current policy, not a fixed dataset — so rollout generation is part of every step.
- Individual sub-stages — policy update, rollout generation, reference model inference, reward model scoring — can each be implemented as SPMD workloads, but the end-to-end algorithm needs explicit coordination across stages.
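To make the sparse + delayed shape concrete, here is a hedged GRPO-style sketch of turning one scalar reward per sampled trajectory into a group-relative advantage; the shapes, numbers, and function name are illustrative only:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size), one scalar per sampled trajectory."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)        # normalized within each prompt's group

# Four completions sampled for one prompt, each scored once after generation.
rewards = torch.tensor([[0.0, 1.0, 1.0, 0.0]])
advantages = group_relative_advantages(rewards)  # positive for above-average completions
# Each scalar advantage is then broadcast to every token of its trajectory
# when the policy-gradient loss is finally computed.
```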
Netflix's framing:
"You're constantly handing off artifacts (prompts, sampled trajectories, rewards, advantages) across stages and synchronizing their lifecycle." (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)
Why signal shape drives infrastructure shape¶
| Property | SFT (dense + immediate) | On-policy RL (sparse + delayed) |
|---|---|---|
| Loss per step | Per token, every position | One scalar per episode |
| Differentiability | End-to-end | Reward source typically non-differentiable |
| Data | Fixed dataset shardable in advance | Generated by current policy at every step |
| Worker roles | One kind (run the step function) | Policy / Rollout / Reward / Reference |
| Execution model | SPMD | SPMD sub-stages under a single controller |
| Scaling pattern | Launch more identical workers | Reallocate resources across phase-specific roles |
The 2025 turning point¶
Per the Netflix post:
"We initially designed the library around Supervised Fine-Tuning (SFT): relatively static data flow, a single training loop, and a Single Program, Multiple Data (SPMD) execution model. That assumption stopped holding in 2025. With DeepSeek-R1 and the broader adoption of efficient on-policy RL methods like GRPO, SFT became table stakes rather than the finish line."
The industry-wide post-training frontier shifted such that any framework built only around the dense+immediate signal shape needed a refactor to support sparse+delayed signals — not just by adding new loss functions, but by changing execution orchestration.
Practical implication¶
If you're building an LLM post-training platform today, design the execution model to be hybrid from day one: SFT stays SPMD, RL runs SPMD sub-stages under a single controller, and the user-facing API is unified so developers can move between them without switching mental models. See patterns/hybrid-single-controller-plus-spmd-rl.
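One way to read that advice as code, with a hypothetical `launch` entry point and placeholder backends (none of these names come from the Netflix post):

```python
from dataclasses import dataclass

@dataclass
class TrainJob:
    algorithm: str   # "sft", "dpo", "distill", "grpo", "ppo", ...
    model: str
    dataset: str

def run_spmd(job: TrainJob) -> None:
    ...              # placeholder: launch N identical workers (e.g. via torchrun)

def run_single_controller(job: TrainJob) -> None:
    ...              # placeholder: a driver coordinating rollout/reward/reference/policy groups

def launch(job: TrainJob) -> None:
    if job.algorithm in {"sft", "dpo", "distill"}:
        run_spmd(job)               # dense + immediate: same step function on every worker
    else:
        run_single_controller(job)  # sparse + delayed: SPMD sub-stages under one controller
```

The point of the sketch is that the dispatch, not the user-facing job description, is where the two signal shapes diverge.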