CONCEPT Cited by 1 source

On-policy RL vs SFT signal shape

Definition

The signal shape of an LLM post-training method — whether the learning signal is dense and immediate (per-token, per-batch, differentiable end-to-end) or sparse and delayed (a scalar reward per episode, computed after rollout, non-differentiable through the reward source) — determines which distributed-execution model a training framework can support. Netflix uses this contrast to explain why its Post-Training Framework had to evolve from pure-SPMD to a hybrid execution model with a single-controller layer when adding RL.

First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix.

The two shapes

SFT (and pre-training, DPO, knowledge distillation): dense + immediate

  • For each token position, compute logits over the full vocabulary.
  • Compute a differentiable loss at each position.
  • Backpropagate end-to-end through the model in a single step.

Because the signal is dense and immediate, an SFT step looks identical across every worker in an SPMD cluster — every GPU runs the same forward/backward/optimize function on a different data shard, synchronizing through collectives. This maps cleanly onto pre-training infrastructure and scales by launching more identical workers.
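The dense per-token signal can be made concrete with a minimal NumPy sketch (names and shapes are illustrative, not from the Netflix framework): every position in the sequence contributes its own cross-entropy term, so one backward pass through the model covers the entire batch.

```python
import numpy as np

def per_token_loss(logits, targets):
    """Dense SFT signal: one cross-entropy term per token position.

    logits:  (seq_len, vocab_size) array of unnormalized scores
    targets: (seq_len,) array of target token ids
    Returns one loss per position; every position contributes a
    differentiable term, so a single backward pass updates the model.
    """
    # numerically stable log-softmax over the vocabulary at each position
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # negative log-probability of each target token
    return -log_probs[np.arange(len(targets)), targets]

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0, 0.2]])   # seq_len=2, vocab=3
targets = np.array([0, 1])
losses = per_token_loss(logits, targets)
assert losses.shape == (2,)            # one dense loss term per token
```

Because this function is the same on every worker, scaling it is just data parallelism: shard the dataset, run the identical step everywhere, and all-reduce gradients.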

On-policy RL (GRPO and similar methods): sparse + delayed

  • A scalar reward at the end of an episode (or at the end of a generated trajectory).
  • The training step depends on data generated by the current policy, not a fixed dataset — so rollout generation is part of every step.
  • Individual sub-stages — policy update, rollout generation, reference model inference, reward model scoring — can each be implemented as SPMD workloads, but the end-to-end algorithm needs explicit coordination across stages.

Netflix's framing:

"You're constantly handing off artifacts (prompts, sampled trajectories, rewards, advantages) across stages and synchronizing their lifecycle." (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)
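The handoff the quote describes can be sketched as a single-controller loop. The stage functions below are hypothetical stand-ins for what would be SPMD workloads in a real framework; the point is that the controller's only job is passing artifacts (prompts, trajectories, rewards, advantages) between stages and keeping their lifecycles in sync.

```python
def generate_rollouts(policy, prompts):
    # inference stage: sample one trajectory per prompt from the *current* policy
    return [f"{p} -> trajectory-v{policy}" for p in prompts]

def score(trajectories):
    # reward stage: one scalar per episode; non-differentiable from here on
    return [float(len(t)) for t in trajectories]  # stand-in reward

def compute_advantages(rewards):
    # e.g. a GRPO-style baseline: center each reward against the group mean
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def update_policy(policy, trajectories, advantages):
    # training stage: a gradient step on the policy (stubbed as a version bump)
    return policy + 1

def rl_step(policy, prompts):
    trajectories = generate_rollouts(policy, prompts)  # artifact: trajectories
    rewards = score(trajectories)                      # artifact: rewards
    advantages = compute_advantages(rewards)           # artifact: advantages
    return update_policy(policy, trajectories, advantages)

policy = 0
policy = rl_step(policy, ["prompt-a", "prompt-b"])
```

Each stage can individually be SPMD, but the loop itself cannot be: rollouts must finish before scoring starts, and the next step's rollouts depend on the just-updated policy.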

Why signal shape drives infrastructure shape

Property          | SFT (dense + immediate)              | On-policy RL (sparse + delayed)
------------------|--------------------------------------|-------------------------------------------------
Loss per step     | Per token, every position            | One scalar per episode
Differentiability | End-to-end                           | Reward source typically non-differentiable
Data              | Fixed dataset, shardable in advance  | Generated by current policy at every step
Worker roles      | One kind (run the step function)     | Policy / Rollout / Reward / Reference
Execution model   | SPMD                                 | SPMD sub-stages under a single controller
Scaling pattern   | Launch more identical workers        | Reallocate resources across phase-specific roles

The 2025 turning point

Per the Netflix post:

"We initially designed the library around Supervised Fine-Tuning (SFT): relatively static data flow, a single training loop, and a Single Program, Multiple Data (SPMD) execution model. That assumption stopped holding in 2025. With DeepSeek-R1 and the broader adoption of efficient on-policy RL methods like GRPO, SFT became table stakes rather than the finish line."

The industry-wide post-training frontier shifted such that any framework built only around the dense+immediate signal shape needed a refactor to support sparse+delayed signals — not just by adding new loss functions, but by changing execution orchestration.

Practical implication

If you're building an LLM post-training platform today, design the execution model to be hybrid from day one: SFT stays SPMD, RL runs SPMD sub-stages under a single controller, and the user-facing API is unified so developers can move between them without switching mental models. See patterns/hybrid-single-controller-plus-spmd-rl.
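One way to picture the unified user-facing API is a shared trainer interface: the names below (`Trainer`, `SFTTrainer`, `RLTrainer`) are hypothetical, not Netflix's actual API, but they show how one `.step()` contract can front both a pure-SPMD backend and a controller-orchestrated one.

```python
from typing import Protocol

class Trainer(Protocol):
    """Unified interface: the same .step() call whether the backend
    is pure SPMD (SFT) or single-controller-orchestrated (RL)."""
    def step(self) -> float: ...

class SFTTrainer:
    # pure SPMD: every worker would run this same step on its own shard
    def __init__(self, batches):
        self.batches = iter(batches)   # fixed dataset, sharded in advance
    def step(self) -> float:
        batch = next(self.batches)
        return sum(batch) / len(batch)  # stand-in for forward/backward/optimize

class RLTrainer:
    # single controller: each step regenerates data from the current policy
    def __init__(self, prompts):
        self.prompts = prompts
        self.version = 0
    def step(self) -> float:
        rollouts = [len(p) for p in self.prompts]  # stand-in rollout stage
        rewards = [r % 3 for r in rollouts]        # stand-in reward stage
        self.version += 1                          # stand-in policy update
        return sum(rewards) / len(rewards)

def train(trainer: Trainer, steps: int) -> list[float]:
    # developers drive SFT and RL the same way, without switching mental models
    return [trainer.step() for _ in range(steps)]
```

The design choice this illustrates: keep the execution-model split (SPMD vs. single controller) behind the interface, so it is an infrastructure concern rather than something every post-training user has to reason about.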
