Direct Preference Optimization (DPO)
Definition
Direct Preference Optimization (DPO) is a post-training method that trains an LLM on pairs of (preferred response, rejected response) examples by optimising a loss derived directly from the preference comparison: no separate reward model, no PPO-style rollouts. It occupies a similar niche to RLHF (refining tone, helpfulness, safety), but its dense, differentiable loss makes training look much more like SFT than like on-policy RL.
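The objective is a single differentiable function of the log-probabilities that the policy and a frozen reference model assign to the preferred and rejected responses. A minimal PyTorch sketch of the standard DPO loss, assuming sequence-level log-probs have already been computed (tensor names are illustrative, not taken from Netflix's framework):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on a batch of preference pairs.

    Each argument is a (batch,) tensor: the summed log-probability the
    policy / frozen reference model assigns to the preferred ("chosen")
    or rejected response. `beta` controls how far the policy may drift
    from the reference model.
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid of the implicit reward margin: raise the preferred
    # response's likelihood relative to the rejected one, anchored to
    # the reference model.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```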
First wiki mention: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix — DPO is one of the four standard recipes in Netflix's Post-Training Framework (alongside SFT, RL, and Knowledge Distillation).
Why DPO fits the SFT execution model
Because DPO's loss is:
- Dense (per-token contribution from both preferred and rejected trajectories),
- Immediate (computed within a single step from pre-collected preference pairs),
- Differentiable (end-to-end through the model, no reward-model indirection),
it runs under the same SPMD execution model that SFT uses. Netflix's framework exposes it as one of four recipes without needing the single-controller orchestration that on-policy RL requires.
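A sketch of what one DPO step looks like in practice, and why it slots into an SFT-style loop: two forward passes through the policy, two no-grad passes through the frozen reference, one backward, one optimizer step. The helper, batch keys, and HuggingFace-style `.logits` interface below are assumptions (reusing `dpo_loss` from the sketch above); nothing in the step queries an environment or a reward model.

```python
import torch

def sequence_logps(model, input_ids, response_mask):
    """Summed log-prob the model assigns to the response tokens of each sequence.

    `response_mask` is 1 on response positions and 0 on prompt/padding
    positions; both tensors have shape (batch, seq_len).
    """
    logits = model(input_ids).logits[:, :-1, :]        # predict token t+1 from token t
    labels = input_ids[:, 1:]
    token_logps = torch.log_softmax(logits, dim=-1).gather(
        -1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * response_mask[:, 1:]).sum(-1)

def dpo_train_step(model, ref_model, batch, optimizer, beta=0.1):
    """One DPO update: the same forward / loss / backward / step shape as SFT."""
    pol_c = sequence_logps(model, batch["chosen_ids"], batch["chosen_mask"])
    pol_r = sequence_logps(model, batch["rejected_ids"], batch["rejected_mask"])
    with torch.no_grad():                               # frozen reference, no gradients
        ref_c = sequence_logps(ref_model, batch["chosen_ids"], batch["chosen_mask"])
        ref_r = sequence_logps(ref_model, batch["rejected_ids"], batch["rejected_mask"])
    loss = dpo_loss(pol_c, pol_r, ref_c, ref_r, beta)   # dense, immediate, differentiable
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because every call here is an ordinary tensor op over pre-collected preference pairs, the step shards cleanly under data-parallel SPMD with nothing beyond the usual gradient synchronisation.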
Relationship to other post-training methods
- SFT — dense signal on correct outputs; teaches task-shape.
- DPO — dense signal on preference pairs; refines style/tone/safety without rollouts.
- On-policy RL (GRPO, PPO) — sparse+delayed scalar reward per rollout; forces single-controller orchestration.
A common sequence: base → CPT → SFT → DPO or on-policy RL.