Direct Preference Optimization (DPO)
Definition
Direct Preference Optimization (DPO) is a post-training method that trains an LLM on pairs of (preferred response, rejected response) examples by optimising a loss derived directly from the preference comparison: no separate reward model, no PPO-style rollouts. It occupies a similar niche to RLHF (refining tone, helpfulness, safety), but its dense, differentiable loss makes training look much more like SFT than like on-policy RL.
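The objective is a single differentiable function of the log-probabilities that the policy and a frozen reference model assign to the preferred and rejected responses. A minimal PyTorch sketch of the standard DPO loss, assuming sequence-level log-probs have already been computed (tensor names are illustrative, not taken from Netflix's framework):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on a batch of preference pairs.

    Each argument is a (batch,) tensor: the summed log-probability the
    policy / frozen reference model assigns to the preferred ("chosen")
    or rejected response. `beta` controls how far the policy may drift
    from the reference model.
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid of the implicit reward margin: raise the preferred
    # response's likelihood relative to the rejected one, anchored to
    # the reference model.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```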
First wiki mention: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix — DPO is one of the four standard recipes in Netflix's Post-Training Framework (alongside SFT, RL, and Knowledge Distillation).
Why DPO fits the SFT execution model
Because DPO's loss is:
- Dense (per-token contribution from both preferred and rejected trajectories),
- Immediate (computed within a single step from pre-collected preference pairs),
- Differentiable (end-to-end through the model, no reward-model indirection),
it runs under the same SPMD execution model that SFT uses. Netflix's framework exposes it as one of four recipes without needing the single-controller orchestration that on-policy RL requires.
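A sketch of what one DPO step looks like in practice, and why it slots into an SFT-style loop: two forward passes through the policy, two no-grad passes through the frozen reference, one backward, one optimizer step. The helper, batch keys, and HuggingFace-style `.logits` interface below are assumptions (reusing `dpo_loss` from the sketch above); nothing in the step queries an environment or a reward model.

```python
import torch

def sequence_logps(model, input_ids, response_mask):
    """Summed log-prob the model assigns to the response tokens of each sequence.

    `response_mask` is 1 on response positions and 0 on prompt/padding
    positions; both tensors have shape (batch, seq_len).
    """
    logits = model(input_ids).logits[:, :-1, :]        # predict token t+1 from token t
    labels = input_ids[:, 1:]
    token_logps = torch.log_softmax(logits, dim=-1).gather(
        -1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * response_mask[:, 1:]).sum(-1)

def dpo_train_step(model, ref_model, batch, optimizer, beta=0.1):
    """One DPO update: the same forward / loss / backward / step shape as SFT."""
    pol_c = sequence_logps(model, batch["chosen_ids"], batch["chosen_mask"])
    pol_r = sequence_logps(model, batch["rejected_ids"], batch["rejected_mask"])
    with torch.no_grad():                               # frozen reference, no gradients
        ref_c = sequence_logps(ref_model, batch["chosen_ids"], batch["chosen_mask"])
        ref_r = sequence_logps(ref_model, batch["rejected_ids"], batch["rejected_mask"])
    loss = dpo_loss(pol_c, pol_r, ref_c, ref_r, beta)   # dense, immediate, differentiable
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because every call here is an ordinary tensor op over pre-collected preference pairs, the step shards cleanly under data-parallel SPMD with nothing beyond the usual gradient synchronisation.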
Relationship to other post-training methods
- SFT — dense signal on correct outputs; teaches task-shape.
- DPO — dense signal on preference pairs; refines style/tone/safety without rollouts.
- On-policy RL (GRPO, PPO) — sparse+delayed scalar reward per rollout; forces single-controller orchestration.
A common sequence: base → CPT → SFT → DPO or on-policy RL.