

Direct Preference Optimization (DPO)

Definition

Direct Preference Optimization (DPO) is a post-training method that trains an LLM on pairs of (preferred response, rejected response) examples by optimizing a loss derived directly from the preference comparison — no separate reward model, no PPO-style rollouts. It occupies a similar niche to RLHF (refining tone, helpfulness, safety) but with a dense, differentiable loss whose shape is closer to SFT than to on-policy RL.

First wiki mention: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix — DPO is one of the four standard recipes in Netflix's Post-Training Framework (alongside SFT, RL, and Knowledge Distillation).

Why DPO fits the SFT execution model

Because DPO's loss is:

  • Dense (per-token contribution from both preferred and rejected trajectories),
  • Immediate (computed within a single step from pre-collected preference pairs),
  • Differentiable (end-to-end through the model, no reward-model indirection),

it runs under the same SPMD execution model that SFT uses. Netflix's framework exposes it as one of four recipes without needing the single-controller orchestration that on-policy RL requires.
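The three properties above can be made concrete with a small sketch. Assuming PyTorch, with function names and the β = 0.1 default chosen for illustration (not taken from Netflix's implementation): per-token log-probabilities are gathered over both responses ("dense"), summed into sequence scores, and fed into the DPO objective −log σ(β[(log π_θ(y_w) − log π_ref(y_w)) − (log π_θ(y_l) − log π_ref(y_l))]) in a single differentiable step.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor,
                     response_mask: torch.Tensor) -> torch.Tensor:
    """Sum the per-token log-probs of `labels` under `logits`.

    logits: (T, V) model outputs; labels: (T,) token ids;
    response_mask: (T,) with 1.0 on response tokens, 0.0 on prompt tokens.
    Every response token contributes a gradient term: the "dense" part.
    """
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * response_mask).sum()

def dpo_loss(policy_chosen: torch.Tensor, policy_rejected: torch.Tensor,
             ref_chosen: torch.Tensor, ref_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective on sequence log-probs (each tensor of shape (batch,)).

    Computed in one forward/backward pass from pre-collected preference
    pairs ("immediate") and differentiable end-to-end through the policy
    ("differentiable"); the reference model is frozen, so no reward-model
    indirection and no rollouts.
    """
    chosen_ratio = policy_chosen - ref_chosen        # log pi_theta/pi_ref on y_w
    rejected_ratio = policy_rejected - ref_rejected  # log pi_theta/pi_ref on y_l
    # -log sigmoid(beta * margin): shrinks as the policy widens the margin
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Because both functions are ordinary tensor ops over a fixed batch, a DPO step looks to the executor exactly like an SFT step, which is why it slots into the same SPMD recipe.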

Relationship to other post-training methods

  • SFT — dense signal on correct outputs; teaches task-shape.
  • DPO — dense signal on preference pairs; refines style/tone/safety without rollouts.
  • On-policy RL (GRPO, PPO) — sparse+delayed scalar reward per rollout; forces single-controller orchestration.

A common sequence: base → CPT → SFT → DPO or on-policy RL.
