
Netflix Synopsis Judge

Netflix Synopsis Judge is Netflix's production LLM-as-a-Judge system for scoring the quality of the show synopses shown to members as personalised promotional assets. It scores synopses along four creative-quality dimensions (tone, clarity, precision, factuality) at ≥85% agreement with creative writers, and its scores are validated downstream against take fraction and abandonment rate — the two A/B-validated short-term behavioural proxies for long-term member retention. (Source: sources/2026-04-10-netflix-evaluating-netflix-show-synopses-with-llm-as-a-judge)

Purpose in Netflix's authoring workflow

The system's named production win is proactive pre-launch regression detection: "allowing us to proactively identify and fix impactful issues weeks or months before a show debuts on Netflix." Human creative writers remain the authoritative source of creative quality; the judge is a scaling + consistency layer that catches failing criteria across hundreds of thousands of synopses, usually with multiple variants per show, before human review would otherwise be scheduled.

Two quality definitions Netflix jointly tracks

  • Creative quality — judged against an internal creative-writing rubric by expert writers; the judge learns to approximate this.
  • Member implicit feedback — take fraction + abandonment rate, A/B-validated short-term proxies for long-term retention.

The judge panel is architected to approximate creative quality; the member-feedback axis is used as external validation that the judge scores capture "factors that matter to members" — not as the judge's training objective.

Judge panel architecture

  • One dedicated judge per criterion. Single-prompt multi-criterion judging was found to "overload the LLM"; dedicated per-criterion judges perform strictly better. Same LLM backbone used across criteria; each criterion has its own prompt, metadata set, rubric slice, and zero-shot chain-of-thought elicitation before the binary decision.
  • Binary outputs throughout. Each criterion judge emits a pass / fail binary decision preceded by a rationale. Accuracy on the golden set is computed as simple agreement with the human-consensus binary label. (Rationale: concepts/binary-vs-likert-scoring.)
  • Tiered rationales for subjective criteria. On tone-like criteria the judge "reasons at any length but concisely summarises its reasoning process prior to the final score." Tiered rationales alone lifted tone binary accuracy from 86.55% to 87.85%.
  • Consensus scoring on long-rationale judges. For tone + clarity, 5× consensus scoring (rounded-average across 5 samples) yields a clear accuracy boost. For precision (short chain-of-thought), consensus yielded no benefit — short rationales produce near-zero-variance scores, so consensus has nothing to average out.
  • Factuality as an agent panel. Factuality is decomposed into four narrow agents: (1) plot (requires plot summary or script), (2) metadata (genre / location / release date), (3) on-/off-screen talent, (4) awards. Each agent emits its own rationale + binary score. System-level factuality score = minimum across agents: "any failed aspect yields an overall fail." An LLM aggregator merges per-agent rationales into a combined human-readable explanation. Tiered rationales + consensus scoring apply within each agent.
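The two aggregation rules above — rounded-average consensus over 5 samples, and minimum-across-agents for factuality — can be sketched as follows. This is a minimal illustration, not Netflix's code: `sample_judge` stands in for one LLM call returning a binary verdict, and the agent names are taken from the list above.

```python
from statistics import mean

def consensus_score(sample_judge, n=5):
    # 5x consensus scoring: sample the judge n times and round the
    # average of its binary (0/1) verdicts to a final pass/fail.
    return round(mean(sample_judge() for _ in range(n)))

def factuality_score(agent_scores):
    # System-level factuality is the minimum across the narrow agents
    # (plot, metadata, talent, awards): any failing aspect yields an
    # overall fail.
    return min(agent_scores.values())
```

With 5 samples the mean is always k/5, so the rounded average is a clean majority vote; the `min` rule makes factuality a strict conjunction over its agents.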

The human golden set (upstream of the judges)

Netflix built the golden-set labels via a three-stage sequence:

  1. Multi-writer parallel labelling. ~1,000 diverse synopses initially labelled by three expert writers each.
  2. Calibration rounds. Eight rounds × ~50 synopses/round, surfacing disagreements and refining the rubric. Three named interventions during calibration lifted inter-writer agreement to ~80%:
       • Switch from 1–4 Likert → binary scoring (concepts/binary-vs-likert-scoring).
       • Allow writers to reference past labelled examples.
       • Maintain a searchable taxonomy of common errors.
  3. Model-in-the-loop consensus. Multiple writers score each remaining synopsis; an LLM guided by the rubric aggregates to a final binary label; writers review cases with substantial disagreement. Output: ~600-synopsis golden set with binary criteria-level scores + explanations — "our North Star for aligning an LLM judge with expert opinion."
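The model-in-the-loop consensus stage can be sketched roughly as below. Everything here is an assumption for illustration: `llm_adjudicate` stands in for the rubric-guided LLM, and the one-third minority threshold for "substantial disagreement" is invented — the post does not define it.

```python
def aggregate_labels(writer_labels, llm_adjudicate):
    # Unanimous writer labels pass through directly, no review needed.
    n, passes = len(writer_labels), sum(writer_labels)
    if passes in (0, n):
        return (1 if passes else 0), False
    # On disagreement, a rubric-guided LLM proposes the final binary
    # label (hypothetical call); near-even splits are flagged for
    # writer review.
    label = llm_adjudicate(writer_labels)
    needs_review = min(passes, n - passes) / n >= 1 / 3  # assumed threshold
    return label, needs_review
```

The point of the structure is that human time is spent only on the contested tail, while the LLM resolves the bulk of mild disagreements under the same rubric the writers calibrated on.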

Prompt engineering + APO

  • Automatic Prompt Optimization is applied per-criterion over a ~300-sample dev set; scoring guidelines are passed as optimiser context.
  • Candidate prompts emitted by APO are manually refined with LLM assistance before selection.
  • Post-APO accuracy varies substantially by criterion (precision works out-of-the-box; clarity needs tiered rationales + consensus).
  • APO is the 2023-era predecessor to DSPy's optimiser flywheel; Netflix does not report a DSPy comparison on this system.
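The selection step of the per-criterion APO loop reduces to scoring candidate prompts by simple agreement with the golden binary labels on the dev set and keeping the best. A toy sketch (the APO framework, prompt format, and field names are all undisclosed; `judge` is a hypothetical stand-in for one judged LLM call):

```python
def select_prompt(candidate_prompts, dev_set, judge):
    # Score each candidate prompt by simple agreement with the
    # human-consensus binary labels on the ~300-sample dev set;
    # the best candidate survives to manual refinement.
    def accuracy(prompt):
        hits = sum(judge(prompt, ex["synopsis"]) == ex["label"] for ex in dev_set)
        return hits / len(dev_set)
    return max(candidate_prompts, key=accuracy)
```

In the real system this sits inside an optimiser that also generates new candidates from the scoring guidelines; only the selection criterion (binary agreement) is stated in the post.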

Reasoning-model experiment: skipped for cost

Netflix tested true reasoning models (models that produce long reasoning trajectories before output) on tone with 5× consensus. Increasing reasoning effort improved tone accuracy monotonically and, at maximum effort, beat tiered rationales. Netflix nonetheless skipped reasoning models in the final system — the inference cost increase was judged too large for "only a marginal performance gain." This is the canonical wiki data point for the cost-vs-accuracy tradeoff of reasoning models in a production LLM-as-judge deployment.

External validation: judge scores vs member behaviour

Netflix validates the judge panel end-to-end by regressing within-show changes in LLM score against within-show changes in take fraction and abandonment rate, normalising by show-level standard deviation and clustering standard errors by show:

  • Precision + clarity are especially predictive individually.
  • A Weighted Score combining all criteria gives a statistically useful signal of higher take fraction + lower abandonment.
  • Netflix explicitly flags the analysis is observational, not experimental: "we don't have clean, experimental variation in LLM scores" — so the finding is interpreted as "predictive value and practical utility", not causal.

This is a two-stage evaluation architecture — golden-set alignment is necessary but not sufficient; behavioural correlation is the external anchor.
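The core of the first-stage relationship can be illustrated with a plain OLS slope on within-show deltas. This is a deliberate simplification: the post additionally normalises by show-level standard deviation and clusters standard errors by show, and the variable names here are assumptions.

```python
def within_show_slope(d_llm_score, d_take_fraction):
    # OLS slope of the within-show change in take fraction on the
    # within-show change in judge score; a positive slope means
    # higher judge scores predict higher take fraction.
    n = len(d_llm_score)
    mx = sum(d_llm_score) / n
    my = sum(d_take_fraction) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(d_llm_score, d_take_fraction))
    var = sum((x - mx) ** 2 for x in d_llm_score)
    return cov / var
```

Differencing within show removes fixed show-level quality, so the slope reflects whether a synopsis scoring better than that show's other variants also performs better behaviourally — which is exactly why the finding is predictive, not causal: the score changes themselves are not randomised.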

Scale + adoption

  • "Hundreds of thousands of synopses, usually with multiple variants per show."
  • ≥85% agreement with creative writers as the system-level headline.
  • "Widespread adoption in the Netflix synopsis authoring workflow" as the deployment claim.

What the post does not disclose

  • Judge LLM identity (model / provider / version).
  • Reasoning-model identity and per-effort-level accuracy numbers.
  • APO tool / framework used to run optimisation at scale.
  • Inference cost of 5× consensus or the 4-agent factuality panel.
  • End-to-end accuracies for all criteria beyond tone's 86.55% → 87.85% — the post shows per-criterion accuracy as bar charts but doesn't quote the numbers in prose.
  • Number of criteria in the final system — tone / clarity / precision / factuality are named; others implied.
  • Production-ingress plumbing — batch vs on-write scheduling, judge-score storage, writer-tool integration, the alerting / gating mechanism that drives "weeks or months before debut" intervention.
  • Non-English synopses — Netflix serves many locales; the post doesn't say whether the same judges run across locales or whether per-locale judges exist.
  • Whether creative-writer inter-rater agreement re-calibration is continuous — golden set is a snapshot; rubrics "change over time as quality standards evolve", so re-calibration is implied but not described.

Relation to other wiki LLM-as-judge instances
