

NMSE (Normalized Mean Squared Error)

Normalized Mean Squared Error (NMSE) is the judge-vs-human agreement metric used by Dropbox Dash's relevance judge for graded labels (Source: sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy). It is mean squared error (MSE) computed on the 1–5 relevance scale, then rescaled to a 0–100 range.

  • 0 = perfect agreement (judge matches every human label exactly).
  • 100 = worst-case disagreement (every pair is maximally far apart, e.g. the judge says 1 where humans say 5).
  • Small disagreements (4 vs 5) contribute a small squared penalty; large disagreements (1 vs 5) contribute a much larger one — same quadratic shape as MSE.

Why "normalized"

Raw MSE on the 1–5 scale has range 0–16, since the worst possible per-pair error is (5 − 1)² = 16 (see the companion metric used in sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search). Rescaling to 0–100 makes successive optimisation runs easier to report as rounded scores and improvements as percentages: the gpt-oss-120b run went 8.83 → 4.86 (−45%), and the gemma-3-12b run went 46.88 → 17.26 (−63%). Both numbers derive from the same underlying per-example squared error, just projected onto a friendlier display scale.
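Under that convention, NMSE is plain MSE divided by the maximum possible squared error (16) and scaled to 100. A minimal sketch; the function name is mine, and the exact rescaling formula is inferred from the stated 0–16 and 0–100 ranges rather than quoted from the source:

```python
def nmse(human: list[int], judge: list[int]) -> float:
    """Normalized MSE on a 1-5 relevance scale, rescaled to 0-100.

    0 = the judge matches every human label exactly;
    100 = every pair disagrees maximally (1 vs 5).
    """
    assert len(human) == len(judge) and human
    mse = sum((h - j) ** 2 for h, j in zip(human, judge)) / len(human)
    return mse / 16 * 100  # 16 = (5 - 1)**2, the worst per-pair error

# A near-miss (4 vs 5) is penalized far less than a gross miss (1 vs 5):
print(nmse([5], [4]))  # 6.25
print(nmse([5], [1]))  # 100.0
```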

Why it's the optimiser objective

  • Bounded. Bounded both above and below, so optimisation progress is reportable as a percentage.
  • Smooth. Gradients (or DSPy's reflection feedback) receive a graded signal rather than a binary pass/fail: a 4-vs-5 miss produces a different feedback shape than a 1-vs-5 miss.
  • Human-interpretable. Unlike rank correlations, a 45% reduction in NMSE is easy to explain as "the judge is closer to humans on average."
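The graded-signal point can be made concrete: two judges with identical exact-match accuracy get very different NMSE depending on how far their misses land. A sketch, using the assumed MSE / 16 × 100 rescaling (the scenario data is illustrative, not from the source):

```python
human = [5, 5, 5, 5]
near  = [5, 5, 4, 4]   # two misses, each off by 1
gross = [5, 5, 1, 1]   # two misses, each off by 4

def nmse(h, j):
    # Assumed formula: MSE on the 1-5 scale over its 0-16 range, times 100.
    return sum((a - b) ** 2 for a, b in zip(h, j)) / len(h) / 16 * 100

def accuracy(h, j):
    # Binary pass/fail: exact label match only.
    return sum(a == b for a, b in zip(h, j)) / len(h)

# Pass/fail sees no difference between the two judges; NMSE does.
print(accuracy(human, near), accuracy(human, gross))  # 0.5 0.5
print(nmse(human, near), nmse(human, gross))          # 3.125 50.0
```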

Usage at Dash

  • Objective fed to DSPy's GEPA and MIPROv2 optimisers when adapting the relevance judge across models.
  • Also gates the human-calibrated LLM-labeling pipeline: the judge must hit a minimum NMSE against the human seed set before it's allowed to generate production training labels.
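The gating behaviour in the second bullet amounts to a threshold check against the human seed set. A sketch; the function name and the cutoff value are hypothetical, since the source does not publish the actual threshold:

```python
NMSE_GATE = 10.0  # hypothetical cutoff; the real value is not published

def may_generate_labels(judge_ratings, human_seed_ratings):
    """Gate: the judge may label production training data only if it is
    close enough to the human seed set under NMSE (assumed MSE/16*100)."""
    n = len(human_seed_ratings)
    mse = sum((h - j) ** 2
              for h, j in zip(human_seed_ratings, judge_ratings)) / n
    return (mse / 16 * 100) <= NMSE_GATE

# One 4-vs-5 miss out of three pairs passes; total disagreement does not.
print(may_generate_labels([5, 4, 4], [5, 4, 5]))  # True
print(may_generate_labels([1, 1, 1], [5, 5, 5]))  # False
```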

Relationship to NDCG

NMSE is a judge-vs-human metric: how close is the LLM rating to the human rating for each (query, doc) pair? NDCG is a ranker-vs-labels metric: how close is the ranker's ordering to the ideal ordering under the labels? The chain is:

judge prompt ──► NMSE (vs human) ──► trusted labels ──► NDCG (vs labels) ──► trusted ranker

NMSE caps NDCG validity. If the judge disagrees with humans, NDCG-measured ranker wins may be illusory.
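The downstream half of that chain can be sketched with a textbook NDCG implementation (graded gains, log2 discount). This is the standard formulation, not necessarily Dash's exact variant, and the relevance lists are illustrative:

```python
import math

def dcg(rels):
    # Graded-gain DCG: (2^rel - 1) / log2(rank + 1), ranks starting at 1.
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(ranked_rels):
    """NDCG of a ranker's ordering, given per-document relevance labels.
    1.0 means the ordering is ideal under those labels."""
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal else 0.0

# Labels here come from the judge; if the judge disagrees with humans
# (high NMSE), this score is computed against the wrong target.
print(ndcg([3, 2, 3, 0, 1]))
```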

Tradeoffs

  • Still a single scalar. Hides directional bias (does the judge consistently over-rate long documents?) and failure-mode shape. Dash pairs NMSE with patterns/behavior-discrepancy-sampling to recover direction.
  • Ignores output-validity failures. A malformed JSON response can't be scored; Dash treats those as fully incorrect. This makes JSON validity rate a separate, co-equal axis rather than a deduction from NMSE.
  • Rubric-anchored. Garbage rubric → NMSE converges fast to the wrong target.
