

NMSE (Normalized Mean Squared Error)

Normalized Mean Squared Error (NMSE) is the judge-vs-human agreement metric used by Dropbox Dash's relevance judge for graded labels (Source: sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy). It is mean squared error (MSE) computed on the 1–5 relevance scale, then rescaled to a 0–100 range.

  • 0 = perfect agreement (judge matches every human label exactly).
  • 100 = worst-case disagreement (every pair is maximally far apart, e.g. the judge says 1 where humans say 5).
  • Small disagreements (4 vs 5) contribute a small squared penalty; large disagreements (1 vs 5) contribute a much larger one — same quadratic shape as MSE.

Why "normalized"

Raw MSE on the 1–5 scale has range 0–16, since the worst possible per-pair error is (5 − 1)² = 16 (see the companion metric used in sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search). Rescaling to 0–100 makes successive optimisation runs easier to report as rounded scores and improvements as percentages: the gpt-oss-120b run went 8.83 → 4.86 (−45%), and the gemma-3-12b run went 46.88 → 17.26 (−63%). Both numbers derive from the same underlying per-example squared error, just projected onto a friendlier display scale.
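Under that convention, NMSE is plain MSE divided by the maximum possible squared error (16) and scaled to 100. A minimal sketch; the function name is mine, and the exact rescaling formula is inferred from the stated 0–16 and 0–100 ranges rather than quoted from the source:

```python
def nmse(human: list[int], judge: list[int]) -> float:
    """Normalized MSE on a 1-5 relevance scale, rescaled to 0-100.

    0 = the judge matches every human label exactly;
    100 = every pair disagrees maximally (1 vs 5).
    """
    assert len(human) == len(judge) and human
    mse = sum((h - j) ** 2 for h, j in zip(human, judge)) / len(human)
    return mse / 16 * 100  # 16 = (5 - 1)**2, the worst per-pair error

# A near-miss (4 vs 5) is penalized far less than a gross miss (1 vs 5):
print(nmse([5], [4]))  # 6.25
print(nmse([5], [1]))  # 100.0
```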

Why it's the optimiser objective

  • Bounded. Bounded both above and below, so optimisation progress is reportable as a percentage.
  • Smooth. Gradients (or DSPy's reflection feedback) receive a graded signal rather than a binary pass/fail: a 4-vs-5 miss produces a different feedback shape than a 1-vs-5 miss.
  • Human-interpretable. Unlike rank correlations, a 45% reduction in NMSE is easy to explain as "the judge is closer to humans on average."
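The graded-signal point can be made concrete: two judges with identical exact-match accuracy get very different NMSE depending on how far their misses land. A sketch, using the assumed MSE / 16 × 100 rescaling (the scenario data is illustrative, not from the source):

```python
human = [5, 5, 5, 5]
near  = [5, 5, 4, 4]   # two misses, each off by 1
gross = [5, 5, 1, 1]   # two misses, each off by 4

def nmse(h, j):
    # Assumed formula: MSE on the 1-5 scale over its 0-16 range, times 100.
    return sum((a - b) ** 2 for a, b in zip(h, j)) / len(h) / 16 * 100

def accuracy(h, j):
    # Binary pass/fail: exact label match only.
    return sum(a == b for a, b in zip(h, j)) / len(h)

# Pass/fail sees no difference between the two judges; NMSE does.
print(accuracy(human, near), accuracy(human, gross))  # 0.5 0.5
print(nmse(human, near), nmse(human, gross))          # 3.125 50.0
```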

Usage at Dash

  • Objective fed to DSPy's GEPA and MIPROv2 optimisers when adapting the relevance judge across models.
  • Also gates the human-calibrated LLM-labeling pipeline: the judge must hit a minimum NMSE against the human seed set before it's allowed to generate production training labels.
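The gating behaviour in the second bullet amounts to a threshold check against the human seed set. A sketch; the function name and the cutoff value are hypothetical, since the source does not publish the actual threshold:

```python
NMSE_GATE = 10.0  # hypothetical cutoff; the real value is not published

def may_generate_labels(judge_ratings, human_seed_ratings):
    """Gate: the judge may label production training data only if it is
    close enough to the human seed set under NMSE (assumed MSE/16*100)."""
    n = len(human_seed_ratings)
    mse = sum((h - j) ** 2
              for h, j in zip(human_seed_ratings, judge_ratings)) / n
    return (mse / 16 * 100) <= NMSE_GATE

# One 4-vs-5 miss out of three pairs passes; total disagreement does not.
print(may_generate_labels([5, 4, 4], [5, 4, 5]))  # True
print(may_generate_labels([1, 1, 1], [5, 5, 5]))  # False
```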

Relationship to NDCG

NMSE is a judge-vs-human metric: how close is the LLM rating to the human rating for each (query, doc) pair? NDCG is a ranker-vs-labels metric: how close is the ranker's ordering to the ideal ordering under the labels? The chain is:

judge prompt ──► NMSE (vs human) ──► trusted labels ──► NDCG (vs labels) ──► trusted ranker

NMSE caps NDCG validity. If the judge disagrees with humans, NDCG-measured ranker wins may be illusory.
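The downstream half of that chain can be sketched with a textbook NDCG implementation (graded gains, log2 discount). This is the standard formulation, not necessarily Dash's exact variant, and the relevance lists are illustrative:

```python
import math

def dcg(rels):
    # Graded-gain DCG: (2^rel - 1) / log2(rank + 1), ranks starting at 1.
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(ranked_rels):
    """NDCG of a ranker's ordering, given per-document relevance labels.
    1.0 means the ordering is ideal under those labels."""
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal else 0.0

# Labels here come from the judge; if the judge disagrees with humans
# (high NMSE), this score is computed against the wrong target.
print(ndcg([3, 2, 3, 0, 1]))
```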

Tradeoffs

  • Still a single scalar. Hides directional bias (does the judge consistently over-rate long documents?) and failure-mode shape. Dash pairs NMSE with patterns/behavior-discrepancy-sampling to recover direction.
  • Ignores output-validity failures. A malformed JSON response can't be scored; Dash treats those as fully incorrect. This makes JSON validity rate a separate, co-equal axis rather than a deduction from NMSE.
  • Rubric-anchored. Garbage rubric → NMSE converges fast to the wrong target.
