NMSE (Normalized Mean Squared Error)¶
Normalized Mean Squared Error (NMSE) is the graded-scale judge-vs-human agreement metric used by Dropbox Dash's relevance judge (Source: sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy). It is MSE on the 1–5 relevance scale, rescaled to a 0–100 range.
- 0 = perfect agreement (judge matches every human label exactly).
- 100 = worst-case disagreement.
- Small disagreements (4 vs 5) contribute a small squared penalty; large disagreements (1 vs 5) contribute a much larger one — same quadratic shape as MSE.
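The definition above can be sketched in a few lines. This is an illustrative helper, not code from the Dash pipeline; the function and argument names are invented.

```python
def nmse(human, judge, scale_min=1, scale_max=5):
    """MSE on the 1-5 relevance scale, rescaled to 0-100.

    0 = judge matches every human label exactly;
    100 = worst-case disagreement on every pair.
    """
    worst = (scale_max - scale_min) ** 2  # 16 on a 1-5 scale
    mse = sum((h - j) ** 2 for h, j in zip(human, judge)) / len(human)
    return 100 * mse / worst

# Quadratic penalty: a 4-vs-5 miss costs 1/16 of a 1-vs-5 miss.
nmse([5], [4])  # -> 6.25
nmse([5], [1])  # -> 100.0
```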
Why "normalized"¶
Raw MSE on the 1–5 scale has range 0–16 (see the companion metric used in sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search). Rescaling to 0–100 makes successive optimisation runs easier to report as rounded scores and improvements as percentages: the gpt-oss-120b run went 8.83 → 4.86 (−45%), the gemma-3-12b run went 46.88 → 17.26 (−63%). Both numbers come directly from the same underlying per-example squared error, just projected onto a friendlier display scale.
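The quoted improvements are plain relative reductions of the NMSE score. A quick check, using the numbers from the source (the helper name is hypothetical):

```python
def pct_reduction(before, after):
    """Relative reduction of a score, as a percentage."""
    return 100 * (before - after) / before

round(pct_reduction(8.83, 4.86))    # gpt-oss-120b run -> 45
round(pct_reduction(46.88, 17.26))  # gemma-3-12b run  -> 63
```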
Why it's the optimiser objective¶
- Bounded. Both above and below; optimisation progress is reportable as a percentage.
- Smooth. The error signal is graded rather than binary pass/fail: a 4-vs-5 miss gives gradients (or DSPy's reflection feedback) a different shape than a 1-vs-5 miss.
- Human-interpretable. Unlike rank correlations, a 45% reduction in NMSE is easy to explain as "the judge is closer to humans on average."
Usage at Dash¶
- Objective fed to DSPy's GEPA and MIPROv2 optimisers when adapting the relevance judge across models.
- Also gates the human-calibrated LLM-labeling pipeline: the judge must hit a minimum NMSE against the human seed set before it's allowed to generate production training labels.
Relationship to NDCG¶
NMSE is a judge-vs-human metric: how close is the LLM rating to the human rating for each (query, doc) pair? NDCG is a ranker-vs-labels metric: how close is the ranker's ordering to the ideal ordering under the labels? The chain is: human labels calibrate the judge (measured by NMSE), and the judge's labels score the ranker (measured by NDCG).
NMSE therefore caps NDCG validity. If the judge disagrees with humans, NDCG-measured ranker wins may be illusory.
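To make that concrete, here is a minimal graded-relevance NDCG sketch (standard linear-gain formulation; the example labels are invented, not from the source). The same ranker ordering can score as perfect under a miscalibrated judge's labels while humans would rate it noticeably worse:

```python
import math

def dcg(rels):
    """Discounted cumulative gain over labels in ranked order."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels_in_ranked_order):
    """DCG normalised by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(rels_in_ranked_order, reverse=True))
    return dcg(rels_in_ranked_order) / ideal if ideal else 0.0

# Same ranker ordering, scored under two label sets.
human = [2, 5, 1]  # humans say the second result is best
judge = [5, 4, 1]  # a miscalibrated judge prefers the first

ndcg(judge)  # judge labels report a perfect ranking (1.0)
ndcg(human)  # human labels disagree (< 1.0)
```

When judge and human labels diverge (high NMSE), the judge-derived NDCG stops tracking the human-derived one, which is what "NMSE caps NDCG validity" means.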
Tradeoffs¶
- Still a single scalar. Hides directional bias (does the judge consistently over-rate long documents?) and failure-mode shape. Dash pairs NMSE with patterns/behavior-discrepancy-sampling to recover direction.
- Ignores output-validity failures. A malformed JSON response can't be scored; Dash treats those as fully incorrect. This makes JSON validity rate a separate, co-equal axis rather than a penalty folded into NMSE.
- Rubric-anchored. Garbage rubric → NMSE converges fast to the wrong target.
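A sketch of tracking the two axes side by side, assuming a response format like `{"rating": n}` (the format and helper name are assumptions, not from the source):

```python
import json

def score_batch(raw_responses, human_labels):
    """Report NMSE and JSON validity rate as separate axes.

    Malformed responses are scored as fully incorrect (worst-case
    squared error of 16 on the 1-5 scale) and counted against the
    validity rate, per the tradeoff noted above.
    """
    sq_errs, valid = [], 0
    for raw, h in zip(raw_responses, human_labels):
        try:
            rating = json.loads(raw)["rating"]
            sq_errs.append((h - rating) ** 2)
            valid += 1
        except (json.JSONDecodeError, KeyError, TypeError):
            sq_errs.append(16)  # unscoreable -> worst-case penalty
    return {
        "nmse": 100 * sum(sq_errs) / (16 * len(sq_errs)),
        "json_validity_rate": valid / len(raw_responses),
    }
```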
Seen in¶
- sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy: canonical source. Named metric for the DSPy model-adaptation work across o3/gpt-oss-120b/gemma-3-12b.
Related¶
- concepts/llm-as-judge — the pattern NMSE measures.
- concepts/relevance-labeling — the 1–5 scale NMSE lives on.
- concepts/ndcg — the downstream metric NMSE gates the validity of.
- concepts/structured-output-reliability — the co-equal reliability axis.
- systems/dspy — the optimiser driving NMSE down.
- patterns/prompt-optimizer-flywheel — the loop NMSE is the objective of.