Human–LLM evaluation alignment¶
Definition¶
Human–LLM evaluation alignment is the measured agreement between an LLM judge's verdicts and a human rater's verdicts on the same inputs using the same rubric. It is a first-class quality axis for any LLM-as-judge system: the judge's accuracy against the ground truth the product actually cares about (the human's opinion, not an external benchmark).
Alignment is:
- Measured — compute agreement (exact match, Cohen's kappa, NMSE for graded scores, dimensional breakdown for multi-criterion rubrics); see the agreement sketch after this list.
- Driven to a target — through iterative refinement of the judge's rubric + prompts.
- Regression-tested — re-measured on every judge update (new model, new prompt, new criterion).
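A minimal sketch of the measurement step, assuming binary pass/fail verdicts; scikit-learn's `cohen_kappa_score` supplies the chance-corrected statistic, and all the data below is illustrative:

```python
from sklearn.metrics import cohen_kappa_score

# Verdicts from the human rater and the LLM judge on the same inputs,
# scored against the same rubric (illustrative data).
human = ["pass", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "fail", "fail", "pass", "fail", "pass"]

# Exact match: fraction of inputs where judge and human agree outright.
exact_match = sum(h == j for h, j in zip(human, judge)) / len(human)

# Cohen's kappa: agreement corrected for chance, which matters when one
# verdict dominates (e.g. mostly "pass").
kappa = cohen_kappa_score(human, judge)

print(f"exact match: {exact_match:.2f}, Cohen's kappa: {kappa:.2f}")
```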
Why it matters¶
An LLM judge that scores 99% on an external benchmark is useless if it disagrees with your humans on what "good" means for your product. "For our LLM-based evaluations to drive meaningful improvements, they must closely mirror human assessments" (Instacart LACE).
The insight is that human opinion is the ground truth for product-quality judgements — not an academic benchmark, not the judge model's pretraining distribution. Alignment gives you a measurable signal on how close your judge is to that truth.
Two levers for closing alignment gaps¶
When judge ≠ human, Instacart's LACE team uses two levers (ordered by frequency):
- Refine existing criteria — improve the criterion's definition and its prompt text so the judge has less ambiguity. This was the "primary mechanism for improving alignment and was applied frequently."
- Redesign the criteria structure — replace a criterion or dimension with a better-scoped one. "More involved and used sparingly, only when simpler refinements weren't sufficient."
The hierarchy is important: prompt refinement is cheap and reversible; rubric redesign invalidates accumulated evaluation data.
(Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm.)
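As a hypothetical illustration of the first, cheaper lever (the criterion name and prompt wording below are invented, not taken from LACE), refinement keeps the criterion's identity and score scale but tightens its prompt text, so accumulated scores stay comparable:

```python
# Hypothetical example of lever 1: same criterion, same scale,
# tighter prompt text. Nothing here is from Instacart's actual rubric.
criterion_v1 = {
    "name": "resolution",
    # Ambiguous: partial resolutions, hand-offs to human agents, and
    # multi-issue conversations all score inconsistently under this.
    "prompt": "Did the bot resolve the customer's issue?",
}

criterion_v2 = {
    "name": "resolution",  # identity unchanged: old scores stay comparable
    "prompt": (
        "Did the bot fully resolve the customer's stated issue within this "
        "conversation? Score 'pass' only if no follow-up was needed. A "
        "hand-off to a human agent is 'fail' regardless of tone."
    ),
}
```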
Dropbox Dash's parallel framing¶
Dropbox Dash's relevance judge measures alignment as NMSE on a graded scale and tracks structured-output reliability separately, as an orthogonal axis. Dash uses humans to calibrate the judge, then lets the judge label the training set at scale — a force multiplier of roughly 100× over humans labelling training data directly (Source: sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search).
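A sketch of the graded-score variant, assuming scores on a 0–4 relevance scale; normalising MSE by the variance of the human scores is one common convention, and the source doesn't spell out which normaliser Dash uses:

```python
import numpy as np

# Graded relevance scores on the same items, e.g. a 0-4 scale
# (illustrative data).
human = np.array([4, 2, 3, 0, 1, 4, 2])
judge = np.array([3, 2, 3, 1, 1, 4, 3])

# NMSE: mean squared error normalised by the variance of the human
# scores, so 0.0 is perfect alignment and 1.0 is no better than always
# predicting the human mean. (One common convention, assumed here.)
nmse = np.mean((judge - human) ** 2) / np.var(human)
print(f"NMSE: {nmse:.3f}")
```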
Shape of the alignment loop¶
curated human-rated eval set
             │
             ▼
┌─────────────────────────┐
│ LACE / judge scores     │
│ the same set            │
└────────────┬────────────┘
             │
     diff per criterion
             │
     ┌───────┴───────┐
     │               │
     ▼               ▼
refine criterion   redesign rubric
prompt (frequent)  structure (rare)
     │               │
     └───────┬───────┘
             │
             ▼
  re-run, measure, repeat
 until alignment is strong
Once strong alignment is achieved, the loop becomes the regression harness: every judge update re-runs the same human-rated set and flags alignment drops before shipping. See patterns/human-aligned-criteria-refinement-loop.
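One way that regression gate could look in code; `run_judge`, the baseline value, and the tolerance are all placeholders, not anything from the Instacart or Dropbox write-ups:

```python
# Hypothetical regression gate: every judge update re-scores the same
# human-rated set, and the build fails if alignment drops.
from sklearn.metrics import cohen_kappa_score

BASELINE_KAPPA = 0.82  # measured when the judge was last signed off
TOLERANCE = 0.05       # allowed drop before the update is blocked

def check_alignment(run_judge, eval_set):
    """eval_set: list of (input, human_verdict) pairs, held fixed."""
    human = [verdict for _, verdict in eval_set]
    judge = [run_judge(item) for item, _ in eval_set]
    kappa = cohen_kappa_score(human, judge)
    assert kappa >= BASELINE_KAPPA - TOLERANCE, (
        f"alignment regression: kappa {kappa:.2f} "
        f"vs baseline {BASELINE_KAPPA:.2f}"
    )
    return kappa
```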
Tradeoffs / gotchas¶
- Human raters disagree with each other. The alignment ceiling is inter-rater agreement, not 100%. For subjective (tier-3) criteria, human inter-rater agreement itself may be the limiting factor; LACE explicitly de-prioritises these rather than chasing a noisy ceiling.
- The ground-truth set can drift. If the human-rated set was curated a year ago and the product surface has changed, the judge can be perfectly aligned to a stale rubric. Refresh the set when the product changes.
- Alignment is per-criterion, not holistic. A judge can score 0.95 on the holistic session score while disagreeing badly on the specific criterion the product cares about most. Always report per-criterion breakdowns (see the sketch after this list).
- Calibration is bidirectional. Sometimes the human is wrong. The alignment loop surfaces not just judge errors but also ambiguous rubric wording that the humans themselves can't consistently apply.
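A sketch of the per-criterion breakdown from the third gotcha, with invented verdicts: "accuracy" aligns perfectly while "tone" collapses to chance-level agreement, exactly the failure a holistic score would hide:

```python
# Per-criterion breakdown (illustrative data): a healthy holistic score
# can hide a badly misaligned criterion.
from sklearn.metrics import cohen_kappa_score

# criterion -> (human_verdicts, judge_verdicts) on the same inputs
verdicts = {
    "accuracy": (["pass", "pass", "fail", "pass"],
                 ["pass", "pass", "fail", "pass"]),  # kappa = 1.0
    "tone":     (["pass", "fail", "pass", "pass"],
                 ["pass", "pass", "pass", "pass"]),  # kappa = 0.0
}

for criterion, (human, judge) in verdicts.items():
    kappa = cohen_kappa_score(human, judge)
    print(f"{criterion:10s} kappa={kappa:.2f}")
```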
Seen in¶
- sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm — canonical wiki instance at Instacart LACE: "implemented an iterative validation process where human evaluators rated a carefully selected set of customer conversations using the same criteria as our LACE system... repeated this human-LACE comparison and refinement cycle multiple times."
- sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search — Dropbox Dash humans-calibrate-the-judge pattern, same structural shape applied to training-data labelling.
- sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy — Dash's DSPy-optimised relevance judge measures alignment via NMSE + reliability as orthogonal axes.