
CONCEPT

Human–LLM evaluation alignment

Definition

Human–LLM evaluation alignment is the measured agreement between an LLM judge's verdicts and a human rater's verdicts on the same inputs, using the same rubric. It is a first-class quality axis for any LLM-as-judge system: the judge's accuracy on the ground truth the product actually cares about (the human's opinion, not an external benchmark).

Alignment is:

  1. Measured — compute agreement (exact match, Cohen's kappa, NMSE for graded scores, a per-criterion breakdown for multi-criterion rubrics); see the sketch after this list.
  2. Driven to a target — through iterative refinement of the judge's rubric + prompts.
  3. Regression-tested — re-measured on every judge update (new model, new prompt, new criterion).
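
A minimal Python sketch of what "measured" can look like in practice. The verdict lists, the per-criterion dictionary shape, and the NMSE normalisation (MSE divided by the variance of the human scores) are assumptions for illustration, not the exact formulas LACE or Dash use.

    from collections import Counter

    def exact_match(judge, human):
        """Fraction of examples where judge and human verdicts agree exactly."""
        return sum(j == h for j, h in zip(judge, human)) / len(human)

    def cohens_kappa(judge, human):
        """Chance-corrected agreement for categorical verdicts."""
        n = len(human)
        observed = exact_match(judge, human)
        judge_counts, human_counts = Counter(judge), Counter(human)
        chance = sum(
            (judge_counts[label] / n) * (human_counts[label] / n)
            for label in set(judge) | set(human)
        )
        return (observed - chance) / (1 - chance)

    def nmse(judge, human):
        """MSE of graded scores divided by the variance of the human scores
        (one common normalisation; the exact convention is an assumption here)."""
        n = len(human)
        mean_h = sum(human) / n
        mse = sum((j - h) ** 2 for j, h in zip(judge, human)) / n
        var_h = sum((h - mean_h) ** 2 for h in human) / n
        return mse / var_h

    def per_criterion_agreement(judge_verdicts, human_verdicts):
        """Per-criterion breakdown for a multi-criterion rubric: report each
        criterion separately instead of a single holistic number."""
        return {
            criterion: cohens_kappa(judge_verdicts[criterion], human_verdicts[criterion])
            for criterion in human_verdicts
        }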

Why it matters

An LLM judge that scores 99% on an external benchmark is useless if it disagrees with your humans on what "good" means for your product. "For our LLM-based evaluations to drive meaningful improvements, they must closely mirror human assessments" (Instacart LACE).

The insight is that human opinion is the ground truth for product-quality judgements — not an academic benchmark, not the judge model's pretraining distribution. Alignment gives you a measurable signal on how close your judge is to that truth.

Two levers for closing alignment gaps

When judge ≠ human, Instacart's LACE team uses two levers (ordered by frequency):

  1. Refine existing criteria — improve the criterion's definition and its prompt text so the judge has less ambiguity. This was "our primary mechanism for improving alignment" and was applied frequently.
  2. Redesign the criteria structure — replace a criterion or dimension with a better-scoped one. "More involved and used sparingly, only when simpler refinements weren't sufficient."

The hierarchy is important: prompt refinement is cheap and reversible; rubric redesign invalidates accumulated evaluation data.
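
A hypothetical illustration of the two levers; the criterion names and prompt wording below are invented for illustration, not LACE's actual rubric.

    # Lever 1 (frequent): refine an existing criterion's prompt text in place.
    RESOLUTION_PROMPT_V1 = "Did the bot resolve the customer's issue?"  # ambiguous about partial help
    RESOLUTION_PROMPT_V2 = (
        "Did the bot fully resolve the customer's stated issue within this session? "
        "Answer 'yes' only if no follow-up contact is needed, "
        "'partial' if the bot helped but a hand-off to a human agent was still required, "
        "otherwise 'no'."
    )

    # Lever 2 (rare): redesign the criteria structure, e.g. split an overloaded
    # criterion into narrower ones. Scores collected under the old structure no
    # longer map onto the new one, which is why this invalidates accumulated data.
    CRITERIA_BEFORE = ["resolution", "helpfulness"]
    CRITERIA_AFTER = ["resolution", "empathy", "policy_compliance"]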

(Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm.)

Dropbox Dash's parallel framing

Dropbox Dash's relevance judge measures alignment as NMSE on a graded scale and separately tracks structured-output reliability as an orthogonal axis. Dash uses humans to calibrate the judge, then lets the judge label the training set at scale, a force multiplier of roughly 100× over humans labelling training data directly (Source: sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search).
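
A sketch of that calibrate-then-amplify pattern under stated assumptions: the llm_judge callable, the alignment threshold, and the dataset shapes are hypothetical, and nmse is the helper from the earlier sketch.

    def label_training_set(llm_judge, calibration_set, unlabeled_pool, max_nmse=0.2):
        """Calibrate the judge on a small human-rated set, then let it label at scale."""
        # 1. Calibrate: score the human-rated set and measure alignment.
        judge_scores = [llm_judge(example) for example, _ in calibration_set]
        human_scores = [score for _, score in calibration_set]
        if nmse(judge_scores, human_scores) > max_nmse:
            raise RuntimeError("Judge not aligned with human raters; refine before scaling out")

        # 2. Amplify: the aligned judge labels the large pool (roughly 100x the
        #    volume humans could label directly, per the Dash write-up).
        return [(example, llm_judge(example)) for example in unlabeled_pool]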

Shape of the alignment loop

       curated human-rated eval set
    ┌───────────────────────────┐
    │ LACE / judge scores the   │
    │ same set                  │
    └─────────────┬─────────────┘
          diff per criterion
        ┌─────────┴─────────┐
        │                   │
        ▼                   ▼
    refine             (rare) redesign
    criterion          rubric structure
    prompt
        │                   │
        └─────────┬─────────┘
      re-run, measure, repeat
      until alignment is strong

Once strong alignment is achieved, the loop becomes the regression harness: every judge update re-runs the same human-rated set and flags alignment drops before shipping. See patterns/human-aligned-criteria-refinement-loop.
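
A sketch of what that regression harness might look like; the per-criterion floors, the judge signature, and the check layout are assumptions rather than a documented LACE or Dash interface, and cohens_kappa is the helper from the first sketch.

    # Per-criterion alignment floors, set to the levels reached after the last
    # refinement pass (values here are illustrative).
    ALIGNMENT_FLOOR = {"resolution": 0.90, "empathy": 0.85}

    def check_judge_alignment(judge, human_rated_set):
        """Fail a judge update (new model, prompt, or criterion) if any
        per-criterion agreement drops below its recorded floor."""
        for criterion, floor in ALIGNMENT_FLOOR.items():
            examples = human_rated_set[criterion]
            judge_verdicts = [judge(example, criterion) for example, _ in examples]
            human_verdicts = [verdict for _, verdict in examples]
            kappa = cohens_kappa(judge_verdicts, human_verdicts)
            assert kappa >= floor, f"Alignment drop on '{criterion}': {kappa:.2f} < {floor}"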

Tradeoffs / gotchas

  • Human raters disagree with each other. The alignment ceiling is inter-rater agreement, not 100%. For subjective (tier-3) criteria, human inter-rater agreement itself may be the limiting factor; LACE explicitly de-prioritises these rather than chasing a noisy ceiling (see the sketch after this list).
  • The ground-truth set can drift. If the human-rated set was curated a year ago and the product surface has changed, the judge can be perfectly aligned to a stale rubric. Refresh the set when the product changes.
  • Alignment is per-criterion, not holistic. A judge can score 0.95 on the holistic session score while disagreeing badly on a specific criterion the product cares about most. Always report per-criterion breakdowns.
  • Calibration is bidirectional. Sometimes the human is wrong. The alignment loop surfaces not just judge errors but also ambiguous rubric wording that the humans themselves can't consistently apply.
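
A small illustration of measuring the judge against the human ceiling; the doubly-rated subset and the comparison shown are assumptions for illustration, reusing cohens_kappa from the first sketch.

    def alignment_vs_ceiling(judge, human_a, human_b):
        """Compare judge-human agreement against the human-human ceiling on a
        doubly-rated subset; a judge near the ceiling is effectively aligned."""
        ceiling = cohens_kappa(human_a, human_b)    # inter-rater agreement
        judge_kappa = cohens_kappa(judge, human_a)  # judge vs. the primary rater
        return judge_kappa, ceiling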

Seen in
