
CONCEPT

Human–LLM evaluation alignment

Definition

Human–LLM evaluation alignment is the measured agreement between an LLM judge's verdicts and a human rater's verdicts on the same inputs, using the same rubric. It is a first-class quality axis for any LLM-as-judge system: the judge's accuracy on the ground truth the product actually cares about (the human's opinion, not an external benchmark).

Alignment is:

  1. Measured — compute agreement (exact match, Cohen's kappa, NMSE for graded scores, a per-criterion breakdown for multi-criterion rubrics); see the sketch after this list.
  2. Driven to a target — through iterative refinement of the judge's rubric + prompts.
  3. Regression-tested — re-measured on every judge update (new model, new prompt, new criterion).
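
A minimal Python sketch of what "measured" can look like in practice. The verdict lists, the per-criterion dictionary shape, and the NMSE normalisation (MSE divided by the variance of the human scores) are assumptions for illustration, not the exact formulas LACE or Dash use.

    from collections import Counter

    def exact_match(judge, human):
        """Fraction of examples where judge and human verdicts agree exactly."""
        return sum(j == h for j, h in zip(judge, human)) / len(human)

    def cohens_kappa(judge, human):
        """Chance-corrected agreement for categorical verdicts."""
        n = len(human)
        observed = exact_match(judge, human)
        judge_counts, human_counts = Counter(judge), Counter(human)
        chance = sum(
            (judge_counts[label] / n) * (human_counts[label] / n)
            for label in set(judge) | set(human)
        )
        return (observed - chance) / (1 - chance)

    def nmse(judge, human):
        """MSE of graded scores divided by the variance of the human scores
        (one common normalisation; the exact convention is an assumption here)."""
        n = len(human)
        mean_h = sum(human) / n
        mse = sum((j - h) ** 2 for j, h in zip(judge, human)) / n
        var_h = sum((h - mean_h) ** 2 for h in human) / n
        return mse / var_h

    def per_criterion_agreement(judge_verdicts, human_verdicts):
        """Per-criterion breakdown for a multi-criterion rubric: report each
        criterion separately instead of a single holistic number."""
        return {
            criterion: cohens_kappa(judge_verdicts[criterion], human_verdicts[criterion])
            for criterion in human_verdicts
        }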

Why it matters

An LLM judge that scores 99% on an external benchmark is useless if it disagrees with your humans on what "good" means for your product. "For our LLM-based evaluations to drive meaningful improvements, they must closely mirror human assessments" (Instacart LACE).

The insight is that human opinion is the ground truth for product-quality judgements — not an academic benchmark, not the judge model's pretraining distribution. Alignment gives you a measurable signal on how close your judge is to that truth.

Two levers for closing alignment gaps

When judge ≠ human, Instacart's LACE team uses two levers (ordered by frequency):

  1. Refine existing criteria — improve the criterion's definition and its prompt text so the judge has less ambiguity. This was "our primary mechanism for improving alignment" and was applied frequently.
  2. Redesign the criteria structure — replace a criterion or dimension with a better-scoped one. "More involved and used sparingly, only when simpler refinements weren't sufficient."

The hierarchy is important: prompt refinement is cheap and reversible; rubric redesign invalidates accumulated evaluation data.
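
A hypothetical illustration of the two levers; the criterion names and prompt wording below are invented for illustration, not LACE's actual rubric.

    # Lever 1 (frequent): refine an existing criterion's prompt text in place.
    RESOLUTION_PROMPT_V1 = "Did the bot resolve the customer's issue?"  # ambiguous about partial help
    RESOLUTION_PROMPT_V2 = (
        "Did the bot fully resolve the customer's stated issue within this session? "
        "Answer 'yes' only if no follow-up contact is needed, "
        "'partial' if the bot helped but a hand-off to a human agent was still required, "
        "otherwise 'no'."
    )

    # Lever 2 (rare): redesign the criteria structure, e.g. split an overloaded
    # criterion into narrower ones. Scores collected under the old structure no
    # longer map onto the new one, which is why this invalidates accumulated data.
    CRITERIA_BEFORE = ["resolution", "helpfulness"]
    CRITERIA_AFTER = ["resolution", "empathy", "policy_compliance"]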

(Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm.)

Dropbox Dash's parallel framing

Dropbox Dash's relevance judge measures alignment as NMSE on a graded scale and separately tracks structured-output reliability as an orthogonal axis. Dash uses humans to calibrate the judge, then lets the judge label the training set at scale, a force multiplier of roughly 100× over humans labelling training data directly (Source: sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search).
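
A sketch of that calibrate-then-amplify pattern under stated assumptions: the llm_judge callable, the alignment threshold, and the dataset shapes are hypothetical, and nmse is the helper from the earlier sketch.

    def label_training_set(llm_judge, calibration_set, unlabeled_pool, max_nmse=0.2):
        """Calibrate the judge on a small human-rated set, then let it label at scale."""
        # 1. Calibrate: score the human-rated set and measure alignment.
        judge_scores = [llm_judge(example) for example, _ in calibration_set]
        human_scores = [score for _, score in calibration_set]
        if nmse(judge_scores, human_scores) > max_nmse:
            raise RuntimeError("Judge not aligned with human raters; refine before scaling out")

        # 2. Amplify: the aligned judge labels the large pool (roughly 100x the
        #    volume humans could label directly, per the Dash write-up).
        return [(example, llm_judge(example)) for example in unlabeled_pool]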

Shape of the alignment loop

       curated human-rated eval set
    ┌───────────────────────────┐
    │ LACE / judge scores the   │
    │ same set                  │
    └─────────────┬─────────────┘
          diff per criterion
        ┌─────────┴─────────┐
        │                   │
        ▼                   ▼
    refine             (rare) redesign
    criterion          rubric structure
    prompt
        │                   │
        └─────────┬─────────┘
      re-run, measure, repeat
      until alignment is strong

Once strong alignment is achieved, the loop becomes the regression harness: every judge update re-runs the same human-rated set and flags alignment drops before shipping. See patterns/human-aligned-criteria-refinement-loop.
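
A sketch of what that regression harness might look like; the per-criterion floors, the judge signature, and the check layout are assumptions rather than a documented LACE or Dash interface, and cohens_kappa is the helper from the first sketch.

    # Per-criterion alignment floors, set to the levels reached after the last
    # refinement pass (values here are illustrative).
    ALIGNMENT_FLOOR = {"resolution": 0.90, "empathy": 0.85}

    def check_judge_alignment(judge, human_rated_set):
        """Fail a judge update (new model, prompt, or criterion) if any
        per-criterion agreement drops below its recorded floor."""
        for criterion, floor in ALIGNMENT_FLOOR.items():
            examples = human_rated_set[criterion]
            judge_verdicts = [judge(example, criterion) for example, _ in examples]
            human_verdicts = [verdict for _, verdict in examples]
            kappa = cohens_kappa(judge_verdicts, human_verdicts)
            assert kappa >= floor, f"Alignment drop on '{criterion}': {kappa:.2f} < {floor}"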

Tradeoffs / gotchas

  • Human raters disagree with each other. The alignment ceiling is inter-rater agreement, not 100%. For subjective (tier-3) criteria, human inter-rater agreement itself may be the limiting factor; LACE explicitly de-prioritises these rather than chasing a noisy ceiling (see the sketch after this list).
  • The ground-truth set can drift. If the human-rated set was curated a year ago and the product surface has changed, the judge can be perfectly aligned to a stale rubric. Refresh the set when the product changes.
  • Alignment is per-criterion, not holistic. A judge can score 0.95 on the holistic session score while disagreeing badly on a specific criterion the product cares about most. Always report per-criterion breakdowns.
  • Calibration is bidirectional. Sometimes the human is wrong. The alignment loop surfaces not just judge errors but also ambiguous rubric wording that the humans themselves can't consistently apply.
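
A small illustration of measuring the judge against the human ceiling; the doubly-rated subset and the comparison shown are assumptions for illustration, reusing cohens_kappa from the first sketch.

    def alignment_vs_ceiling(judge, human_a, human_b):
        """Compare judge-human agreement against the human-human ceiling on a
        doubly-rated subset; a judge near the ceiling is effectively aligned."""
        ceiling = cohens_kappa(human_a, human_b)    # inter-rater agreement
        judge_kappa = cohens_kappa(judge, human_a)  # judge vs. the primary rater
        return judge_kappa, ceiling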

Seen in
