PATTERN
Human-aligned criteria refinement loop¶
Intent¶
Bootstrap an LLM-as-judge system to human-grade reliability — and keep it there across updates — by running a continuous calibrate-compare-refine loop against a curated human-rated evaluation set:
- Human raters score a curated set of chats/items/outputs on the rubric.
- The LLM judge scores the same set.
- Misalignments drive refinement:
  - Preferred: refine the criterion prompt + definition (primary lever, applied frequently).
  - Last resort: redesign the criteria structure (rare; used only when prompt-level refinement fails).
- Re-run until alignment is strong.
- Reuse the same loop as a regression harness on every judge update (new model, new prompt, new criterion added).
Instacart's LACE canonicalises this pattern for chatbot-evaluation rubrics (Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm).
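The compare step of the loop can be sketched in a few lines. This is a minimal illustration, not LACE's implementation: the function names, the `(item_id, criterion)` keying, and the 0.9 alignment threshold are all assumptions; it assumes binary True/False verdicts as in LACE-style criteria.

```python
from collections import defaultdict

def alignment_by_criterion(human_labels, judge_labels):
    """Per-criterion agreement rate between human raters and the LLM judge.

    Both inputs map (item_id, criterion) -> bool verdicts on the same
    curated ground-truth set.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for key, human_verdict in human_labels.items():
        _, criterion = key
        totals[criterion] += 1
        hits[criterion] += int(judge_labels[key] == human_verdict)
    return {c: hits[c] / totals[c] for c in totals}

def misaligned(human_labels, judge_labels, threshold=0.9):
    """Criteria whose judge-human agreement falls below the threshold —
    the candidates for prompt-level refinement (lever 1)."""
    scores = alignment_by_criterion(human_labels, judge_labels)
    return sorted(c for c, s in scores.items() if s < threshold)
```

Each loop iteration refines the prompts for whatever `misaligned` returns, then re-scores and re-measures.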
Structure¶
(one-time) curate human-rated ground-truth set
                      │
         ┌────────────┴────────────┐
         ▼                         ▼
 Human raters score         LLM judge scores
 the set on the rubric      the same set on
         │                  the rubric
         │                         │
         └────────────┬────────────┘
                      ▼
        Compute alignment per criterion
                      │
             large misalignment?
            ┌─────────┴─────────┐
            ▼                   ▼
   refine criterion       (rare) redesign the
   prompt + definition    criteria structure
            │                   │
            └─────────┬─────────┘
                      ▼
        re-run, re-measure, repeat
                      │
                      ▼
 strong alignment reached → framework is bootstrapped
                      │
                      ▼
 on every future LACE update, re-run the same loop
 as a regression harness
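The regression-harness leg of the diagram can be sketched as a simple gate. Hedge: the function name, the `{criterion: alignment}` shape, and the 0.02 tolerance are illustrative assumptions, not anything the source specifies.

```python
def regression_check(baseline_alignment, new_alignment, tolerance=0.02):
    """Gate a judge update (new model, new prompt, new criterion).

    Every criterion's judge-human alignment on the ground-truth set must
    stay within `tolerance` of its baseline. Returns the regressions as
    {criterion: (baseline, new)}; an empty dict means safe to ship.
    """
    return {
        c: (baseline_alignment[c], new_alignment.get(c, 0.0))
        for c in baseline_alignment
        if new_alignment.get(c, 0.0) < baseline_alignment[c] - tolerance
    }
```

Running this in CI against the stored ground-truth set turns the one-time bootstrapping loop into a standing regression test.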
Two-lever hierarchy¶
The refinement lever ordering is load-bearing:
1. Refine existing criteria (cheap, reversible)
   - Tighten definitions
   - Add few-shot exemplars where confusion exists
   - Re-word ambiguous criterion language
   - Add operational context (business rules, edge cases) to the criterion prompt
   - Instacart: "our primary mechanism... applied frequently."
2. Redesign criteria structure (expensive, invalidates data)
   - Split overlapping criteria
   - Merge redundant criteria
   - Replace a criterion with a better-framed one
   - Remove a criterion whose ceiling is human inter-rater noise
   - Instacart: "used sparingly, only when simpler refinements weren't sufficient."
The ordering matters because rubric redesign invalidates accumulated evaluation history — dashboards reset, trends break, A/B-test comparisons across the redesign boundary are suspect.
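A hypothetical criterion record makes the two levers concrete. Everything here is an assumption for illustration — the field names, the example definition, and the versioning scheme are not from the source:

```python
# Lever 1 edits `definition` / `exemplars` and bumps `prompt_version`;
# evaluation history stays comparable because the criterion identity
# (`id`) is unchanged. Lever 2 changes the identity itself (e.g.
# "resolution" -> "resolution_v2"), which is why dashboards need the
# visible rubric-version annotation at the redesign boundary.
criterion = {
    "id": "resolution",       # stable across lever-1 refinements
    "prompt_version": 3,      # bumped on every prompt/definition tweak
    "definition": (
        "True if the bot fully resolved the customer's stated issue "
        "without requiring escalation."
    ),
    "exemplars": [            # few-shot cases added where the judge was confused
        {
            "chat": "…",      # placeholder; a real transcript goes here
            "verdict": False,
            "why": "Partial answer; customer re-asked the same question.",
        },
    ],
}
```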
When to use¶
- Bootstrapping any LLM-as-judge system for a domain-specific rubric where off-the-shelf benchmarks don't capture what the product cares about.
- Regression-testing judge updates — model upgrades, prompt rewrites, criterion additions — against a stable ground truth.
- Closing the loop on subjective dimensions where the only ground truth is human opinion.
Contrast with patterns/human-calibrated-llm-labeling¶
Dropbox Dash's patterns/human-calibrated-llm-labeling is structurally similar but aimed at a different object:
| Aspect | Human-calibrated LLM labelling (Dash) | Human-aligned criteria refinement (LACE) |
|---|---|---|
| Who is calibrated? | The judge (scores training data) | The judge (scores production outputs) |
| Downstream use | labels training data for a ranker | scores chatbot sessions for dashboards + experimentation |
| Criterion type | graded relevance (0–4) | binary True/False across five dimensions |
| Refinement object | judge prompt + DSPy optimiser | criterion prompts + (rarely) rubric structure |
Both patterns share the same shape — humans calibrate, the judge labels at scale — a force multiplier on scarce human attention.
Tradeoffs / gotchas¶
- The ground-truth set itself can drift. Product surface changes, new features, new user cohorts → yesterday's human-rated set no longer represents today's production traffic. Refresh periodically.
- Inter-rater agreement ceiling. On subjective criteria, humans themselves disagree. Alignment target should be inter-rater agreement, not 100%. If you're trying to push judge-human alignment past human-human alignment, you're chasing noise.
- Criterion redesign loses history. Structurally changing a criterion mid-flight resets the evaluation-trend data for that criterion; dashboards need a visible "v2 rubric from yyyy-mm-dd" annotation.
- Refinement can overfit the ground-truth set. If the curated set is small, continuous refinement can teach the judge the specific set, not the underlying judgment skill. Hold back a fraction as a true test set; rotate.
- Criteria overlap confounds the loop. If two criteria cover overlapping ground, refining one changes the other's measured alignment without anyone touching it. LACE's stance: accept some overlap (the rationale output makes the primary cause auditable) rather than chase impossible rubric orthogonality.
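The overfitting gotcha above suggests a held-out split of the human-rated set. A minimal sketch, assuming item IDs are hashable and a 20% holdout (both the function name and the fraction are illustrative):

```python
import random

def split_ground_truth(item_ids, test_fraction=0.2, seed=0):
    """Hold back a slice of the human-rated set as a true test set.

    Refine criterion prompts against `dev` only; report judge-human
    alignment on `test`, and rotate the split periodically so the judge
    learns the judgment skill rather than the specific items.
    """
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    ids = sorted(item_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * test_fraction)
    return ids[cut:], ids[:cut]  # (dev, test)
```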
Seen in¶
- sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm — canonical wiki instance at Instacart LACE: "human evaluators rated a carefully selected set of customer conversations using the same criteria as our LACE system. We then compared their ratings to those generated by LACE. When we identified misalignments, we used this feedback to refine our evaluation framework in two ways... We repeated this human-LACE comparison and refinement cycle multiple times until we achieved strong alignment."