PATTERN

Human-aligned criteria refinement loop

Intent

Bootstrap an LLM-as-judge system to human-grade reliability — and keep it there across updates — by running a continuous calibrate-compare-refine loop against a curated human-rated evaluation set:

  1. Human raters score a selected chat / item / output set on the rubric.
  2. The LLM judge scores the same set.
  3. Misalignments drive refinement:
     • Preferred: refine criterion prompt + definition (primary lever, applied frequently).
     • Last resort: redesign the criteria structure (rare, used only when prompt-level refinement fails).
  4. Re-run until alignment is strong (a minimal agreement check is sketched below).
  5. Reuse the same loop as a regression harness on every judge update (new model, new prompt, new criterion added).

Instacart's LACE canonicalises this pattern for chatbot-evaluation rubrics (Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm).
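
A minimal sketch of the comparison step in this loop, assuming binary True/False criteria in the LACE style. The helper names and the 0.85 threshold are illustrative, not taken from the Instacart post:

```python
# Hypothetical helpers: compare human and judge ratings per criterion and flag
# the criteria whose judge-human agreement falls below a refinement threshold.
REFINE_THRESHOLD = 0.85  # illustrative target, not a figure from the source

def per_criterion_agreement(human_ratings, judge_ratings, criteria):
    """Fraction of items where the judge matches the human label, per criterion.

    human_ratings / judge_ratings: dict[item_id][criterion] -> bool
    """
    agreement = {}
    for criterion in criteria:
        matches = sum(
            human[criterion] == judge_ratings[item_id][criterion]
            for item_id, human in human_ratings.items()
        )
        agreement[criterion] = matches / len(human_ratings)
    return agreement

def misaligned_criteria(human_ratings, judge_ratings, criteria):
    """Criteria that currently need lever-1 (or, rarely, lever-2) refinement."""
    scores = per_criterion_agreement(human_ratings, judge_ratings, criteria)
    return {c: s for c, s in scores.items() if s < REFINE_THRESHOLD}
```

Whatever the real implementation looks like, the key property is that exactly the same comparison runs at bootstrap time and again on every later judge update.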

Structure

      (one-time) curate human-rated ground-truth set
       ┌────────────────────┴──────────────────────┐
       ▼                                           ▼
  Human rates the                          LLM judge rates
  set on the rubric                        the same set on
                                           the rubric
       │                                           │
       └──────────────────┬────────────────────────┘
              Compute alignment per criterion
                 large misalignment?
                  ┌───────┴───────┐
                  ▼               ▼
          refine criterion       (rare) redesign the
          prompt + definition    criteria structure
            re-run, re-measure, repeat
     strong alignment reached → framework is bootstrapped
     on every future LACE update, re-run the same loop
     as a regression harness
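
The regression-harness step at the bottom of the diagram can be as simple as a gate that fails a judge update when agreement on the frozen ground-truth set drops. A sketch, reusing the per_criterion_agreement helper above and an illustrative floor:

```python
REGRESSION_FLOOR = 0.85  # illustrative: the agreement level reached at bootstrap time

def assert_no_alignment_regression(human_ratings, judge_ratings, criteria):
    """Run after any judge update: new model, new prompt, or a new criterion."""
    agreement = per_criterion_agreement(human_ratings, judge_ratings, criteria)
    regressions = {c: s for c, s in agreement.items() if s < REGRESSION_FLOOR}
    if regressions:
        raise AssertionError(
            "Judge update regressed alignment on: "
            + ", ".join(f"{c} ({s:.2f})" for c, s in sorted(regressions.items()))
        )
```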

Two-lever hierarchy

The refinement lever ordering is load-bearing:

  1. Refine existing criteria (cheap, reversible; see the prompt sketch below)
     • Tighten definitions
     • Add few-shot exemplars where confusion exists
     • Re-word ambiguous criterion language
     • Add operational context (business rules, edge cases) to the criterion prompt
     • Instacart: "our primary mechanism... applied frequently."

  2. Redesign criteria structure (expensive, invalidates data)
     • Split overlapping criteria
     • Merge redundant criteria
     • Replace a criterion with a better-framed one
     • Remove a criterion whose ceiling is human inter-rater noise
     • Instacart: "used sparingly, only when simpler refinements weren't sufficient."

The ordering matters because rubric redesign invalidates accumulated evaluation history — dashboards reset, trends break, A/B-test comparisons across the redesign boundary are suspect.
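
As a concrete illustration of lever 1, here is what a single criterion prompt might look like before and after tightening the definition, adding an exemplar, and encoding an operational edge case. The criterion and all text below are hypothetical, not LACE's actual rubric:

```python
# Entirely illustrative lever-1 refinement of one criterion prompt.
CRITERION_V1 = """\
Criterion: resolution
Definition: Did the bot resolve the customer's issue?
Answer True or False.
"""

CRITERION_V1_REFINED = """\
Criterion: resolution
Definition: Did the bot fully resolve the customer's stated issue within this
conversation, without requiring the customer to take further unassisted steps?

Operational note: escalating to a human agent counts as False, even if the
escalation itself was handled well.

Example (label: False):
  Customer: "My order is missing an item."
  Bot: "Sorry about that! Please contact support to request a refund."
  Rationale: the bot deflected; the issue is not resolved in-conversation.

Answer True or False, followed by a one-sentence rationale.
"""
```

Nothing structural changed here: the criterion set, and therefore all historical scores, stay comparable.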

When to use

  • Bootstrapping any LLM-as-judge system for a domain-specific rubric where off-the-shelf benchmarks don't capture what the product cares about.
  • Regression-testing judge updates — model upgrades, prompt rewrites, criterion additions — against a stable ground truth.
  • Closing the loop on subjective dimensions where the only ground truth is human opinion.

Contrast with patterns/human-calibrated-llm-labeling

Dropbox Dash's patterns/human-calibrated-llm-labeling is structurally similar but aimed at a different object:

Aspect              Human-calibrated LLM labelling (Dash)   Human-aligned criteria refinement (LACE)
Who is calibrated?  The judge (scores training data)        The judge (scores production outputs)
Downstream use      labels training data for a ranker       scores chatbot sessions for dashboards + experimentation
Criterion type      graded relevance (0–4)                  binary True/False across five dimensions
Refinement object   judge prompt + DSPy optimiser           criterion prompts + (rarely) rubric structure

In both patterns, humans calibrate and the judge labels at scale: a force multiplier on scarce human attention.

Tradeoffs / gotchas

  • The ground-truth set itself can drift. Product surface changes, new features, new user cohorts → yesterday's human-rated set no longer represents today's production traffic. Refresh periodically.
  • Inter-rater agreement ceiling. On subjective criteria, humans themselves disagree. Alignment target should be inter-rater agreement, not 100%. If you're trying to push judge-human alignment past human-human alignment, you're chasing noise.
  • Criterion redesign loses history. Structurally changing a criterion mid-flight resets the evaluation-trend data for that criterion; dashboards need a visible "v2 rubric from yyyy-mm-dd" annotation.
  • Refinement can overfit the ground-truth set. If the curated set is small, continuous refinement can teach the judge the specific set, not the underlying judgment skill. Hold back a fraction as a true test set; rotate (see the sketch after this list).
  • Criteria overlap confounds the loop. If two criteria cover overlapping ground, refining one changes the other's measured alignment without anyone touching it. LACE's stance: accept some overlap (the rationale output makes the primary cause auditable) rather than chase impossible rubric orthogonality.
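
Two of these gotchas, overfitting the curated set and the inter-rater ceiling, lend themselves to small mechanical guards. A sketch under the same binary-label assumption, with hypothetical names:

```python
import random

def split_ground_truth(item_ids, holdout_fraction=0.2, seed=0):
    """Refine against the 'refine' split; report alignment only on 'holdout'."""
    ids = sorted(item_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * (1 - holdout_fraction))
    return {"refine": ids[:cut], "holdout": ids[cut:]}

def alignment_ceiling(rater_a, rater_b):
    """Human-human agreement on shared items: the realistic target for the judge.

    rater_a / rater_b: dict[item_id] -> bool for one criterion.
    """
    shared = rater_a.keys() & rater_b.keys()
    return sum(rater_a[i] == rater_b[i] for i in shared) / len(shared)
```

If judge-human agreement on the holdout approaches alignment_ceiling, further refinement is chasing noise.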

Seen in

  • sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm — canonical wiki instance at Instacart LACE: "human evaluators rated a carefully selected set of customer conversations using the same criteria as our LACE system. We then compared their ratings to those generated by LACE. When we identified misalignments, we used this feedback to refine our evaluation framework in two ways... We repeated this human-LACE comparison and refinement cycle multiple times until we achieved strong alignment."