PATTERN

Human-aligned criteria refinement loop

Intent

Bootstrap an LLM-as-judge system to human-grade reliability — and keep it there across updates — by running a continuous calibrate-compare-refine loop against a curated human-rated evaluation set:

  1. Human raters score a selected chat / item / output set on the rubric.
  2. The LLM judge scores the same set.
  3. Misalignments drive refinement:
     • Preferred: refine criterion prompt + definition (primary lever, applied frequently).
     • Last resort: redesign the criteria structure (rare, used only when prompt-level refinement fails).
  4. Re-run until alignment is strong (a minimal agreement check is sketched below).
  5. Reuse the same loop as a regression harness on every judge update (new model, new prompt, new criterion added).

Instacart's LACE canonicalises this pattern for chatbot-evaluation rubrics (Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm).
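
A minimal sketch of the comparison step in this loop, assuming binary True/False criteria in the LACE style. The helper names and the 0.85 threshold are illustrative, not taken from the Instacart post:

```python
# Hypothetical helpers: compare human and judge ratings per criterion and flag
# the criteria whose judge-human agreement falls below a refinement threshold.
REFINE_THRESHOLD = 0.85  # illustrative target, not a figure from the source

def per_criterion_agreement(human_ratings, judge_ratings, criteria):
    """Fraction of items where the judge matches the human label, per criterion.

    human_ratings / judge_ratings: dict[item_id][criterion] -> bool
    """
    agreement = {}
    for criterion in criteria:
        matches = sum(
            human[criterion] == judge_ratings[item_id][criterion]
            for item_id, human in human_ratings.items()
        )
        agreement[criterion] = matches / len(human_ratings)
    return agreement

def misaligned_criteria(human_ratings, judge_ratings, criteria):
    """Criteria that currently need lever-1 (or, rarely, lever-2) refinement."""
    scores = per_criterion_agreement(human_ratings, judge_ratings, criteria)
    return {c: s for c, s in scores.items() if s < REFINE_THRESHOLD}
```

Whatever the real implementation looks like, the key property is that exactly the same comparison runs at bootstrap time and again on every later judge update.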

Structure

      (one-time) curate human-rated ground-truth set
       ┌────────────────────┴──────────────────────┐
       ▼                                           ▼
  Human rates the                          LLM judge rates
  set on the rubric                        the same set on
                                           the rubric
       │                                           │
       └──────────────────┬────────────────────────┘
              Compute alignment per criterion
                 large misalignment?
                  ┌───────┴───────┐
                  ▼               ▼
          refine criterion       (rare) redesign the
          prompt + definition    criteria structure
            re-run, re-measure, repeat
     strong alignment reached → framework is bootstrapped
     on every future LACE update, re-run the same loop
     as a regression harness
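
The regression-harness step at the bottom of the diagram can be as simple as a gate that fails a judge update when agreement on the frozen ground-truth set drops. A sketch, reusing the per_criterion_agreement helper above and an illustrative floor:

```python
REGRESSION_FLOOR = 0.85  # illustrative: the agreement level reached at bootstrap time

def assert_no_alignment_regression(human_ratings, judge_ratings, criteria):
    """Run after any judge update: new model, new prompt, or a new criterion."""
    agreement = per_criterion_agreement(human_ratings, judge_ratings, criteria)
    regressions = {c: s for c, s in agreement.items() if s < REGRESSION_FLOOR}
    if regressions:
        raise AssertionError(
            "Judge update regressed alignment on: "
            + ", ".join(f"{c} ({s:.2f})" for c, s in sorted(regressions.items()))
        )
```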

Two-lever hierarchy

The refinement lever ordering is load-bearing:

  1. Refine existing criteria (cheap, reversible; see the prompt sketch below)
     • Tighten definitions
     • Add few-shot exemplars where confusion exists
     • Re-word ambiguous criterion language
     • Add operational context (business rules, edge cases) to the criterion prompt
     • Instacart: "our primary mechanism... applied frequently."

  2. Redesign criteria structure (expensive, invalidates data)
     • Split overlapping criteria
     • Merge redundant criteria
     • Replace a criterion with a better-framed one
     • Remove a criterion whose ceiling is human inter-rater noise
     • Instacart: "used sparingly, only when simpler refinements weren't sufficient."

The ordering matters because rubric redesign invalidates accumulated evaluation history — dashboards reset, trends break, A/B-test comparisons across the redesign boundary are suspect.
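
As a concrete illustration of lever 1, here is what a single criterion prompt might look like before and after tightening the definition, adding an exemplar, and encoding an operational edge case. The criterion and all text below are hypothetical, not LACE's actual rubric:

```python
# Entirely illustrative lever-1 refinement of one criterion prompt.
CRITERION_V1 = """\
Criterion: resolution
Definition: Did the bot resolve the customer's issue?
Answer True or False.
"""

CRITERION_V1_REFINED = """\
Criterion: resolution
Definition: Did the bot fully resolve the customer's stated issue within this
conversation, without requiring the customer to take further unassisted steps?

Operational note: escalating to a human agent counts as False, even if the
escalation itself was handled well.

Example (label: False):
  Customer: "My order is missing an item."
  Bot: "Sorry about that! Please contact support to request a refund."
  Rationale: the bot deflected; the issue is not resolved in-conversation.

Answer True or False, followed by a one-sentence rationale.
"""
```

Nothing structural changed here: the criterion set, and therefore all historical scores, stay comparable.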

When to use

  • Bootstrapping any LLM-as-judge system for a domain-specific rubric where off-the-shelf benchmarks don't capture what the product cares about.
  • Regression-testing judge updates — model upgrades, prompt rewrites, criterion additions — against a stable ground truth.
  • Closing the loop on subjective dimensions where the only ground truth is human opinion.

Contrast with patterns/human-calibrated-llm-labeling

Dropbox Dash's patterns/human-calibrated-llm-labeling is structurally similar but aimed at a different object:

Aspect              Human-calibrated LLM labelling (Dash)   Human-aligned criteria refinement (LACE)
Who is calibrated?  The judge (scores training data)        The judge (scores production outputs)
Downstream use      labels training data for a ranker       scores chatbot sessions for dashboards + experimentation
Criterion type      graded relevance (0–4)                  binary True/False across five dimensions
Refinement object   judge prompt + DSPy optimiser           criterion prompts + (rarely) rubric structure

In both patterns, humans calibrate and the judge labels at scale: a force multiplier on scarce human attention.

Tradeoffs / gotchas

  • The ground-truth set itself can drift. Product surface changes, new features, new user cohorts → yesterday's human-rated set no longer represents today's production traffic. Refresh periodically.
  • Inter-rater agreement ceiling. On subjective criteria, humans themselves disagree. Alignment target should be inter-rater agreement, not 100%. If you're trying to push judge-human alignment past human-human alignment, you're chasing noise.
  • Criterion redesign loses history. Structurally changing a criterion mid-flight resets the evaluation-trend data for that criterion; dashboards need a visible "v2 rubric from yyyy-mm-dd" annotation.
  • Refinement can overfit the ground-truth set. If the curated set is small, continuous refinement can teach the judge the specific set, not the underlying judgment skill. Hold back a fraction as a true test set; rotate (see the sketch after this list).
  • Criteria overlap confounds the loop. If two criteria cover overlapping ground, refining one changes the other's measured alignment without anyone touching it. LACE's stance: accept some overlap (the rationale output makes the primary cause auditable) rather than chase impossible rubric orthogonality.
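
Two of these gotchas, overfitting the curated set and the inter-rater ceiling, lend themselves to small mechanical guards. A sketch under the same binary-label assumption, with hypothetical names:

```python
import random

def split_ground_truth(item_ids, holdout_fraction=0.2, seed=0):
    """Refine against the 'refine' split; report alignment only on 'holdout'."""
    ids = sorted(item_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * (1 - holdout_fraction))
    return {"refine": ids[:cut], "holdout": ids[cut:]}

def alignment_ceiling(rater_a, rater_b):
    """Human-human agreement on shared items: the realistic target for the judge.

    rater_a / rater_b: dict[item_id] -> bool for one criterion.
    """
    shared = rater_a.keys() & rater_b.keys()
    return sum(rater_a[i] == rater_b[i] for i in shared) / len(shared)
```

If judge-human agreement on the holdout approaches alignment_ceiling, further refinement is chasing noise.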

Seen in

  • sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm — canonical wiki instance at Instacart LACE: "human evaluators rated a carefully selected set of customer conversations using the same criteria as our LACE system. We then compared their ratings to those generated by LACE. When we identified misalignments, we used this feedback to refine our evaluation framework in two ways... We repeated this human-LACE comparison and refinement cycle multiple times until we achieved strong alignment."