PATTERN Cited by 1 source

LLM-as-judge multi-level rubric

LLM-as-judge multi-level rubric is the evaluation pattern of scoring a hierarchical generative artefact (a page containing sections, which in turn contain items) at each level of its hierarchy with a purpose-written rubric per level, rather than collapsing the evaluation into a single end-to-end score.

The pattern is distinct from single-rubric LLM-as-judge because different quality failures manifest at different hierarchical levels: a page can be incoherent even when every individual section is good; a section can be off-brand even when every product in it is relevant; a product can mismatch its section's title while the page as a whole scores well. A single-level rubric averages these failures out; a multi-level one localises them.

Shape

[generated artefact]
  ├─► page-level rubric  → judge#1  → (cohesion, coverage, diversity)
  ├─► section-level rubric  → judge#2  → (title quality, brand, user-preference alignment)
  └─► item-level rubric  → judge#3  → (recall, within-section thematic alignment)

Each level gets its own prompt, its own rubric, and its own calibration. Scores roll up for dashboards; failures route downward for fixing.
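
A minimal sketch of this shape, assuming a generic llm_call helper that returns JSON; the rubric texts, score fields, and function names are illustrative, not Instacart's actual prompts:

  import json

  # Illustrative per-level rubrics; each level gets its own prompt and its own
  # score dimensions. The texts below are assumptions, not the source's prompts.
  RUBRICS = {
      "page": "Rate the page's cohesion, coverage, and diversity from 1-5 each. Return JSON.",
      "section": "Rate this section's title quality, brand fit, and user-preference alignment from 1-5 each. Return JSON.",
      "item": "Rate how well this item matches its section's theme from 1-5. Return JSON.",
  }

  def judge(level, artefact_slice, llm_call):
      # Run the level-specific rubric against one slice of the artefact.
      prompt = f"{RUBRICS[level]}\n\nContent to evaluate:\n{artefact_slice}"
      return json.loads(llm_call(prompt))

  def evaluate_page(page, llm_call):
      # Score every level separately; keep per-level records so a regression
      # can be traced to the level where it occurred, then roll up for dashboards.
      return {
          "page": judge("page", json.dumps(page), llm_call),
          "sections": [judge("section", json.dumps(s), llm_call) for s in page["sections"]],
          "items": [
              judge("item", json.dumps({"section": s["title"], "item": it}), llm_call)
              for s in page["sections"]
              for it in s["items"]
          ],
      }

The dashboard roll-up is an aggregation over these per-level records; the per-level detail is what routes a failure downward for fixing.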

Canonical wiki instance — Instacart generative recommendations platform (2026-02-26)

Source: sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms

Instacart's Phase-3 evaluation framework runs LLM-as-judge against the three natural levels of the Shopping Hub page hierarchy:

Page level

  • Does the page feel cohesive enough? Diverse enough?
  • Does the full set of generated placements cover all of our business needs?

Placement level

  • Are the titles of high quality and aligned with our brand?
  • Do placement themes align with user preferences and order behavior?

Product level

  • Have we maintained sufficient product recall in the final output?
  • Are the underlying retrieval keywords and products still aligned with the title's thematic intent?

The post names one implementation discipline: human-in-the-loop (HITL) workflows build the ground-truth data, and judges are tuned "until passing high human-alignment thresholds." See patterns/human-aligned-criteria-refinement-loop for the calibration workflow at the sibling system LACE.
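
The post does not state which alignment metric or threshold is used; the sketch below assumes plain percent agreement against HITL labels and an arbitrary 0.9 cutoff, purely to illustrate the gate each level's judge has to pass:

  def passes_human_alignment(judge_labels, human_labels, threshold=0.9):
      # Gate a level-specific judge on agreement with its HITL ground truth.
      # judge_labels / human_labels are parallel lists of verdicts (e.g. "pass"/"fail")
      # for the same sampled artefacts at one level. The percent-agreement metric
      # and the 0.9 threshold are assumptions, not the post's numbers.
      agreement = sum(j == h for j, h in zip(judge_labels, human_labels)) / len(human_labels)
      return agreement >= threshold, agreement

Because each level has its own labelers, the gate runs once per level, with a separate ground-truth set behind each run.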

Why the decomposition is load-bearing

Three reasons, named or implied in the post:

  1. Different failure modes at different scales. Page-level failures (too similar placements across the page) don't show up at product level. Product-level failures (wrong SKU in right-themed placement) don't show up at page level. A single rubric averages both failures into a vague score.
  2. Different HITL labelers at different levels. Page-level coherence is a UX call; placement-level brand is a content-strategy call; product-level relevance is a catalog-team call. Multi-level rubrics let each team own its own level's ground truth.
  3. Different fix loops at different levels. Page-level failures go back to Phase 1 (themes / ordering). Placement-level failures go back to Phase 2 (keyword generation). Product-level failures go back to retrieval plus the Phase 3 filter. The rubric hierarchy mirrors the pipeline hierarchy (see the routing sketch after this list).
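
Reason 3 can be made concrete as a routing table. The phase names below follow the post's Phase 1-3 structure, but the mapping itself is a hypothetical illustration:

  # Hypothetical routing of a failure at each rubric level back to the
  # pipeline phase that owns the fix (phase names follow the post's structure).
  FIX_ROUTE = {
      "page": "phase_1_theme_generation_and_ordering",
      "placement": "phase_2_keyword_generation",
      "product": "phase_3_retrieval_and_filtering",
  }

  def route_failure(level):
      return FIX_ROUTE[level]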

Relationship to DeBERTa cross-encoder

The multi-level LLM-as-judge operates on a sampled basis ("small proportion of users"). The companion patterns/fine-tuned-cross-encoder-as-filter runs on every candidate at >99% cost reduction. Instacart's Phase 3 deploys both, with the cross-encoder specialised to the one quality dimension (theme-product relevance) where LLM-as-judge hits diminishing returns at full-catalog scale; a sketch of the composition follows the list below.

The two patterns compose:

  • LLM-as-judge = multi-dimensional, rationale-emitting, sampled.
  • Cross-encoder = single-dimensional, scalar, full-catalog.
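
A sketch of that composition, assuming a cross_encoder_score and an llm_judge callable; the relevance threshold and sample rate are arbitrary placeholders, not Instacart's numbers:

  import random

  def evaluate_candidates(candidates, cross_encoder_score, llm_judge,
                          relevance_threshold=0.5, sample_rate=0.01):
      # Cross-encoder: single-dimensional, scalar, runs on every candidate and
      # gates on theme-product relevance.
      kept = [c for c in candidates if cross_encoder_score(c) >= relevance_threshold]

      # LLM-as-judge: multi-dimensional, rationale-emitting, runs only on a
      # small sample of what survives the filter.
      sampled = random.sample(kept, max(1, int(len(kept) * sample_rate))) if kept else []
      audits = [llm_judge(c) for c in sampled]
      return kept, audits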

Instacart's explicit justification for running both:

"LLM-as-a-judge evaluators are a powerful tool. However, we found that while this framework guided us well at the averages, it failed at the edges. Since evaluating millions of candidates is cost-prohibitive, LLMs are unable to take action and improve quality at scale."

Sibling: LACE's multi-dimensional chatbot rubric

LACE (the customer-support chatbot evaluation framework) uses a similar multi-dimension idea but at a single trajectory level — five dimensions (Query Understanding / Answer Correctness / Chat Efficiency / Client Satisfaction / Compliance) all applied to the same chat session. See concepts/llm-evaluation-dimensions.

Shopping Hub's multi-level rubric and LACE's multi-dimension rubric are two specialisations of the same underlying discipline: decompose the evaluation axis so different failures become distinguishable.

When the pattern fits

  • Artefact has a natural hierarchy. Page with sections with items. Document with chapters with paragraphs. Conversation with turns with utterances. Otherwise there's no level axis to decompose along.
  • Failures localise at different levels. If every failure manifests at the same level, a single rubric is fine.
  • Different teams own different levels. Multi-level rubrics match team ownership boundaries; that's what keeps the HITL calibration sustainable.

When it doesn't

  • Flat artefact. No hierarchy, no benefit.
  • All levels are identical rubrics. If page-level coherence and placement-level coherence have the same criteria, you have one rubric at two granularities — collapse it.
  • HITL calibration budget is tight. Three rubrics means three calibration workflows; small teams should pick one level and invest deeply there.

Failure modes

  • Rubric overlap. Page-coherence and placement-cohesion overlap — same failure scored twice. Creates correlated noise in the roll-up.
  • Level-specific drift. Page-level judge calibration drifts while placement-level stays aligned. Roll-up scores look fine; page-level regressions go unnoticed.
  • Sampling bias. Each level is sampled separately; joint rare failures (page-coherent but placement-wrong-theme) undersample if the axes aren't joined.
  • Hierarchy false-economy. Running three judges per artefact costs 3× single-judge inference; budget discipline matters.
