PATTERN Cited by 1 source

LLM-as-judge multi-level rubric

LLM-as-judge multi-level rubric is the evaluation pattern of scoring a hierarchical generative artefact (a page containing sections, which in turn contain items) at each level of its hierarchy with a purpose-written rubric per level, rather than collapsing the evaluation into a single end-to-end score.

The pattern is distinct from single-rubric LLM-as-judge because different quality failures manifest at different hierarchical levels: a page can be incoherent even when every individual section is good; a section can be off-brand even when every product in it is relevant; a product can mismatch its section's title while the page as a whole scores well. A single-level rubric averages these failures out; a multi-level one localises them.

Shape

[generated artefact]
  ├─► page-level rubric  → judge#1  → (cohesion, coverage, diversity)
  ├─► section-level rubric  → judge#2  → (title quality, brand, user-preference alignment)
  └─► item-level rubric  → judge#3  → (recall, within-section thematic alignment)

Each level gets its own prompt, its own rubric, and its own calibration. Scores roll up for dashboards; failures route downward for fixing.
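
A minimal sketch of this shape, assuming a generic llm_call helper that returns JSON; the rubric texts, score fields, and function names are illustrative, not Instacart's actual prompts:

  import json

  # Illustrative per-level rubrics; each level gets its own prompt and its own
  # score dimensions. The texts below are assumptions, not the source's prompts.
  RUBRICS = {
      "page": "Rate the page's cohesion, coverage, and diversity from 1-5 each. Return JSON.",
      "section": "Rate this section's title quality, brand fit, and user-preference alignment from 1-5 each. Return JSON.",
      "item": "Rate how well this item matches its section's theme from 1-5. Return JSON.",
  }

  def judge(level, artefact_slice, llm_call):
      # Run the level-specific rubric against one slice of the artefact.
      prompt = f"{RUBRICS[level]}\n\nContent to evaluate:\n{artefact_slice}"
      return json.loads(llm_call(prompt))

  def evaluate_page(page, llm_call):
      # Score every level separately; keep per-level records so a regression
      # can be traced to the level where it occurred, then roll up for dashboards.
      return {
          "page": judge("page", json.dumps(page), llm_call),
          "sections": [judge("section", json.dumps(s), llm_call) for s in page["sections"]],
          "items": [
              judge("item", json.dumps({"section": s["title"], "item": it}), llm_call)
              for s in page["sections"]
              for it in s["items"]
          ],
      }

The dashboard roll-up is an aggregation over these per-level records; the per-level detail is what routes a failure downward for fixing.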

Canonical wiki instance — Instacart generative recommendations platform (2026-02-26)

Source: sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms

Instacart's Phase-3 evaluation framework runs LLM-as-judge against the three natural levels of the Shopping Hub page hierarchy:

Page level

  • Does the page feel cohesive enough? Diverse enough?
  • Does the full set of generated placements cover all of our business needs?

Placement level

  • Are the titles of high quality and aligned with our brand?
  • Do placement themes align with user preferences and order behavior?

Product level

  • Have we maintained sufficient product recall in the final output?
  • Are the underlying retrieval keywords and products still aligned with the title's thematic intent?

The post names one implementation discipline: human-in-the-loop (HITL) workflows build the ground-truth data, and judges are tuned "until passing high human-alignment thresholds." See patterns/human-aligned-criteria-refinement-loop for the calibration workflow at the sibling system LACE.
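
The post does not state which alignment metric or threshold is used; the sketch below assumes plain percent agreement against HITL labels and an arbitrary 0.9 cutoff, purely to illustrate the gate each level's judge has to pass:

  def passes_human_alignment(judge_labels, human_labels, threshold=0.9):
      # Gate a level-specific judge on agreement with its HITL ground truth.
      # judge_labels / human_labels are parallel lists of verdicts (e.g. "pass"/"fail")
      # for the same sampled artefacts at one level. The percent-agreement metric
      # and the 0.9 threshold are assumptions, not the post's numbers.
      agreement = sum(j == h for j, h in zip(judge_labels, human_labels)) / len(human_labels)
      return agreement >= threshold, agreement

Because each level has its own labelers, the gate runs once per level, with a separate ground-truth set behind each run.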

Why the decomposition is load-bearing

Three reasons, named or implied in the post:

  1. Different failure modes at different scales. Page-level failures (too similar placements across the page) don't show up at product level. Product-level failures (wrong SKU in right-themed placement) don't show up at page level. A single rubric averages both failures into a vague score.
  2. Different HITL labelers at different levels. Page-level coherence is a UX call; placement-level brand is a content-strategy call; product-level relevance is a catalog-team call. Multi-level rubrics let each team own its own level's ground truth.
  3. Different fix loops at different levels. Page-level failures go back to Phase 1 (themes / ordering). Placement-level failures go back to Phase 2 (keyword generation). Product-level failures go back to retrieval plus the Phase 3 filter. The rubric hierarchy mirrors the pipeline hierarchy (see the routing sketch after this list).
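
Reason 3 can be made concrete as a routing table. The phase names below follow the post's Phase 1-3 structure, but the mapping itself is a hypothetical illustration:

  # Hypothetical routing of a failure at each rubric level back to the
  # pipeline phase that owns the fix (phase names follow the post's structure).
  FIX_ROUTE = {
      "page": "phase_1_theme_generation_and_ordering",
      "placement": "phase_2_keyword_generation",
      "product": "phase_3_retrieval_and_filtering",
  }

  def route_failure(level):
      return FIX_ROUTE[level]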

Relationship to DeBERTa cross-encoder

The multi-level LLM-as-judge operates on a sampled basis ("small proportion of users"). The companion patterns/fine-tuned-cross-encoder-as-filter runs on every candidate at >99% cost reduction. Instacart's Phase 3 deploys both, with the cross-encoder specialised to the one quality dimension (theme-product relevance) where LLM-as-judge hits diminishing returns at full-catalog scale; a sketch of the composition follows the list below.

The two patterns compose:

  • LLM-as-judge = multi-dimensional, rationale-emitting, sampled.
  • Cross-encoder = single-dimensional, scalar, full-catalog.
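
A sketch of that composition, assuming a cross_encoder_score and an llm_judge callable; the relevance threshold and sample rate are arbitrary placeholders, not Instacart's numbers:

  import random

  def evaluate_candidates(candidates, cross_encoder_score, llm_judge,
                          relevance_threshold=0.5, sample_rate=0.01):
      # Cross-encoder: single-dimensional, scalar, runs on every candidate and
      # gates on theme-product relevance.
      kept = [c for c in candidates if cross_encoder_score(c) >= relevance_threshold]

      # LLM-as-judge: multi-dimensional, rationale-emitting, runs only on a
      # small sample of what survives the filter.
      sampled = random.sample(kept, max(1, int(len(kept) * sample_rate))) if kept else []
      audits = [llm_judge(c) for c in sampled]
      return kept, audits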

Instacart's explicit justification for running both:

"LLM-as-a-judge evaluators are a powerful tool. However, we found that while this framework guided us well at the averages, it failed at the edges. Since evaluating millions of candidates is cost-prohibitive, LLMs are unable to take action and improve quality at scale."

Sibling: LACE's multi-dimensional chatbot rubric

LACE (the customer-support chatbot evaluation framework) uses a similar multi-dimension idea but at a single trajectory level — five dimensions (Query Understanding / Answer Correctness / Chat Efficiency / Client Satisfaction / Compliance) all applied to the same chat session. See concepts/llm-evaluation-dimensions.

Shopping Hub's multi-level rubric and LACE's multi-dimension rubric are two specialisations of the same underlying discipline: decompose the evaluation axis so different failures become distinguishable.

When the pattern fits

  • Artefact has a natural hierarchy. Page with sections with items. Document with chapters with paragraphs. Conversation with turns with utterances. Otherwise there's no level axis to decompose along.
  • Failures localise at different levels. If every failure manifests at the same level, a single rubric is fine.
  • Different teams own different levels. Multi-level rubrics match team ownership boundaries; that's what keeps the HITL calibration sustainable.

When it doesn't

  • Flat artefact. No hierarchy, no benefit.
  • All levels are identical rubrics. If page-level coherence and placement-level coherence have the same criteria, you have one rubric at two granularities — collapse it.
  • HITL calibration budget is tight. Three rubrics means three calibration workflows; small teams should pick one level and invest deeply there.

Failure modes

  • Rubric overlap. Page-coherence and placement-cohesion overlap — same failure scored twice. Creates correlated noise in the roll-up.
  • Level-specific drift. Page-level judge calibration drifts while placement-level stays aligned. Roll-up scores look fine; page-level regressions go unnoticed.
  • Sampling bias. Each level is sampled separately; joint rare failures (page-coherent but placement-wrong-theme) undersample if the axes aren't joined.
  • Hierarchy false-economy. Running three judges per artefact costs 3× single-judge inference; budget discipline matters.
