PATTERN
LLM-as-judge multi-level rubric¶
LLM-as-judge multi-level rubric is the evaluation pattern of scoring a hierarchical generative artefact (a page containing sections containing items) at each level of its hierarchy, using a purpose-written rubric per level, rather than collapsing the evaluation into a single end-to-end score.
The pattern is distinct from single-rubric LLM-as-judge because different quality failures manifest at different hierarchical levels: a page can be incoherent even though every individual section is good, a section can be off-brand even though every product is relevant, and a product can mismatch its section title while the page as a whole scores well. A single-level rubric averages these failures out; a multi-level one localises them.
Shape¶
[generated artefact]
│
├─► page-level rubric → judge#1 → (cohesion, coverage, diversity)
│
├─► section-level rubric → judge#2 → (title quality, brand, user-preference alignment)
│
└─► item-level rubric → judge#3 → (recall, within-section thematic alignment)
Each level gets its own prompt, its own rubric, and its own calibration. Scores roll up for dashboards; failures route downward for fixing.
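The roll-up-and-route mechanics can be sketched as follows. This is a minimal illustration, not Instacart's implementation: `call_judge`, the rubric wording, the 1-5 scale, and the failing threshold of 3 are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class LevelResult:
    level: str
    scores: dict      # dimension name -> 1-5 score
    rationale: str

# Hypothetical per-level rubric prompts; the real prompts are not public.
RUBRICS = {
    "page": "Score cohesion, coverage, diversity (1-5 each), with rationales.",
    "section": "Score title quality, brand fit, preference alignment (1-5 each).",
    "item": "Score recall and thematic alignment with the section title (1-5 each).",
}

def judge_artefact(page, call_judge):
    """Run one judge per hierarchy level. `call_judge(rubric, payload)` is an
    assumed wrapper around an LLM endpoint that returns a LevelResult."""
    results = [call_judge(RUBRICS["page"], page)]
    for section in page["sections"]:
        results.append(call_judge(RUBRICS["section"], section))
        for item in section["items"]:
            results.append(call_judge(RUBRICS["item"], item))
    # Roll up: mean score per level, for dashboards.
    by_level = {}
    for r in results:
        by_level.setdefault(r.level, []).append(sum(r.scores.values()) / len(r.scores))
    rollup = {lvl: sum(v) / len(v) for lvl, v in by_level.items()}
    # Route downward: any result with a failing dimension goes to a fix loop.
    failures = [r for r in results if min(r.scores.values()) < 3]
    return rollup, failures
```

The key property is that `failures` retains the level each failure surfaced at, so a bad item never dilutes a page score and vice versa.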
Canonical wiki instance — Instacart generative recommendations platform (2026-02-26)¶
Source: sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms
Instacart's Phase-3 evaluation framework runs LLM-as-judge against the three natural levels of the Shopping Hub page hierarchy:
Page level¶
- Does the page feel cohesive enough? Diverse enough?
- Does the full set of generated placements cover all of our business needs?
Placement level¶
- Are the titles of high quality and aligned with our brand?
- Do placement themes align with user preferences and order behavior?
Product level¶
- Have we maintained sufficient product recall in the final output?
- Are the underlying retrieval keywords and products still aligned with the title's thematic intent?
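The question sets above can be encoded as per-level criteria that drive prompt construction. The structure follows the post; the prompt wording and scoring scale below are illustrative assumptions.

```python
# The post's questions, encoded as per-level criteria lists.
CRITERIA = {
    "page": ["cohesion", "diversity", "business-need coverage"],
    "placement": ["title quality and brand alignment",
                  "theme alignment with user preferences and order behavior"],
    "product": ["product recall maintained in final output",
                "keywords and products aligned with the title's thematic intent"],
}

def build_prompt(level: str, artefact_text: str) -> str:
    """Assemble a hypothetical judge prompt for one level; wording is illustrative."""
    bullets = "\n".join(f"- {c}: score 1-5 with a one-line rationale"
                        for c in CRITERIA[level])
    return f"Evaluate this {level}-level output.\n{bullets}\n---\n{artefact_text}"
```

Keeping criteria as data rather than baked into prompt strings makes each level's rubric independently versionable, which matters once each level has its own HITL calibration loop.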
The post names the implementation discipline: human-in-the-loop (HITL) workflows build ground-truth data, and judges are tuned "until passing high human-alignment thresholds." See patterns/human-aligned-criteria-refinement-loop for the calibration workflow at sibling system LACE.
Why the decomposition is load-bearing¶
Three reasons, named or implied in the post:
- Different failure modes at different scales. Page-level failures (too similar placements across the page) don't show up at product level. Product-level failures (wrong SKU in right-themed placement) don't show up at page level. A single rubric averages both failures into a vague score.
- Different HITL labelers at different levels. Page-level coherence is a UX call; placement-level brand is a content-strategy call; product-level relevance is a catalog-team call. Multi-level rubrics let each team own their own level's ground truth.
- Different fix loops at different levels. Page-level failures go back to Phase 1 (themes / ordering). Placement-level failures go back to Phase 2 (keyword generation). Product-level failures go back to retrieval + Phase 3 filter. The rubric hierarchy mirrors the pipeline hierarchy.
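Because the rubric hierarchy mirrors the pipeline hierarchy, failure routing reduces to a lookup table. A sketch, with phase names invented here to stand in for the post's Phase 1/2/3 structure:

```python
# Hypothetical mapping from the level a failure surfaces at to the
# upstream pipeline phase that owns the fix.
FIX_ROUTE = {
    "page": "phase1-themes-and-ordering",
    "placement": "phase2-keyword-generation",
    "product": "phase3-retrieval-and-filter",
}

def route_failures(failures):
    """Group judged failures (dicts with a 'level' key) by owning phase."""
    queues = {phase: [] for phase in FIX_ROUTE.values()}
    for f in failures:
        queues[FIX_ROUTE[f["level"]]].append(f)
    return queues
```

A single end-to-end score cannot feed this table; the level annotation is what makes the failure actionable.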
Relationship to DeBERTa cross-encoder¶
The multi-level LLM-as-judge operates on a sampled basis ("small proportion of users"). The companion patterns/fine-tuned-cross-encoder-as-filter runs on every candidate at >99% cost reduction. Instacart's Phase 3 deploys both, with the cross-encoder specialised to the quality dimension (theme-product relevance) where LLM-as-judge hits diminishing returns at full-catalog scale.
The two patterns compose:
- LLM-as-judge = multi-dimensional, rationale-emitting, sampled.
- Cross-encoder = single-dimensional, scalar, full-catalog.
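The composition can be sketched as a cheap full-catalog filter followed by an expensive sampled audit. The function names, the 0.5 threshold, and the sample rate are illustrative assumptions, not values from the post:

```python
import random

def evaluate(candidates, cross_encoder_score, llm_judge, sample_rate=0.01,
             relevance_threshold=0.5, rng=random.Random(0)):
    """Compose the two evaluators: a scalar cross-encoder filter on every
    candidate, then a multi-dimensional LLM judge on a small sample of
    survivors. Both callables are assumed wrappers around real models."""
    # Full-catalog pass: single-dimension relevance gate.
    survivors = [c for c in candidates
                 if cross_encoder_score(c) >= relevance_threshold]
    # Sampled pass: rationale-emitting multi-dimensional audit.
    audited = [llm_judge(c) for c in survivors if rng.random() < sample_rate]
    return survivors, audited
```

The split matches the quoted justification: the judge guides "at the averages" via the sample, while the cross-encoder takes action at the edges of the full catalog.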
Instacart's explicit justification for running both:
"LLM-as-a-judge evaluators are a powerful tool. However, we found that while this framework guided us well at the averages, it failed at the edges. Since evaluating millions of candidates is cost-prohibitive, LLMs are unable to take action and improve quality at scale."
Sibling: LACE's multi-dimensional chatbot rubric¶
LACE (the customer-support chatbot evaluation framework) uses a similar multi-dimension idea but at a single trajectory level — five dimensions (Query Understanding / Answer Correctness / Chat Efficiency / Client Satisfaction / Compliance) all applied to the same chat session. See concepts/llm-evaluation-dimensions.
Shopping Hub's multi-level rubric and LACE's multi-dimension rubric are two specialisations of the same underlying discipline: decompose the evaluation axis so different failures become distinguishable.
When the pattern fits¶
- Artefact has a natural hierarchy. Page with sections with items. Document with chapters with paragraphs. Conversation with turns with utterances. Otherwise there's no level axis to decompose along.
- Failures localise at different levels. If every failure manifests at the same level, a single rubric is fine.
- Different teams own different levels. Multi-level rubrics match team ownership boundaries; that's what keeps the HITL calibration sustainable.
When it doesn't¶
- Flat artefact. No hierarchy, no benefit.
- All levels are identical rubrics. If page-level coherence and placement-level coherence have the same criteria, you have one rubric at two granularities — collapse it.
- HITL calibration budget is tight. Three rubrics means three calibration workflows; small teams should pick one level and invest deeply there.
Failure modes¶
- Rubric overlap. Page-coherence and placement-cohesion overlap — same failure scored twice. Creates correlated noise in the roll-up.
- Level-specific drift. Page-level judge calibration drifts while placement-level stays aligned. Roll-up scores look fine; page-level regressions go unnoticed.
- Sampling bias. Each level is sampled separately; rare joint failures (page-coherent but placement-wrong-theme) are undersampled unless the sampling axes are joined.
- Hierarchy false-economy. Running three judges per artefact costs 3× single-judge inference; budget discipline matters.
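Level-specific drift in particular is detectable with the HITL ground truth already in hand: track judge-human agreement per level, not just in aggregate. A minimal sketch, using percent agreement as a stand-in for whatever alignment metric a team actually uses:

```python
def level_alignment(judge_labels, human_labels):
    """Agreement between judge verdicts and HITL ground truth at one level."""
    agree = sum(j == h for j, h in zip(judge_labels, human_labels))
    return agree / len(judge_labels)

def drift_report(per_level_pairs, floor=0.8):
    """per_level_pairs: {level: (judge_labels, human_labels)}.
    A drop at one level flags level-specific drift even when the
    roll-up score looks healthy. The 0.8 floor is illustrative."""
    return {lvl: ("OK" if level_alignment(j, h) >= floor else "RECALIBRATE")
            for lvl, (j, h) in per_level_pairs.items()}
```

This is the monitoring counterpart of the "high human-alignment thresholds" the post describes tuning toward.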
Relation to sibling patterns¶
- patterns/fine-tuned-cross-encoder-as-filter — complementary scale multiplier for the specific quality dimension where LLM-as-judge hits cost ceilings.
- patterns/human-aligned-criteria-refinement-loop — the calibration workflow that tunes each level's rubric against HITL ground truth.
- patterns/vlm-evaluator-quality-gate — the image-output sibling at PIXEL (single-level).
Seen in¶
- sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms — canonical wiki instance at Instacart's Shopping Hub evaluation framework. Three-level rubric (page / placement / product) with HITL-calibrated human-alignment thresholds.
Related¶
- patterns/top-down-cascaded-page-generation — the host pattern Phase 3 sits inside.
- patterns/fine-tuned-cross-encoder-as-filter — the scale-multiplier complement.
- patterns/human-aligned-criteria-refinement-loop — the calibration discipline.
- concepts/llm-as-judge — the parent concept.
- concepts/llm-evaluation-dimensions — the dimension-axis sibling (LACE's five-dimension rubric).
- concepts/human-llm-evaluation-alignment — the calibration target.
- systems/instacart-generative-recommendations-platform — canonical production consumer.
- systems/lace-instacart — sibling Instacart multi-dimensional judge.
- companies/instacart