CONCEPT Cited by 13 sources

LLM as Judge¶

LLM-as-judge is the evaluation pattern in which one LLM scores another model's (or agent's) output against a rubric — accuracy, helpfulness, policy adherence, structural correctness — replacing or augmenting human evaluation during iteration on non-deterministic components.

Why it matters¶

LLM outputs and agent trajectories are non-deterministic: the same prompt can produce different answers across runs. Classical unit/integration tests with exact-match assertions become flaky or over-constrained. A judge LLM gives a structural score ("did this response correctly identify the slow query? did it recommend a safe action?") that tolerates wording variation while catching true regressions.
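
As an illustration of the contrast above, here is a minimal sketch in which two differently worded responses fail an exact-match assertion but both pass a rubric check. Everything here is hypothetical: `Verdict`, `stub_judge`, the query name `orders_by_user`, and the sample responses are invented for the example, and the keyword-based stub only stands in for a real judge-LLM call.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """One boolean per rubric question, mirroring the two questions in the text."""
    identified_slow_query: bool
    recommended_safe_action: bool

    @property
    def passed(self) -> bool:
        return self.identified_slow_query and self.recommended_safe_action


def stub_judge(response: str) -> Verdict:
    # ASSUMPTION: a real judge would be an LLM prompted with the rubric.
    # Keyword checks are used here only so the sketch runs offline.
    text = response.lower()
    return Verdict(
        identified_slow_query="orders_by_user" in text,
        recommended_safe_action="index" in text and "drop table" not in text,
    )


# Two runs of the same (hypothetical) agent on the same input.
run_a = "The slow query is orders_by_user; add an index on user_id."
run_b = "Latency comes from orders_by_user. Creating an index on user_id should help."

exact_match_ok = run_a == run_b                                   # brittle check
judge_ok = stub_judge(run_a).passed and stub_judge(run_b).passed  # rubric check
```

The exact-match comparison rejects a harmless rewording, while the rubric-shaped check accepts both runs and would still reject a response that names the wrong query.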

Typical usage¶

Regression harness. Snapshot production-state inputs → replay through candidate agent configs → judge scores new outputs against reference.
Leaderboard evaluation. Score many model/prompt/tool combinations against a shared rubric to pick the next rollout candidate.
Per-response observability. Post-hoc judge scoring on production traffic to flag degrading responses.
Trajectory scoring. Score the full agent trace (tool calls + intermediate reasoning + final answer) against a ground-truth root-cause analysis (RCA) with a depth / completeness / telemetry-surfaced rubric. Surfaces partial credit and reasoning-shape regressions that terminal-answer scoring misses.
Up-funnel label-pipeline validation. Judge-style scoring applied before evaluation: the judge assigns confidence scores to agent-generated candidate labels and routes sub-threshold labels to human review. See patterns/agent-assisted-label-validation. Datadog shipped this on top of Bits AI SRE as the mechanism that cut label-validation time by more than 95%.
Judge-as-labeler for training data. LLM judge scores (query, document) pairs on a graded relevance scale (concepts/relevance-labeling); the scored pairs become the supervised training set for a downstream learning-to-rank model (patterns/human-calibrated-llm-labeling). Humans don't label the training set — humans label a small seed set that calibrates the judge. Dropbox Dash's production pattern (see systems/dash-relevance-ranker); force multiplier ~100×.
Plan-sufficiency judgment inside the generation loop. Judge scores whether the current plan is adequate at each step of an agent's inner loop, gating add-or-fix refinement — not scoring a completed trajectory or output post-hoc. Google Research's DS-STAR is the canonical wiki instance: the Verifier agent "is an LLM-based judge prompted to determine if the current plan is adequate"; on reject, a Router agent chooses add-vs-fix on the plan and the loop repeats (up to 10 rounds). See patterns/planner-coder-verifier-router-loop and concepts/iterative-plan-refinement. This is load-bearing specifically for open-ended problems "that lack ground-truth labels, making it difficult to verify if an agent's reasoning is correct."
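
The plan-sufficiency item above can be sketched as a judge-gated loop with a round budget. This is a simplified illustration, not DS-STAR's implementation: the callables `execute`, `verify`, `route`, and `propose_step` are hypothetical stand-ins for the Coder, Verifier, Router, and Planner roles, and "fix" is reduced here to replacing the last step.

```python
from typing import Callable, List, Tuple

MAX_ROUNDS = 10  # round budget mentioned in the text


def refine_plan(
    plan: List[str],
    execute: Callable[[List[str]], str],
    verify: Callable[[List[str], str], bool],
    route: Callable[[List[str], str], str],
    propose_step: Callable[[List[str], str], str],
    max_rounds: int = MAX_ROUNDS,
) -> Tuple[List[str], int, bool]:
    """Return (final_plan, rounds_used, approved)."""
    plan = list(plan)
    for round_no in range(1, max_rounds + 1):
        result = execute(plan)
        if verify(plan, result):        # judge says the plan is adequate
            return plan, round_no, True
        if round_no == max_rounds:      # budget exhausted, stop refining
            break
        step = propose_step(plan, result)
        if route(plan, result) == "fix" and plan:
            plan[-1] = step             # simplified "fix": replace last step
        else:
            plan.append(step)           # "add": extend the plan
    return plan, max_rounds, False


# Offline stubs (all hypothetical) so the loop can be exercised without an LLM.
def run_plan(plan: List[str]) -> str:
    return f"ran {len(plan)} steps"

def next_step(plan: List[str], result: str) -> str:
    return f"step {len(plan) + 1}"

final_plan, rounds, approved = refine_plan(
    ["load data"],
    execute=run_plan,
    verify=lambda plan, result: len(plan) >= 3,  # stub judge: needs 3 steps
    route=lambda plan, result: "add",
    propose_step=next_step,
)
```

The point of the shape is that the judge's verdict controls termination, so the round budget is the only guarantee that the loop ends when the judge never approves.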

Tradeoffs / gotchas¶

The judge has its own biases. A judge LLM can reward verbose or confident-sounding wrong answers. Rubrics must be specific enough to resist these modes.
Judge drift. When the judge model is updated, past scores are not directly comparable. Snapshot the judge version alongside the run.
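Cost. Every replay requires a candidate call and a judge call, so a batch eval makes roughly twice as many model calls. The actual cost multiple depends on the judge's model tier and prompt length; it is 2× only when a judge call costs about the same as a candidate call.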
Not a safety proof. Judging "did the response look good" is not the same as "was the recommended action safe". Mutating actions still need guardrails outside the judge loop.
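
The judge-drift and cost items above suggest recording the judge version and the call count with every run. The following is a minimal sketch under stated assumptions: `EvalRun`, `replay`, and `comparable` are hypothetical names, and the `candidate` and `judge` callables stand in for real model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class EvalRun:
    judge_version: str      # snapshot of the judge used for this run
    candidate_config: str
    scores: Dict[str, float] = field(default_factory=dict)
    model_calls: int = 0    # candidate calls + judge calls


def replay(
    snapshots: Dict[str, str],
    candidate: Callable[[str], str],
    judge: Callable[[str, str], float],
    judge_version: str,
    candidate_config: str,
) -> EvalRun:
    """Replay each snapshot through the candidate, then score it with the judge."""
    run = EvalRun(judge_version=judge_version, candidate_config=candidate_config)
    for case_id, prompt in snapshots.items():
        output = candidate(prompt)
        run.model_calls += 1
        run.scores[case_id] = judge(prompt, output)
        run.model_calls += 1
    return run


def comparable(a: EvalRun, b: EvalRun) -> bool:
    # Scores are only directly comparable when the same judge produced them.
    return a.judge_version == b.judge_version


# Offline stubs (hypothetical) so the sketch runs without a model.
snapshots = {"case-1": "p1", "case-2": "p2", "case-3": "p3"}
run_old = replay(snapshots, lambda p: p.upper(), lambda p, o: 1.0, "judge-v1", "cfg-a")
run_new = replay(snapshots, lambda p: p.upper(), lambda p, o: 1.0, "judge-v2", "cfg-a")
```

Here three cases produce six model calls, and the two runs are flagged as not directly comparable because the judge version differs.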

Seen in¶

sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library — Production-monitoring-against-concept-drift face with conservative pass/fail/unknown ternary, in an Entity Resolution pipeline at 17M+ asset scale. Third major LLM-as-judge face on the wiki: not eval-harness for agent iteration (systems/storex) and not judge-as-labeler for ranking-data (systems/dash-relevance-ranker) but the continuous-evaluation substrate against concept drift in production, with the judge model deliberately conservative. Cited at two altitudes in the source: (a) MLflow-anchored: "We implemented a comprehensive evaluation strategy using 'LLM as a Judge' alongside manual labeling sessions. MLflow capabilities allowed us to constantly evaluate model performance to prevent concept drift."; (b) inline-the-CSAF-ETL: "To keep this pipeline reliable at scale, we use an LLM as a Judge approach to continuously score the quality of our own LLM outputs. Instead of relying only on fully labeled ground truth — which is often missing or ambiguous in real-world CPS data — we let a dedicated judge model review another model's response and decide whether it looks acceptable. The judge's job is simple and conservative: mark each result as pass, looks correct, fail, looks wrong, or unknown, not enough information." Three structural moves: (1) the pass/fail/unknown ternary is explicit — not a continuous score and not a binary — so ambiguous cases route to "unknown" rather than being forced into pass-or-fail (especially load-bearing when ground-truth is missing or ambiguous); (2) judge outputs persist in Delta tables so historical judge calls are queryable for drift detection and version comparison; (3) custom MLflow GenAI judges run structured evaluations giving "a consistent way to monitor quality, compare versions, and catch regressions across many LLM use cases — without building a bespoke evaluation stack for every new workflow." 
Composes with patterns/llm-judge-as-inline-pipeline-stage (synchronous inline-the-pipeline shape) and patterns/hybrid-classical-er-plus-genai (Entity Resolution at scale). Canonical wiki instance: systems/claroty-cps-library.
sources/2025-12-03-databricks-ai-agent-debug-databases — Databricks' systems/storex uses a snapshot-replay harness scored by a judge LLM as its primary regression signal during prompt / tool iteration, referencing MLflow 3's judges documentation.
sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm: Instacart's LACE (LLM-Assisted Chatbot Evaluation) canonicalises three production judge-design moves: (1) binary True/False scoring beats graded 1-10 on alignment and prompt-engineering cost (session-level scores aggregate from per-criterion binaries); (2) three evaluation engines were compared (direct prompting, agentic reflection, agentic debate), and debate wins for context-dependent and simple criteria (Customer + Support + Judge sub-agents, parallel critics with no cross-talk); (3) decouple reasoning from structured-output formatting: a strong reasoner (o1-preview at writing time) produces a free-form rationale and a separate step emits JSON, which escapes restricted-decoding quality loss. Five-dimension rubric (Query Understanding / Answer Correctness / Chat Efficiency / Client Satisfaction / Compliance) with three-tier complexity grouping (simple / context-dependent / subjective); "near-perfect accuracy" on simple criteria, >90% accuracy on context-dependent (business-model knowledge embedded in static prompt template; RAG-retrieval is future work); subjective criteria de-prioritised ("low-ROI" to refine, keep as directional check only, fix the chatbot rather than the judge). Bootstrap + regression via human-LACE alignment loop: same structural shape as Dropbox Dash human-calibrated labelling, different object (judge-criteria-prompt refinement vs. judge-as-labeller). Production deployment: stratified topic sampling → dashboards → direct experimentation-platform feedback loop. Parallel-play sibling to Lyft AI localisation and Zalando search AI-as-judge: three Tier-2/3 companies shipping the same LLM-as-judge → dashboard → experimentation architecture to different customer-facing surfaces at the same time.
sources/2026-04-07-datadog-bits-ai-sre-eval-platform — Datadog's Bits AI SRE evaluation platform uses judge-style scoring in two places: (a) as the rubric-based scorer over agent trajectories (trajectory evaluation, pass@k, final-answer correctness across tens of thousands of scenarios); (b) as an up-funnel confidence scorer on generated labels themselves — the patterns/agent-assisted-label-validation pipeline passes candidate ground-truth RCAs through confidence scoring across thoroughness / specificity / accuracy dimensions, flagging sub-threshold labels for human review. Alignment studies with human judges gated the trust transition; label-validation time dropped >95% once this was operational.
sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash: Dropbox Dash's LLM-as-judge iteration arc is a canonical "retrieval-relevance judge" case. It names a four-step disagreement-reduction sequence in which each step cumulatively lowers judge-vs-human disagreement: (1) baseline judge prompt, at ~8% disagreement; (2) prompt refinement, "provide explanations for what you're doing" (classic chain-of-thought framing), which lowers disagreement; (3) stronger model, an upgrade to OpenAI's o3 reasoning model, lower still; (4) RAG as a judge, letting the judge fetch work-context for domain-specific acronyms and terminology it wasn't trained on ("What is RAG?" might mean something internal to Dropbox that the model hasn't seen), lower still. DSPy is then layered on top of these four steps via the patterns/prompt-optimizer-flywheel (bullet-pointed disagreements → optimizer → reduced-disagreement prompt): "quite pleased with some of the results." LLM-as-judge is explicitly called out as a prime DSPy target because judges have "crystal clear rubrics and evals… you know exactly what the outcome should be." Also: Dash uses NDCG scored against judge / human labels as the retrieval quality metric.
sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search — companion to the Clemm transcript, covering the LLM-as-labeler side: LLM judges generate hundreds of thousands to millions of relevance labels used to train Dash's XGBoost ranker (systems/dash-relevance-ranker). Adds (a) Mean Squared Error on the 1–5 graded scale (range 0–16) as the named judge-vs-human agreement metric — small disagreements get small penalty, wide disagreements get quadratically larger penalty; (b) the human-calibrated LLM labeling pipeline (patterns/human-calibrated-llm-labeling) — humans label a seed set, calibrate the judge, the judge amplifies ~100×; (c) patterns/behavior-discrepancy-sampling — sample for human review the cases where user behaviour disagrees with the LLM's judgment (clicks on low-rated results / skips on high-rated results); (d) patterns/judge-query-context-tooling — the judge is given retrieval tools to research query context before scoring (canonical example: "diet sprite" = internal performance-management tool, not a drink), generalising concepts/rag-as-a-judge from single-retrieval to tool-using agent. DSPy named as prompt-optimisation framework minimising MSE. Explicit framing: "LLMs make it possible to apply human judgment consistently and at scale, rather than replacing it."
sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy — model-adaptation edition: same Dash relevance judge retargeted across o3 / gpt-oss-120b / gemma-3-12b via DSPy. Introduces two quality axes as first-class peers for an LLM judge: (a) alignment — NMSE on a 1–5 scale rescaled to 0–100 (cut 8.83 → 4.86, −45% on gpt-oss-120b); (b) operational reliability — JSON validity rate (malformed-JSON cut 42% → <1.1% on gemma-3-12b, a 97%+ reduction, with malformed counted as fully incorrect). DSPy's GEPA optimiser drives alignment via feedback strings; DSPy's MIPROv2 optimiser drives both axes simultaneously on small models. Adds three usage regimes by risk tolerance: patterns/cross-model-prompt-adaptation (full rewrite, cheap targets), patterns/instruction-library-prompt-composition (constrained bullet-selection, production o3), and the fixed-model patterns/prompt-optimizer-flywheel. Formalises overfitting guardrails (invariants live in the feedback string). Cycle time: 1–2 weeks manual → 1–2 days with DSPy. Label coverage: 10–100× at same cost on cheaper targets.
sources/2025-11-06-google-ds-star-versatile-data-science-agent — plan-sufficiency edition: the judge sits inside the agent's generation loop, not in an eval harness. Google Research's DS-STAR uses a Verifier LLM to score whether the current plan is adequate after each Coder execution; the loop extends or repairs the plan (via a separate Router agent) on reject and exits on approval or at 10 rounds. The primitive is load-bearing because open-ended data-science problems "lack ground-truth labels, making it difficult to verify if an agent's reasoning is correct" — the judge is the only feasible structural check. Empirical round-count distribution on DABStep: 3.0 avg rounds on easy tasks (>50 % single-round) vs 5.6 on hard tasks; state-of-the-art accuracy (45.2 % on DABStep vs 41.0 % prior best). This extends the wiki's LLM-as-judge framing along a new axis — in the loop that generates the thing being judged, co-evolving with the plan rather than scoring a fixed artefact. Canonical realisation is patterns/planner-coder-verifier-router-loop; loop-level concept is concepts/iterative-plan-refinement; termination discipline is concepts/refinement-round-budget.
sources/2026-04-20-cloudflare-orchestrating-ai-code-review-at-scale: coordinator judge-pass edition, in which the judge is the coordinator agent itself consolidating output from N specialised sub-reviewers. After the seven sub-reviewers return structured XML findings, the coordinator (running on the top model tier, Opus 4.7 / GPT-5.4) performs a judge pass that: (1) deduplicates findings surfaced by multiple reviewers; (2) re-categorises, so that a perf bug flagged by code-quality moves to the perf section; (3) applies a reasonableness filter dropping speculative issues, nitpicks, false positives, convention-contradicted findings; (4) when uncertain, uses its tools to read the source code and verify before emitting. Differs from prior LLM-as-judge instances in that the judge is (a) wrapped inside a production pipeline with GitLab actions as its downstream effects, (b) scored in production by how often humans invoke break glass (0.6% of MRs), and (c) responsible for writing a natural-language verdict that ships as the single GitLab review comment. Canonical wiki instance of the coordinator-pattern variant of LLM-as-judge (see patterns/coordinator-sub-reviewer-orchestration).
sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms — structured-extraction production instance. Instacart PARSE runs LLM-as-judge in two places: (a) development-mode auto-eval on a small sample alongside human labelers to let prompt authors iterate between human-eval cycles without blocking on them; (b) production-mode drift detection on periodic random samples of live extractions (paired with human auditors) to catch quality regressions on newly onboarded products. Orthogonal to the per-extraction self-verification confidence score — the two cover different failure modes.
sources/2026-02-19-lyft-scaling-localization-with-ai: machine-translation in-loop judge instance. Lyft's AI localization pipeline uses the Evaluator agent as an in-loop judge gating each translated string: it grades 3 Drafter-produced candidates on a 4-dimension rubric (accuracy/clarity, fluency/adaptation, brand alignment, technical correctness), picks the best on any-pass, or feeds per-candidate critique back to the Drafter on all-fail (up to 3 attempts). Canonical on-wiki articulation of the four architecture-determining reasons to separate the generator from the judge instance: easier eval, context preservation on the generator side, self-approval bias avoidance (concepts/self-approval-bias), tier-differentiated model choice. Canonical text-translation sibling of the image (PIXEL), agent-plan (DS-STAR), and structured-extraction (PARSE) in-loop-judge instances.
sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge — search-relevance pre-launch-validation instance. Zalando's Search & Browse team ships Search Quality Framework: an offline patterns/llm-as-judge-for-search-quality pipeline that scores (query, product) pairs on a graded 0–4 relevance rubric using GPT-4o with product data + images as visual-text context (concepts/visual-text-relevance-judgment). Applied to pre-launch market validation for Zalando's 2025 Luxembourg / Portugal / Greece launches — a setting where click-based quality signals are by definition unavailable. The test set is constructed by NER-clustered sampling from existing-market production traffic with LLM-translation to the target language; segment-level aggregates (patterns/segment-level-relevance-dashboard) surface three named failure classes (incorrect product attributes; unrecognised NER terms; undiscoverable brand categories). Cost: ~$250 per full run — 1,500 segments × 25 results × 3 markets; 3–5 hours. Judge's generalised reasoning (no per-attribute prompts) contrasts with Netflix Synopsis Judge's per-criterion judges — search relevance is a one-dimensional axis where Netflix's creative-quality axes are four-dimensional. Canonical wiki instance of LLM-as-judge in the search-relevance domain with visual-text context.
sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms: multi-level-rubric variant, extending the per-trajectory multi-dimensional rubric into a three-level hierarchy matching the artefact's own structure. Instacart's generative recommendations platform (Shopping Hub rebuild) runs LLM-as-judge at page / placement / product levels, each with its own rubric (see patterns/llm-as-judge-multi-level-rubric). Key architectural insight from this instance: "LLM-as-a-judge evaluators are a powerful tool… it failed at the edges. Since evaluating millions of candidates is cost-prohibitive, LLMs are unable to take action and improve quality at scale." That cost-at-scale limitation motivates the companion fine-tuned DeBERTa cross-encoder (patterns/fine-tuned-cross-encoder-as-filter) at >99% cheaper than LLM inference, running on every candidate rather than a sample, which turns evaluation into action. Canonical wiki instance of the action-vs-measurement tension in LLM-as-judge at full-catalog scale.
sources/2026-05-11-databricks-unlocking-the-archives — inline-pipeline-stage edition. The Databricks-for-Good groundwater-archive pipeline embeds the judge as a first-class pipeline stage gating every document classification produced by the upstream multimodal classifier. Same primitive (ai_query) for both producer and judge — different models, same SQL surface. Rubric: accuracy / completeness / consistency "checking whether the classifications are supported by what the model actually observed." Output: categorical rating (excellent / good / fair / poor) plus a written justification — the justification column doubles as audit trail. Sub-threshold documents route to manual review; "in the first full run, only a small fraction of classifications required human attention." First-run result: 95% rated excellent/good across 5,570 pages / 654 documents. Distinguishes from prior wiki instances by being a production runtime gate, not an offline / CI eval — every output of every run flows through the judge before reaching downstream consumers (MapAid's WellMapr groundwater prediction models).
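
The Dash entries above report judge-vs-human agreement as Mean Squared Error on a 1–5 scale (range 0–16) and as NMSE rescaled to 0–100. A minimal sketch of both follows. Two points are assumptions rather than statements from the sources: the rescaling is taken to be `100 * MSE / 16`, and a malformed judge output (represented as `None`) is charged the maximum squared error as one way to count it as "fully incorrect".

```python
from typing import Optional, Sequence

SCALE_MIN, SCALE_MAX = 1, 5
MAX_SQ_ERR = (SCALE_MAX - SCALE_MIN) ** 2  # 16, the worst case on a 1-5 scale


def mse(human: Sequence[int], judge: Sequence[Optional[int]]) -> float:
    """Mean squared error between human labels and judge labels.

    ASSUMPTION: a judge label of None means the output could not be parsed
    and is charged the maximum penalty.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    total = 0.0
    for h, j in zip(human, judge):
        total += MAX_SQ_ERR if j is None else (h - j) ** 2
    return total / len(human)


def nmse(human: Sequence[int], judge: Sequence[Optional[int]]) -> float:
    # ASSUMPTION: normalise by the maximum squared error and rescale to 0-100.
    return 100.0 * mse(human, judge) / MAX_SQ_ERR


human_labels = [5, 3, 1, 4]
judge_labels = [4, 3, None, 2]   # third output was malformed
example_mse = mse(human_labels, judge_labels)    # (1 + 0 + 16 + 4) / 4 = 5.25
example_nmse = nmse(human_labels, judge_labels)  # 100 * 5.25 / 16 = 32.8125
```

Squaring the error means an off-by-one disagreement costs 1 while an off-by-two disagreement costs 4, which matches the "quadratically larger penalty" described above.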

Related¶

patterns/snapshot-replay-agent-evaluation — the replay harness that judges plug into.
systems/mlflow — hosts the judges primitive at Databricks.
systems/storex — production consumer of the pattern.
systems/bits-ai-sre-eval-platform — Datadog's productionised eval platform applying judge-style scoring at both ends of the pipeline.
patterns/agent-assisted-label-validation — up-funnel judge application: scoring labels, not just outputs.
concepts/trajectory-evaluation — the scoring shape for agent traces, not just final answers.
concepts/pass-at-k — complementary metric for quantifying non-determinism.
concepts/nmse-normalized-mean-squared-error — specific alignment metric used at Dash for DSPy optimisation.
concepts/structured-output-reliability — co-equal operational axis to alignment when the judge is consumed programmatically.
patterns/cross-model-prompt-adaptation — retarget a judge across different models via DSPy.
patterns/instruction-library-prompt-composition — constrained DSPy optimisation for high-stakes production judges.
patterns/tool-decoupled-agent-framework — the iteration loop that makes judge scoring valuable.
patterns/planner-coder-verifier-router-loop — the plan-sufficiency variant: judge sits inside the generation loop, gating add-or-fix plan refinement.
concepts/iterative-plan-refinement — loop-level framing the plan-sufficiency judge plugs into.
concepts/refinement-round-budget — termination discipline for judge-gated loops.
systems/ds-star — canonical in-loop-judgment instance.
concepts/vlm-as-image-judge — multimodal sibling — same "one model scores another model's output against a rubric" structure applied to images. Instacart PIXEL's 20% → 85% human-judge approval rate uplift is the canonical production instance.
patterns/vlm-evaluator-quality-gate — the image-output sibling of judge-gated refinement.
systems/instacart-pixel — canonical image-generation instance of the judge-in-loop pattern.
systems/instacart-parse — canonical structured-extraction instance: LLM-as-judge runs in both development mode (auto-eval to unblock prompt iteration) and production mode (periodic random-sample drift detection) alongside human auditors. (Source: sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms)
concepts/llm-self-verification — the self-judge special case: the extracting model scores its own output via entailment-prompt + yes-token logit.
patterns/human-in-the-loop-quality-sampling — the periodic-sample drift-detection loop inside which LLM-as-judge multiplies reviewer coverage.
patterns/drafter-evaluator-refinement-loop — the text-translation in-loop-judge sibling pattern (Lyft's AI localization pipeline).
patterns/multi-candidate-generation — the N-candidates subroutine the Evaluator selects over.
concepts/self-approval-bias — the failure mode the generator-vs-judge separation mitigates.
concepts/machine-translation-with-llms — the task context in which Lyft's in-loop judge operates.
systems/lyft-ai-localization-pipeline — canonical text-translation production instance.
systems/lace-instacart — chatbot-evaluation production instance: multi-agent debate + binary rubric + reasoning/output decouple + human-alignment loop.
patterns/multi-agent-debate-evaluation — Customer + Support + Judge three-agent evaluation engine (Instacart LACE).
patterns/self-reflection-llm-evaluation — the two-pass single-agent variant Instacart benchmarked before picking debate.
patterns/human-aligned-criteria-refinement-loop — bootstrap + regression-test pattern for LLM-as-judge rubrics.
concepts/binary-vs-graded-llm-scoring — the rubric-shape choice Instacart measured in favour of binary.
concepts/decouple-reasoning-from-structured-output — the free-form-reasoning → formatter two-pass design.
concepts/llm-evaluation-dimensions — rubric-dimension design (5 dimensions + 3 complexity tiers at LACE).
concepts/human-llm-evaluation-alignment — the calibration concept behind the refinement loop.
concepts/stratified-topic-sampling — the production-monitoring sampling strategy paired with judges in long-tailed traffic.
sources/2026-04-21-meta-modernizing-facebook-groups-search — Meta Groups Scoped Search — Llama 3 multimodal judge in the BVT pipeline. Canonical wiki instance of the LLM-judge-in-BVT pattern: Meta integrates a Llama 3 multimodal judge directly into the build-verification-test CI stage, grading search results on a graded "exact-match / somewhat-relevant / irrelevant" rubric before any build advances to production. First Meta-authored LLM-judge-in-CI instance on the wiki; stronger operational stance than offline-eval-leaderboard variants (Zalando / Instacart) — judge verdict is a build-gate, not only a training signal. Motivation: "validate quality at scale without the bottleneck of human labeling."
