PATTERN
LLM-as-Judge for Search Quality¶
Intent¶
Evaluate a search stack's relevance quality without relying on
user click signal by having a multi-modal LLM judge score
every (query, returned product) pair in a representative test
set on a graded rubric, and aggregating the scores up to a
segment- / market- / stack-level quality report.
The pattern replaces (or augments) click-based bucket tests and human-annotator panels as the short-loop signal for relevance regressions and pre-launch quality.
(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
Structure¶
test queries (sampled, clustered, translated)
│
▼
┌───────────────────────┐
│ Query → Search API │──────► result set (top-K products)
└───────────────────────┘
│
▼
for each (query, product):
fetch product data + image (cached)
ask LLM judge:
"Score relevance on 0–4 scale"
record score (cached)
│
▼
aggregate by segment (NER-tag set) → per-segment avg
aggregate by market → per-market avg
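The loop above can be sketched in a few lines. `run_search`, `fetch_product`, `judge_relevance`, and `segment_of` are hypothetical stand-ins for the search API, the product store, the LLM judge call, and the NER-tag segment lookup; none of these names come from the post.

```python
from collections import defaultdict
from statistics import mean

def evaluate(queries, run_search, fetch_product, judge_relevance,
             segment_of, top_k=25):
    """Score every (query, product) pair once and aggregate per segment.

    All four callables are hypothetical stand-ins for the real services.
    """
    scores = {}                      # (query, product_id) -> 0..4 grade
    by_segment = defaultdict(list)
    for query in queries:
        for product_id in run_search(query)[:top_k]:
            key = (query, product_id)
            if key not in scores:                       # dedup cache
                product = fetch_product(product_id)     # data + image
                scores[key] = judge_relevance(query, product)  # 0..4
            by_segment[segment_of(query)].append(scores[key])
    # per-segment averages; a per-market rollup would average over these
    return {seg: mean(vals) for seg, vals in by_segment.items()}
```

The in-memory dict only dedups within a run; the persistent variant is covered by patterns/query-product-evaluation-cache.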
Core design choices¶
- Multi-modal judge. Product image + product data as evaluation context. See concepts/visual-text-relevance-judgment.
- Graded rubric output (0–4), not binary. 4 = perfect match / 0 = completely wrong; intermediate grades capture partial relevance. Graded output lets segment aggregates be continuous rather than pass-rate-only.
- Generalised prompt, not per-attribute rubrics. "The reasoning is generalised and does not require specific prompts to instruct the LLM to look for specific attributes or specific parts of images."
- (query, product) dedup cache. See patterns/query-product-evaluation-cache.
- Segment-level aggregation. See patterns/segment-level-relevance-dashboard.
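A minimal sketch of the dedup cache as a persistent store, so re-runs and overlapping result sets only pay for unseen pairs. The SQLite backing and the (query, product, model) key are assumptions for illustration, not details from the post.

```python
import sqlite3

class JudgmentCache:
    """Persistent (query, product, model) -> score cache (sketch)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS judgments ("
            "query TEXT, product_id TEXT, model TEXT, score INTEGER, "
            "PRIMARY KEY (query, product_id, model))")

    def get_or_judge(self, query, product_id, model, judge_fn):
        # Return the cached grade if this pair was judged before;
        # otherwise call the (expensive) judge and store the result.
        row = self.db.execute(
            "SELECT score FROM judgments "
            "WHERE query=? AND product_id=? AND model=?",
            (query, product_id, model)).fetchone()
        if row:
            return row[0]
        score = judge_fn(query, product_id)
        self.db.execute("INSERT INTO judgments VALUES (?,?,?,?)",
                        (query, product_id, model, score))
        self.db.commit()
        return score
```

Keying on the judge model means a model upgrade naturally invalidates old judgments instead of silently mixing scales.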
Zalando's concrete realisation¶
- Judge: GPT-4o for pre-market-launch evaluation.
- Test-set construction: production queries from existing markets, clustered by NER-tag set (concepts/ner-clustered-query-sampling), LLM-translated to the target language where needed.
- Scope: 1,500 segments × 25 results per market; 3 markets in parallel.
- Cost / time: $250 per run, 3–5 hours.
- Orchestration: Airflow TaskGroup-per-market + KubernetesPodOperator-per-stage. See patterns/per-market-parallel-taskgroup-dag and patterns/podoperator-encapsulated-evaluation-job.
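The quoted numbers imply the following back-of-envelope scale. Whether "$250 per run" covers one market or all three is not stated in the post, so the per-judgment figure below assumes one market per run.

```python
# Back-of-envelope from the quoted figures (assumption: one run = one market).
segments_per_market = 1_500
results_per_segment = 25
markets = 3

pairs_per_market = segments_per_market * results_per_segment   # 37,500 judgments
total_pairs = pairs_per_market * markets                       # 112,500 across markets

cost_per_run_usd = 250
cost_per_judgment = cost_per_run_usd / pairs_per_market        # under a cent each
```

Either way the reading goes, the per-judgment cost lands well below a cent, which is what makes daily re-runs viable.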
When to reach for this pattern¶
- Pre-launch new-market / new-locale evaluation. No click signal available; LLM-judge is the structural substitute. See concepts/pre-launch-market-validation.
- Regression detection on low-traffic segments. Click statistics too noisy; offline LLM-judge run cheap enough to re-run daily.
- Counterfactual ranker evaluation. Before rolling out a new ranker, run the judge against both rankers' result sets for the same test queries.
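For the counterfactual case, once both rankers' result sets have been judged on the same test queries, the comparison reduces to per-segment score deltas. The regression threshold below is illustrative, not a figure from the post.

```python
def compare_rankers(baseline_scores, candidate_scores, min_delta=0.1):
    """Per-segment deltas between two judged rankers (sketch).

    Inputs are {segment: average grade} tables from two judge runs over
    the same test queries. A segment is flagged as a regression when the
    candidate's average drops by more than min_delta (illustrative value).
    """
    deltas, regressions = {}, []
    for segment, base_avg in baseline_scores.items():
        cand_avg = candidate_scores.get(segment, base_avg)
        deltas[segment] = cand_avg - base_avg
        if deltas[segment] < -min_delta:
            regressions.append(segment)
    return deltas, regressions
```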
When not to¶
- High-frequency, behaviourally-specific optimisation. If you have click / dwell / CTR at statistical significance, those are stronger signals than any judge.
- Subjective or multi-dimensional quality axes that don't collapse to one relevance scale — use per-criterion judges (à la Netflix Synopsis Judge) instead.
- Latency-critical online scoring. The pattern is offline; a re-ranker-at-request-time needs different economics.
Cost / quality knobs¶
- Which judge model. GPT-4o in 2026 is the reference; the tier below (multi-modal, smaller context) trades quality for cost. Cost sensitivity is the primary driver for moving down a tier.
- Dedup cache scope. Per-run vs persistent.
- Consensus / sampling. Zalando's description suggests single-shot judgment per pair; consensus scoring (patterns/consensus-scoring) would improve stability at N× cost — not described here.
- Per-criterion decomposition. Zalando uses one rubric for relevance; Netflix splits their judging into four criteria. Depends on whether your quality definition is one-dimensional.
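The consensus knob mentioned above could be as simple as sampling the judge N times per pair and taking the median grade. This is a sketch of that not-described variant, at N× the per-pair cost.

```python
from statistics import median

def consensus_score(query, product, judge_fn, n=3):
    """Median of n independent judge samples for one (query, product) pair.

    Smooths single-shot variance in the 0..4 grade; a sketch of the
    consensus-scoring option the post does not describe.
    """
    return median(judge_fn(query, product) for _ in range(n))
```

Median rather than mean keeps the output on the discrete 0–4 rubric for odd n and is robust to a single outlier judgment.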
Tradeoffs with human annotation¶
Zalando's framing: "Especially so when considering the alternative of human evaluation, which also would take days." The LLM judge is faster and cheaper; the post cites the referenced paper (arXiv:2409.11860) for human-calibration numbers. The quality-vs-cost tradeoff against human annotation is not quantified in this blog post, but it is the load-bearing assumption of the whole pattern's economic case.
Relation to other LLM-as-judge deployments¶
- Netflix Synopsis Judge — creative-quality domain (tone / clarity / precision / factuality), per-criterion specialisation, binary outputs. Different domain, different judge shape.
- Dropbox Dash relevance judge — retrieval relevance with graded scale + NMSE alignment + DSPy (GEPA/MIPROv2) optimisation. Closest sibling to Zalando's case on the relevance axis.
- Instacart PIXEL VLM image judge — in-loop image judge driving iterative prompt refinement (20% → 85% approval). Closest on the visual-text multi-modal axis but inside a generation loop, not an offline QA gate.
- Lyft AI localization Evaluator — drafter + evaluator multi-dim rubric. Same shape as Zalando's post-translation NER-parity sidecar at a different altitude.
Seen in¶
- sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge — canonical wiki instance; Zalando's pre-launch market-validation framework for Luxembourg / Portugal / Greece.
Related¶
- concepts/llm-as-judge
- concepts/visual-text-relevance-judgment
- concepts/pre-launch-market-validation
- systems/zalando-search-quality-framework
- systems/gpt-4o
- patterns/segment-level-relevance-dashboard
- patterns/query-product-evaluation-cache
- patterns/per-market-parallel-taskgroup-dag
- patterns/podoperator-encapsulated-evaluation-job
- systems/netflix-synopsis-judge
- companies/zalando