PATTERN
LLM-as-Judge for Search Quality¶
Intent¶
Evaluate a search stack's relevance quality without relying on
user click signal by having a multi-modal LLM judge score
every (query, returned product) pair in a representative test
set on a graded rubric, and aggregating the scores up to a
segment- / market- / stack-level quality report.
The pattern replaces (or augments) click-based bucket tests and human-annotator panels as the short-loop signal for relevance regressions and pre-launch quality.
(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
Structure¶
test queries (sampled, clustered, translated)
│
▼
┌───────────────────────┐
│ Query → Search API │──────► result set (top-K products)
└───────────────────────┘
│
▼
for each (query, product):
fetch product data + image (cached)
ask LLM judge:
"Score relevance on 0–4 scale"
record score (cached)
│
▼
aggregate by segment (NER-tag set) → per-segment avg
aggregate by market → per-market avg
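The loop above can be sketched in a few lines. `run_search`, `fetch_product`, `judge_relevance`, and `segment_of` are hypothetical stand-ins for the search API, the product store, the LLM judge call, and the NER-tag segment lookup; none of these names come from the post.

```python
from collections import defaultdict
from statistics import mean

def evaluate(queries, run_search, fetch_product, judge_relevance,
             segment_of, top_k=25):
    """Score every (query, product) pair once and aggregate per segment.

    All four callables are hypothetical stand-ins for the real services.
    """
    scores = {}                      # (query, product_id) -> 0..4 grade
    by_segment = defaultdict(list)
    for query in queries:
        for product_id in run_search(query)[:top_k]:
            key = (query, product_id)
            if key not in scores:                       # dedup cache
                product = fetch_product(product_id)     # data + image
                scores[key] = judge_relevance(query, product)  # 0..4
            by_segment[segment_of(query)].append(scores[key])
    # per-segment averages; a per-market rollup would average over these
    return {seg: mean(vals) for seg, vals in by_segment.items()}
```

The in-memory dict only dedups within a run; the persistent variant is covered by patterns/query-product-evaluation-cache.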
Core design choices¶
- Multi-modal judge. Product image + product data as evaluation context. See concepts/visual-text-relevance-judgment.
- Graded rubric output (0–4), not binary. 4 = perfect match / 0 = completely wrong; intermediate grades capture partial relevance. Graded output lets segment aggregates be continuous rather than pass-rate-only.
- Generalised prompt, not per-attribute rubrics. "The reasoning is generalised and does not require specific prompts to instruct the LLM to look for specific attributes or specific parts of images."
- (query, product) dedup cache. See patterns/query-product-evaluation-cache.
- Segment-level aggregation. See patterns/segment-level-relevance-dashboard.
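A minimal sketch of the dedup cache as a persistent store, so re-runs and overlapping result sets only pay for unseen pairs. The SQLite backing and the (query, product, model) key are assumptions for illustration, not details from the post.

```python
import sqlite3

class JudgmentCache:
    """Persistent (query, product, model) -> score cache (sketch)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS judgments ("
            "query TEXT, product_id TEXT, model TEXT, score INTEGER, "
            "PRIMARY KEY (query, product_id, model))")

    def get_or_judge(self, query, product_id, model, judge_fn):
        # Return the cached grade if this pair was judged before;
        # otherwise call the (expensive) judge and store the result.
        row = self.db.execute(
            "SELECT score FROM judgments "
            "WHERE query=? AND product_id=? AND model=?",
            (query, product_id, model)).fetchone()
        if row:
            return row[0]
        score = judge_fn(query, product_id)
        self.db.execute("INSERT INTO judgments VALUES (?,?,?,?)",
                        (query, product_id, model, score))
        self.db.commit()
        return score
```

Keying on the judge model means a model upgrade naturally invalidates old judgments instead of silently mixing scales.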
Zalando's concrete realisation¶
- Judge: GPT-4o for pre-market-launch evaluation.
- Test-set construction: production queries from existing markets, clustered by NER-tag set (concepts/ner-clustered-query-sampling), LLM-translated to the target language where needed.
- Scope: 1,500 segments × 25 results per market; 3 markets in parallel.
- Cost / time: $250 per run, 3–5 hours.
- Orchestration: Airflow TaskGroup-per-market + KubernetesPodOperator-per-stage. See patterns/per-market-parallel-taskgroup-dag and patterns/podoperator-encapsulated-evaluation-job.
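The quoted numbers imply the following back-of-envelope scale. Whether "$250 per run" covers one market or all three is not stated in the post, so the per-judgment figure below assumes one market per run.

```python
# Back-of-envelope from the quoted figures (assumption: one run = one market).
segments_per_market = 1_500
results_per_segment = 25
markets = 3

pairs_per_market = segments_per_market * results_per_segment   # 37,500 judgments
total_pairs = pairs_per_market * markets                       # 112,500 across markets

cost_per_run_usd = 250
cost_per_judgment = cost_per_run_usd / pairs_per_market        # under a cent each
```

Either way the reading goes, the per-judgment cost lands well below a cent, which is what makes daily re-runs viable.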
When to reach for this pattern¶
- Pre-launch new-market / new-locale evaluation. No click signal available; LLM-judge is the structural substitute. See concepts/pre-launch-market-validation.
- Regression detection on low-traffic segments. Click statistics too noisy; offline LLM-judge run cheap enough to re-run daily.
- Counterfactual ranker evaluation. Before rolling out a new ranker, run the judge against both rankers' result sets for the same test queries.
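For the counterfactual case, once both rankers' result sets have been judged on the same test queries, the comparison reduces to per-segment score deltas. The regression threshold below is illustrative, not a figure from the post.

```python
def compare_rankers(baseline_scores, candidate_scores, min_delta=0.1):
    """Per-segment deltas between two judged rankers (sketch).

    Inputs are {segment: average grade} tables from two judge runs over
    the same test queries. A segment is flagged as a regression when the
    candidate's average drops by more than min_delta (illustrative value).
    """
    deltas, regressions = {}, []
    for segment, base_avg in baseline_scores.items():
        cand_avg = candidate_scores.get(segment, base_avg)
        deltas[segment] = cand_avg - base_avg
        if deltas[segment] < -min_delta:
            regressions.append(segment)
    return deltas, regressions
```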
When not to¶
- High-frequency, behaviourally-specific optimisation. If you have click / dwell / CTR at statistical significance, those are stronger signals than any judge.
- Subjective or multi-dimensional quality axes that don't collapse to one relevance scale — use per-criterion judges (à la Netflix Synopsis Judge) instead.
- Latency-critical online scoring. The pattern is offline; a re-ranker-at-request-time needs different economics.
Cost / quality knobs¶
- Which judge model. GPT-4o in 2026 is the reference; the tier below (multi-modal, smaller context) trades quality for cost. Cost sensitivity is the primary driver for moving down a tier.
- Dedup cache scope. Per-run vs persistent.
- Consensus / sampling. Zalando's description suggests single-shot judgment per pair; consensus scoring (patterns/consensus-scoring) would improve stability at N× cost — not described here.
- Per-criterion decomposition. Zalando uses one rubric for relevance; Netflix splits their judging into four criteria. Depends on whether your quality definition is one-dimensional.
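The consensus knob mentioned above could be as simple as sampling the judge N times per pair and taking the median grade. This is a sketch of that not-described variant, at N× the per-pair cost.

```python
from statistics import median

def consensus_score(query, product, judge_fn, n=3):
    """Median of n independent judge samples for one (query, product) pair.

    Smooths single-shot variance in the 0..4 grade; a sketch of the
    consensus-scoring option the post does not describe.
    """
    return median(judge_fn(query, product) for _ in range(n))
```

Median rather than mean keeps the output on the discrete 0–4 rubric for odd n and is robust to a single outlier judgment.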
Tradeoffs with human annotation¶
Zalando's framing: "Especially so when considering the alternative of human evaluation, which also would take days." The LLM judge is faster and cheaper; the post cites the referenced paper (arXiv:2409.11860) for human-calibration numbers. The quality-vs-cost tradeoff against human annotation is not quantified in this blog post, but it is the load-bearing assumption of the whole pattern's economic case.
Relation to other LLM-as-judge deployments¶
- Netflix Synopsis Judge — creative-quality domain (tone / clarity / precision / factuality), per-criterion specialisation, binary outputs. Different domain, different judge shape.
- Dropbox Dash relevance judge — retrieval relevance with graded scale + NMSE alignment + DSPy (GEPA/MIPROv2) optimisation. Closest sibling to Zalando's case on the relevance axis.
- Instacart PIXEL VLM image judge — in-loop image judge driving iterative prompt refinement (20% → 85% approval). Closest on the visual-text multi-modal axis but inside a generation loop, not an offline QA gate.
- Lyft AI localization Evaluator — drafter + evaluator multi-dim rubric. Same shape as Zalando's post-translation NER-parity sidecar at a different altitude.
Seen in¶
- sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge — canonical wiki instance; Zalando's pre-launch market-validation framework for Luxembourg / Portugal / Greece.
Related¶
- concepts/llm-as-judge
- concepts/visual-text-relevance-judgment
- concepts/pre-launch-market-validation
- systems/zalando-search-quality-framework
- systems/gpt-4o
- patterns/segment-level-relevance-dashboard
- patterns/query-product-evaluation-cache
- patterns/per-market-parallel-taskgroup-dag
- patterns/podoperator-encapsulated-evaluation-job
- systems/netflix-synopsis-judge
- companies/zalando