Zalando — Search Quality Assurance with AI as a Judge

Summary

Zalando's Search & Browse team describes the offline framework they built to validate search quality before launching into a new country with no prior user data. In 2025 Zalando launched its fashion store into Luxembourg, Portugal, and Greece — markets with no historical CTR / search logs and, in two cases, new languages. The team replaced the legacy human-expert + manual-annotation process with a data-driven LLM-as-a-judge pipeline: sample production queries from existing markets, cluster them by NER tags to capture search intent rather than surface form, translate the representative queries to the target language with an LLM, execute them against the pre-launch search stack, and score the result sets with a multi-modal (visual-text) LLM judge producing per-result 0–4 relevance scores. Aggregate scores per NER-tag segment surface three named failure classes (incorrect product attributes / unrecognised terms / undiscoverable products-or-categories) as ranked diagnostic signals before real users ever see them. The pipeline is built on Apache Airflow with one TaskGroup per market running in parallel, a Kubernetes PodOperator for each evaluation job, and a shared ElastiCache (query, product) cache collapsing 5000 × 25 → N expensive product-API + LLM calls where N = |unique products|. One run: ~$250 (GPT-4o completion cost dominant), 3–5 hours, 1,500 search segments × 25 results per market. (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

Key takeaways

  1. Pre-launch markets defeat click-based search QA. For an already-live market, low-CTR queries + search logs are the primary defect signal; for a new market those signals don't exist yet. Zalando's framing: "these signals are by definition not there yet. We need a more proactive approach that ensures quality before launch." This makes pre-launch search quality an LLM-as-judge-shaped problem — the judge scores a static (query, result-set) pair against a rubric without requiring observational behaviour data. (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

  2. Cluster by NER tags, not query text. "Winter boot" and "Boot for winter" are lexically different but share the same search intent. Zalando passes every production query through its NER engine to extract attributes (category / brand / colour / size / season / occasion / material), and groups queries by NER-tag set. Each group represents one search scenario; sampling by group captures scenario diversity without brute-force top-N. This is concepts/ner-clustered-query-sampling — the canonicalising pattern. (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
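
A minimal sketch of the clustering idea, with a toy lexicon standing in for Zalando's NER engine (all names here are hypothetical):

    from collections import defaultdict

    # Toy stand-in for the production NER engine: token -> (tag, value).
    LEXICON = {
        "boot": ("CATEGORY", "boot"),
        "boots": ("CATEGORY", "boot"),
        "winter": ("SEASON", "winter"),
    }

    def ner_tags(query: str) -> frozenset:
        # Map a query to its (tag, value) set; the surface form is discarded.
        return frozenset(LEXICON[t] for t in query.lower().split() if t in LEXICON)

    def cluster_by_intent(queries):
        # "Winter boot" and "Boot for winter" yield the same tag set,
        # hence land in the same search scenario.
        scenarios = defaultdict(list)
        for q in queries:
            scenarios[ner_tags(q)].append(q)
        return scenarios

    # cluster_by_intent(["Winter boot", "Boot for winter"]) -> 1 scenario, 2 queries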

  3. Translate whole scenarios, not individual queries. The LLM-translation step takes a group of paraphrase queries that share NER tags and produces a target-language set that preserves the same intent. "This enables us to reuse search scenarios from existing markets for new markets with different languages, while having translated scenarios keep the same search intents." The NER-tag set is the stable identity across languages — translation is a per-scenario operation, not a per-query one. (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
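
A hedged sketch of per-scenario translation; the prompt wording and call shape are assumed, as the post discloses neither:

    from openai import OpenAI

    client = OpenAI()

    def translate_scenario(tag_set, example_queries, target_language="Portuguese"):
        # Translate the whole group at once, anchored on the NER-tag set,
        # so the target-language set keeps the same search intent.
        prompt = (
            f"These fashion search queries share the search intent {sorted(tag_set)}:\n"
            + "\n".join(example_queries)
            + f"\n\nWrite natural {target_language} queries with the same intent, one per line."
        )
        resp = client.chat.completions.create(
            model="gpt-4o", messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content.splitlines()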

  4. The judge is visual-text and generalised, with no per-attribute prompts. Evaluation context = product data + product images; relevance scoring is instructed by a clear 0–4 scale rubric (4 = perfect match, 0 = completely wrong / irrelevant). "The reasoning is generalised and does not require specific prompts to instruct the LLM to look for specific attributes or specific parts of images." — the judge's reasoning is rubric-driven, not feature-engineered. The production judge model during the pre-market launch process is GPT-4o. (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
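
A minimal sketch of a rubric-driven visual-text judge call; only the 0–4 scale and GPT-4o are disclosed, so the rubric text below is an assumption:

    from openai import OpenAI

    client = OpenAI()

    RUBRIC = (
        "Score how well the product matches the search query on a 0-4 scale: "
        "4 = perfect match, 0 = completely wrong or irrelevant. "
        "Reply with the integer only."
    )

    def judge_relevance(query: str, product_data: dict, image_url: str) -> int:
        # One generalised prompt for every attribute: the rubric drives the
        # reasoning; no per-attribute or per-image-region instructions.
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"{RUBRIC}\nQuery: {query}\nProduct: {product_data}"},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
        )
        return int(resp.choices[0].message.content.strip())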

  5. NER-tag segment is the diagnostic unit. Per-query scores are noisy; aggregating by NER-tag segment produces a stable signal an engineer can triage. Three named failure classes surface as distinct segment-level patterns (aggregation sketch after this list):

  • Incorrect product attributes/data → multiple similar-meaning NER-tag segments consistently low;
  • Unrecognised terms/attributes by NER → target-language query NER tags disagree with source-language NER tags for the same scenario;
  • Undiscoverable products/categories → multiple brand-scoped segments (e.g. BRAND=foo CATEGORY=yoga, BRAND=foo CATEGORY=leggings, ...) all low together → product-data quality issue on that brand. (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
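
A minimal aggregation sketch (schema and threshold assumed, not from the post), showing why the segment, not the query, is the triage unit:

    import pandas as pd

    # One row per judged (segment, query, result) triple; 0-4 judge scores.
    scores = pd.DataFrame({
        "segment": ["BRAND=foo CATEGORY=yoga"] * 2 + ["BRAND=foo CATEGORY=leggings"] * 2,
        "score": [0, 1, 1, 0],
    })

    # Per-segment mean smooths per-query noise into a stable, rankable signal.
    segment_means = scores.groupby("segment")["score"].mean().sort_values()

    # Several low brand-scoped segments together point at brand product data,
    # not at the ranker (hypothetical 2.0 threshold).
    suspect = segment_means[segment_means < 2.0]
    print(suspect)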

  6. (query, product) cache collapses 5000 × 25 calls to N. A naive implementation issues 5000 queries × 25 results = 125,000 product-API + LLM calls. Because result sets overlap, Zalando caches on (query, product) pairs: "Instead of calling Product API (5000 × 25) times for 5000 search queries with 25 results, we only need to call it N times where N is the number of unique products in all search results. This N does not scale as much as the number of search queries increases." The cache is scoped to the evaluation task group (ElastiCache) — not shared with production search. (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
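
A sketch of the dedup shape, with in-process dicts standing in for ElastiCache (interfaces assumed):

    # product_id -> product data; fetched at most once per unique product.
    product_cache: dict[str, dict] = {}
    # (query, product_id) -> 0-4 score; judged at most once per unique pair.
    score_cache: dict[tuple[str, str], int] = {}

    def get_product(product_id, fetch):
        if product_id not in product_cache:
            product_cache[product_id] = fetch(product_id)  # expensive Product API call
        return product_cache[product_id]

    def get_score(query, product_id, judge, fetch):
        key = (query, product_id)
        if key not in score_cache:
            # Expensive LLM-judge call happens once per unique (query, product).
            score_cache[key] = judge(query, get_product(product_id, fetch))
        return score_cache[key]

    # Overlapping result sets mean calls scale with unique products / pairs,
    # not with 5000 queries x 25 results.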

  7. One Airflow TaskGroup per market, PodOperator per job. Zalando orchestrates multiple market evaluations in parallel by placing each market's evaluation lineage in its own TaskGroup; a final consolidation task aggregates results once all TaskGroups finish. Each stage (test-query generation, search-result retrieval, LLM evaluation) runs via PodOperator on Zalando's Kubernetes cluster — keeping DAG code clean and encapsulating evaluation logic + dependencies in the Docker image. (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
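
A hedged sketch of that orchestration shape; the DAG id, image, and arguments are invented, only TaskGroup-per-market, PodOperator-per-stage, and a final consolidation task are disclosed:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
    from airflow.utils.task_group import TaskGroup

    with DAG("search_quality_eval", start_date=datetime(2025, 1, 1), schedule=None):
        consolidate = EmptyOperator(task_id="consolidate_reports")
        for market in ["LU", "PT", "GR"]:
            # One TaskGroup per market; the groups run in parallel.
            with TaskGroup(group_id=f"eval_{market}") as tg:
                stages = [
                    KubernetesPodOperator(
                        task_id=stage,
                        name=f"{stage}-{market.lower()}",
                        image="registry.example/search-eval:latest",  # assumed image
                        arguments=["--stage", stage, "--market", market],
                    )
                    for stage in ["generate_queries", "retrieve_results", "llm_evaluate"]
                ]
                stages[0] >> stages[1] >> stages[2]
            tg >> consolidate  # consolidation waits for every market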

  8. Cross-language NER parity is a first-class failure mode. "The same query should have the same NER tags by meaning across different languages." When a translated query's NER tags disagree with the original's, the NER engine is missing vocabulary in the new language. Disclosed Portuguese / Greek / Spanish-derived examples (parity-check sketch after this list):

  • CATEGORY=desporto (sport): lemmatisation issues on "desporto", "desportivo", "desportiva";
  • CATEGORY=zapatilhas: "tenis", "ténis" (sneaker, PT) collides with "tennis" the sport;
  • GENDER=mulher CATEGORY=menina: "menina", "meninas" unrecognised → mixed-gender result sets;
  • CATEGORY=fato de treino (tracksuit, PT): multi-word category unrecognised → zero sport/tracksuit results. These are not search-relevance bugs in isolation; they are NER-vocabulary bugs diagnosed through judge-measured relevance collapse in the downstream segments. (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
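
A minimal parity-check sketch (signature assumed; ner_tags is any tagging function shaped like the one in the clustering sketch):

    def ner_parity_gaps(scenarios, ner_tags):
        # scenarios: iterable of (source_tag_set, translated_queries) pairs.
        # A translated query whose tags diverge from the source tags means
        # the NER engine lacks target-language vocabulary.
        gaps = []
        for source_tags, translated_queries in scenarios:
            for q in translated_queries:
                tags = ner_tags(q)
                if tags != source_tags:
                    gaps.append((q, source_tags - tags, tags - source_tags))
        return gaps  # (query, missing tags, spurious tags) triples to triage

Under this sketch, a Portuguese "fato de treino" query would surface with CATEGORY missing from its translated tag set.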

  9. Cost and latency disclosure. "The cost per one full run nets around 250 USD, which mainly comes from GPT-4o completion API cost." 1,500 segments × 25 results per market × 3 markets; 3–5 hours per run. Explicit framing: "very cost efficient for the scale of 1,500 search segments with 25 results each. Especially so when considering the alternative of human evaluation, which also would take days." The repeatability claim: "With this setup, we can re-evaluate our search quality as many times as we want" — the judge is a non-degrading quality gate, not a one-shot audit. (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
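
Back-of-envelope (not in the post): 1,500 segments × 25 results = 37,500 judged (query, product) pairs per market, so ~$250 per run works out to well under a cent per judged pair, before (query, product)-cache deduplication reduces the call count further.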

  10. Existing-market regression as a second-order benefit. "Finally, we can now also perform automated in depth validation of existing markets, which enables us to proactively identify regressions and otherwise uncaught issues." The pre-launch pipeline generalises to regression monitoring of already-live markets — post-launch, the same judge cheaply detects ranker / NER / catalog regressions that click-based A/B signals would only surface weeks later or miss entirely in low-volume segments. (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

Architecture at a glance

┌────────────────────────────────────────────────────────────┐
│  Production search events  →  Nakadi event bus              │
│  → Zalando Data Lake (OLAP processing)                      │
└─────────────────────────┬──────────────────────────────────┘
             ┌────────────┴────────────┐
             │  Test query generation   │   (Airflow PodOperator)
             │  ─ NER cluster by tags   │
             │  ─ rank by traffic share │
             │  ─ LLM-translate top-N   │
             └────────────┬────────────┘
             ┌────────────┴────────────┐
             │  Search result retrieval │   (Airflow PodOperator)
             │  ─ submit queries to     │
             │    search microservice   │
             │  ─ cache results in mem  │
             └────────────┬────────────┘
             ┌────────────┴────────────┐
             │  LLM-as-judge evaluation │   (Airflow PodOperator)
             │  ─ visual-text GPT-4o    │
             │  ─ per-result 0–4 score  │
             │  ─ cache (q, p) → score  │◄── ElastiCache
             └────────────┬────────────┘
             ┌────────────┴────────────┐
             │  NER-analyser (parity)   │   (Airflow PodOperator)
             │  ─ source tags vs        │
             │    translated tags       │
             └────────────┬────────────┘
             ┌────────────┴────────────┐
             │  Evaluation report data  │
             │  ─ per-segment avg score │
             │  ─ NER-mismatch diff     │
             └──────────────────────────┘

Parallel TaskGroups: one per market (LU / PT / GR / …)
Final consolidation task aggregates all TaskGroup outputs.

Systems extracted

  • systems/zalando-search-quality-framework (new) — the LLM-as-a-judge evaluation framework itself; the end-to-end Airflow pipeline + PodOperator-packaged evaluation code + ElastiCache (query, product) cache.
  • systems/zalando-search-query-clustering (new) — the upstream NER-clustering + LLM-translation + traffic-share-ranked test-query generator. One of two Airflow TaskGroups in the canonical pipeline.
  • systems/zalando-catalog-search — the search stack under test; pre-launch validation exercises the full substrate for the new market.
  • systems/zalando-ner-query-builder — the NER engine whose tags both cluster input queries and surface cross-language vocabulary gaps as the diagnostic readout.
  • systems/zalando-search-api — the presentation-layer search microservice the PodOperator queries.
  • systems/zalando-base-search — Elasticsearch cluster eventually executing the pre-launch queries.
  • systems/apache-airflow — orchestrator; TaskGroup-per-market parallelism + PodOperator-per-stage encapsulation.
  • systems/aws-elasticache — shared cache holding per-product data and (query, product) → relevance-score entries, scoped to the evaluation tasks.
  • systems/gpt-4o — the judge model during pre-market launch. Multi-modal (visual + text) input for product data + images.
  • systems/nakadi — the event-bus carrying production search traffic into the Data Lake.
  • systems/kubernetes — substrate for PodOperator-scheduled evaluation containers.

Concepts extracted
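
  • concepts/ner-clustered-query-sampling — clustering production queries by their NER-tag sets so sampling covers search intents rather than surface forms (see takeaway 2).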

Patterns extracted

Operational numbers

Dimension                           Value                                           Source
Markets validated                   3 (Luxembourg / Portugal / Greece)              post
Search segments per market per run  1,500                                           post
Results per segment scored          25                                              post
Cost per full run                   ~250 USD                                        post
Runtime per run                     3–5 hours                                       post
Judge model                         GPT-4o                                          post
Dominant cost driver                GPT-4o completion API                           post
Eval cache scope                    per-evaluation, ElastiCache                     post
Cache dedup key                     (query, product) pair                           post
Orchestrator                        Apache Airflow                                  post
Parallelism axis                    TaskGroup per market                            post
Compute substrate                   Kubernetes via PodOperator                      post
Event pipeline                      production search events → Nakadi → Data Lake   post
Relevance scale                     0–4 (4 = perfect match)                         post

Caveats and gaps

  • Human-calibration protocol undisclosed. Unlike Netflix's Synopsis Judge or Dropbox Dash's relevance-judge work, this post does not describe a golden set the LLM judge is calibrated against, nor cite inter-rater-agreement numbers vs human annotators. The referenced paper (arXiv:2409.11860) is cited as the source for both the judging methodology and "more details of the LLM vs. human cost for annotation", but this blog post does not quote human-alignment headline numbers.
  • No prompt / APO / consensus-sampling details. Whether tiered rationales, consensus scoring, per-criterion judges, or automatic prompt optimisation are used is unstated. The claim "The reasoning is generalised and does not require specific prompts to instruct the LLM to look for specific attributes or specific parts of images" suggests a single rubric-driven prompt, not the per-criterion specialisation Netflix uses.
  • Per-market accuracy numbers absent. The post shows qualitative Portuguese and Greek diagnostic tables but no per-language accuracy headline — contrast with Netflix's 86.55% → 87.85% on tone or Dropbox Dash's NMSE.
  • Judge-vs-human cost ratio. Post names "days" for human evaluation and $250 for one LLM-judge run, but no quantified ratio; the referenced paper is the better datum.
  • Rollout to existing-market regression detection is framed as a capability claim ("we can now also perform") — no quantified regression-detection yield disclosed yet.
  • No disclosure on how low-scoring segments are acted on beyond three named failure classes. No mention of automated ticketing, alerting thresholds, launch-blocking gates, or target-score requirements before launch approval.
  • Luxembourg shortcut disclosed. "For some markets like Luxembourg, we can directly use the English and French queries from our existing markets without a translation." — the translation step is conditional on target-language novelty.
  • Scale of NER-tag segment space undisclosed. 1,500 segments per market per run is a sample; the total number of NER-tag-defined scenarios across Zalando's catalogue is not quoted.

Source
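
sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge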
