

Zalando Search Quality Framework

Identity

The Search Quality Framework is Zalando's offline LLM-as-a-judge evaluation pipeline for the catalog-search substrate. Its canonical production role is pre-launch market validation: given a target market with no prior user data, it produces per-segment relevance scores that surface search defects weeks or months before real users would see them.

It is the evaluation-side companion to the serving-side catalog-search stack. It does not run in the online request path — it is an Airflow-orchestrated offline batch that queries production search microservices as a black-box client.

(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

What it replaces

Before this framework, pre-launch search QA was "heavily reliant on human experts and a manual process":

"Due to the fact that we do not know which search queries may work well or not in the new markets because they are not live yet, we have to draw sample search queries from the existing markets, and translate them if the new market is operating in different language and test the search system manually. Human experts have to annotate error cases, and identify cases where search returns poor quality results."

Zalando's own diagnosis: "Not only is this process not scalable, but it is also reactive by nature, meaning that issues are only identified after features are launched and users have already experienced them, since we rely on signals coming from real users such as low CTR. For an entirely new country, these signals are by definition not there yet."

Architecture

Three Airflow stages, each packaged as a Docker image and scheduled via KubernetesPodOperator:

  1. Test query generation — an upstream Search Query Clustering pipeline produces per-market, NER-tag-segmented test queries, ranked by traffic share, with LLM translation applied when the target market operates in a new language.
  2. Search result retrieval — the PodOperator runs through the test queries and submits them to the Search API microservice (or the market-specific Base Search entry point under test), retrieving the top-25 result set per query.
  3. LLM evaluation — the PodOperator submits every (query, result) pair to GPT-4o with product data + images as evaluation context; the judge returns a 0–4 relevance score per result under a clear rubric (4 = perfect match, 0 = completely wrong / irrelevant).
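
The post does not publish the judge prompt. The following is a minimal sketch of the stage-3 call, assuming the official openai Python SDK; the rubric wording, the judge_result helper, and the product/image fields are illustrative assumptions, not Zalando's actual prompt:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    RUBRIC = (
        "You are a search-relevance judge. Given a search query and one returned "
        "product (attributes plus image), reply with a single integer from 0 to 4: "
        "4 = perfect match, 0 = completely wrong / irrelevant. Reply with the integer only."
    )

    def judge_result(query: str, product: dict, image_url: str) -> int:
        """Score one (query, result) pair with GPT-4o, as in stage 3."""
        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=0,
            messages=[
                {"role": "system", "content": RUBRIC},
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": f"Query: {query}\nProduct data: {product}"},
                        {"type": "image_url", "image_url": {"url": image_url}},
                    ],
                },
            ],
        )
        return int(response.choices[0].message.content.strip())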

A fourth Airflow task analyses NER-tag agreement between source-language and target-language queries (concepts/ner-tag-parity-across-languages) — a structural diagnostic running in parallel with the relevance-scoring stage.
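
The post does not define the agreement metric. A plausible minimal sketch is set overlap (Jaccard) between the NER tags extracted from a source-language query and from its translation; the function and tag names below are illustrative:

    def ner_tag_agreement(source_tags: dict[str, str], target_tags: dict[str, str]) -> float:
        """Jaccard agreement between the NER-tag keys of a query and its translation."""
        source_keys, target_keys = set(source_tags), set(target_tags)
        if not source_keys and not target_keys:
            return 1.0
        return len(source_keys & target_keys) / len(source_keys | target_keys)

    # Example: BRAND survives translation but CATEGORY is lost -> agreement 0.5.
    ner_tag_agreement({"BRAND": "foo", "CATEGORY": "yoga"}, {"BRAND": "foo"})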

Market parallelism is achieved by placing each market's three-stage lineage inside its own TaskGroup, all consolidating into a final aggregation task. (concepts/airflow-taskgroup-parallelism.)
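
A minimal sketch of that DAG shape, assuming Airflow 2.x with the cncf.kubernetes provider; the DAG id, image names, and task ids are illustrative assumptions:

    import pendulum
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
    from airflow.utils.task_group import TaskGroup

    MARKETS = ["LU", "PT", "GR"]

    with DAG(
        dag_id="search_quality_framework",
        start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
        schedule=None,  # triggered per validation run rather than on a fixed cadence
    ):
        # Placeholder for the final consolidation task that all markets feed into.
        aggregate = EmptyOperator(task_id="aggregate_report")

        for market in MARKETS:
            with TaskGroup(group_id=f"market_{market}") as group:
                stages = [
                    KubernetesPodOperator(
                        task_id=stage,
                        name=f"{stage.replace('_', '-')}-{market.lower()}",
                        image=f"search-quality/{stage}:latest",  # assumed image naming
                        arguments=["--market", market],
                    )
                    for stage in ("query_generation", "result_retrieval", "llm_evaluation")
                ]
                ner_analyser = KubernetesPodOperator(
                    task_id="ner_tag_parity",
                    name=f"ner-tag-parity-{market.lower()}",
                    image="search-quality/ner_analyser:latest",  # assumed image naming
                    arguments=["--market", market],
                )
                # Three-stage lineage; the NER parity check runs alongside the
                # relevance-scoring stage.
                stages[0] >> stages[1] >> [stages[2], ner_analyser]

            group >> aggregate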

Evaluation cache

A shared ElastiCache instance, accessible only to the evaluation tasks, stores:

  • product → product-data / image fetches from the Product API.
  • (query, product) → relevance score emissions from GPT-4o.

Relevance scores are keyed per (query, product) pair; product data needs only the product as key, which is where the cost collapse comes from. It is quoted directly: "Instead of calling Product API (5000 × 25) times for 5000 search queries with 25 results, we only need to call it N times where N is the number of unique products in all search results. This N does not scale as much as the number of search queries increases."

Because the cache is scoped to evaluation (not shared with production catalog-search caches), it cannot mask live serving misbehaviour — the judge sees the production result set as it is.
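
A minimal sketch of that cache behaviour, assuming an ElastiCache-for-Redis endpoint and the redis Python client; the hostname, key formats, and the fetch_product / score_with_gpt4o stand-ins for the Product API and the GPT-4o judge are all assumptions:

    import json
    import redis

    cache = redis.Redis(host="eval-cache.example.internal", port=6379, decode_responses=True)

    def fetch_product(product_id: str) -> dict:
        """Stand-in for the Product API call (assumed helper)."""
        raise NotImplementedError

    def score_with_gpt4o(query: str, product: dict) -> int:
        """Stand-in for the GPT-4o judge call (assumed helper)."""
        raise NotImplementedError

    def cached_product(product_id: str) -> dict:
        """Product data / images are fetched once per unique product, not per query."""
        key = f"product:{product_id}"
        hit = cache.get(key)
        if hit is not None:
            return json.loads(hit)
        product = fetch_product(product_id)
        cache.set(key, json.dumps(product))
        return product

    def cached_relevance(query: str, product_id: str) -> int:
        """Relevance scores are keyed per (query, product) pair."""
        key = f"score:{query}:{product_id}"
        hit = cache.get(key)
        if hit is not None:
            return int(hit)
        score = score_with_gpt4o(query, cached_product(product_id))
        cache.set(key, str(score))
        return score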

Output: segment-level relevance report

The framework's output is a per-segment aggregate relevance score plus per-result breakdowns. A segment is one NER-tag set (e.g. CATEGORY=desporto, BRAND=foo CATEGORY=yoga). Three named failure classes surface as different segment-level patterns:

  1. Incorrect product attributes / data — products in categories with incorrect attributes fail to surface despite query variations; multiple similar-meaning NER-tag segments score consistently low.
  2. Unrecognised terms / attributes by NER — the NER-analyser task identifies unrecognised terms; cross-checked with spell-correction and lemmatisation decisions in the new language.
  3. Undiscoverable products / categories — multiple brand-scoped segments sharing a brand tag all score low together, indicating product-data quality issues for that brand.
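
The post gives the report shape but not the aggregation formula. The sketch below assumes a plain mean over the 0–4 scores per NER-tag segment, using pandas; the column names, example rows, and flag threshold are illustrative (the post names no launch-gate threshold):

    import pandas as pd

    # One row per judged (query, result) pair; data is illustrative.
    scores = pd.DataFrame(
        {
            "segment": ["CATEGORY=desporto", "CATEGORY=desporto", "BRAND=foo CATEGORY=yoga"],
            "query": ["sapatilhas desporto", "calções desporto", "foo leggings yoga"],
            "product_id": ["p1", "p2", "p3"],
            "score": [4, 1, 0],
        }
    )

    report = (
        scores.groupby("segment")["score"]
        .agg(mean_score="mean", results_scored="count")
        .reset_index()
    )
    # Purely illustrative flag threshold; the post quantifies no gate.
    report["flagged"] = report["mean_score"] < 2.0

    print(report.sort_values("mean_score"))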

Production deployment window

The post names GPT-4o as the judge "during pre-market launch process" — the framework was active at least through Zalando's 2025 Luxembourg / Portugal / Greece launches, with three markets × 1,500 segments × 25 results per run. Post-launch framing: "we can now also perform automated in depth validation of existing markets, which enables us to proactively identify regressions and otherwise uncaught issues" — same pipeline, now applied to live markets as a regression detector.

Cost and cadence

  • Cost per run: ~$250 USD (GPT-4o completions dominant)
  • Runtime per run: 3–5 hours
  • Segments per market per run: 1,500
  • Results scored per segment: 25
  • Markets validated in parallel: 3 (LU / PT / GR)
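
Back-of-envelope scale implied by these numbers (not stated in the post), assuming a single run covers all three markets:

    markets, segments_per_market, results_per_segment = 3, 1_500, 25
    judged_pairs = markets * segments_per_market * results_per_segment  # 112,500 (query, result) pairs per run
    cost_per_judged_result = 250 / judged_pairs                         # roughly $0.002 each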

One-time infrastructure cost (pipeline setup) vs re-runnable (no handcrafted test cases): "The investment to set up the infrastructure was a one-time cost, and no handcrafted test cases were necessary. With this setup, we can re-evaluate our search quality as many times as we want."

Known gaps (undisclosed in the source)

  • No golden-set or human-calibration numbers in the blog post — referenced paper (arXiv:2409.11860) is the expected source but not quoted.
  • No prompt-engineering details (tiered rationales, consensus scoring, per-criterion judges).
  • No launch-gate threshold — no quantified criterion like "average segment score ≥ X before we launch".
  • No disclosed per-language accuracy or inter-rater-agreement numbers.
