
Zalando Search Query Clustering

Identity

Search Query Clustering is the upstream test-query generator that feeds Zalando's Search Quality Framework. Its job is to produce, per target market, a representative NER-tag-segmented set of test queries sampled from existing-market production traffic, translated where necessary, and ranked by traffic share.

It is the first Airflow TaskGroup stage of the pipeline; everything downstream (result retrieval, LLM judging) operates on its output.

(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

Why it exists

"We need to draw sample test queries from search data which cover as wide range of search scenarios as possible. We should not only take most N frequent query terms, as different forms of queries may mean the same thing, e.g. 'Winter boot' and 'Boot for winter' are essentially the same search intent. They should belong together and be counted as the same search scenario. Therefore we need a good clustering approach."

Raw top-N frequency sampling over-counts paraphrases of the same search intent and under-covers rarer but distinct scenarios. The clustering axis has to be search intent, not lexical form — which is what the NER engine's extracted tag sets already provide.
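A toy sketch of the over-counting problem, using made-up queries and illustrative tag assignments (not Zalando's real NER output): naive top-N frequency sampling spends two of its N slots on spellings of the same intent, while grouping by tag set first counts each intent once.

```python
from collections import Counter

# Hypothetical query log: paraphrases of one intent dominate the raw counts.
log = (["winter boot"] * 40 + ["boot for winter"] * 35
       + ["rain boot"] * 30 + ["leather sandal"] * 25)

# Naive top-2 by raw frequency picks two spellings of the SAME intent...
top2_raw = [q for q, _ in Counter(log).most_common(2)]
print(top2_raw)  # ['winter boot', 'boot for winter']

# ...while grouping by a (hypothetical) NER-tag set counts each intent once.
ner = {  # illustrative tag assignments, not real NER-engine output
    "winter boot": ("type:boot", "season:winter"),
    "boot for winter": ("type:boot", "season:winter"),
    "rain boot": ("type:boot", "feature:rain"),
    "leather sandal": ("type:sandal", "material:leather"),
}
top2_intent = [t for t, _ in Counter(ner[q] for q in log).most_common(2)]
print(top2_intent)  # [('type:boot', 'season:winter'), ('type:boot', 'feature:rain')]
```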

Inputs

  • Production search events. Zalando's search infrastructure publishes every processed search request through Nakadi (its RESTful event bus built on Kafka). Continuously running Data-Lake pipelines "consume these event streams; process the data; and persist the data into a Data Lake for analysis, reporting, and archival". Search Query Clustering runs on top of this Data Lake.
  • NER-tag extractions already attached to each query from the production NER engine — no extra NER inference at clustering time for existing-market queries.

Pipeline stages

  1. Cluster by NER-tag set. Group queries sharing the same NER-tag attributes (e.g. {category: kids, type: jacket, season: winter}) into a single scenario. All three of "Kids Winter Jacket", "Winter Jackets for Kids", "Kids Jackets Winter" map to one scenario.
  2. Rank by traffic share. "Since search intent distribution differs across markets, we select the top N search groups by NER tags (representing search intent/topic) ranked by traffic share." Traffic-share weighting ensures the sampled segments reflect where users actually spend time, not just where the long-tail distribution lives.
  3. LLM translation (conditional). For any scenario whose source language does not match the target market's language, translate the paraphrase queries with an LLM. The NER-tag set is the stable identity across languages — translation is scenario-scoped, not per-query. "For some markets like Luxembourg, we can directly use the English and French queries from our existing markets without a translation."
  4. Emit test-query set. Per-market, per-run: N scenarios (1,500 in the disclosed production case), each with M paraphrase queries, tagged with their NER-tag set for downstream segment-level aggregation.
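The four stages above can be sketched roughly as follows. The event shape, field names, and the `translate` callable are assumptions for illustration, not the disclosed implementation; the post names no data structures.

```python
from collections import defaultdict

def build_test_set(events, top_n, target_lang, translate):
    """Sketch of the cluster / rank / translate / emit stages. `events` are
    assumed to carry the production NER tags and a source-language code;
    `translate` stands in for the (undisclosed) LLM translation call."""
    scenarios = defaultdict(lambda: {"queries": set(), "traffic": 0, "lang": None})
    for ev in events:
        # Stage 1: the NER-tag set IS the scenario identity.
        key = frozenset(ev["ner_tags"].items())
        s = scenarios[key]
        s["queries"].add(ev["query"])
        s["traffic"] += 1
        s["lang"] = ev["lang"]

    # Stage 2: rank scenarios by traffic share, keep the top N.
    total = sum(s["traffic"] for s in scenarios.values())
    ranked = sorted(scenarios.items(), key=lambda kv: kv[1]["traffic"],
                    reverse=True)[:top_n]

    out = []
    for tags, s in ranked:
        queries = sorted(s["queries"])
        # Stage 3: translate only when source and target languages differ
        # (e.g. Luxembourg reuses EN/FR queries untranslated).
        if s["lang"] != target_lang:
            queries = [translate(q, target_lang) for q in queries]
        # Stage 4: emit the scenario with its tag set for downstream
        # segment-level aggregation.
        out.append({"ner_tags": dict(tags),
                    "traffic_share": s["traffic"] / total,
                    "queries": queries})
    return out
```

With `top_n=1500` per market this would yield the disclosed 1,500-segment test sets; the example keeps translation pluggable because the model is not named.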

Disclosed scale

  • 1,500 search segments per market per run. "We tested 3 markets with the 1,500 most searched segments in each of them."
  • Three markets in parallel per run in the 2025 launches (Luxembourg, Portugal, Greece).

Example translation output (from the post)

One row per scenario (NER tags, then EN paraphrase queries and their PT translations):

  • {category: kids, type: jacket, season: winter}
    EN: Kids Winter Jacket · Winter Jackets for Kids · Kids Jackets Winter
    PT: Jaqueta de Inverno Infantil · Jaquetas de Inverno para Crianças · Jaquetas Infantis de Inverno
  • {category: shoes, brand: nike, type: sneakers}
    EN: Nike Sneakers · Nike Shoes · Nike Sneaker
    PT: Nike Sapatilhas · Nike Sapatos · Nike Sapatilha
  • {category: dress, occasion: party, color: black}
    EN: Black Party Dress · Party Dresses Black · Black Dress for Party
    PT: Vestido de Festa Preto · Vestidos de Festa Pretos · Vestido Preto para Festa

Output coupling to the NER-analyser

A sibling Airflow task — the NER analyser — consumes both the source-language queries and the translated queries and runs them through the NER engine for the target language. "This allows us to compare the NER tags of the original search query and the translated search query, and identify inconsistencies that can lead to search issues, such as missing tags or incorrectly tagged attributes in the new language."

The NER-parity comparison is its own diagnostic signal — not part of relevance scoring, but run in parallel against the same translated query set.
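A minimal sketch of such a parity check, assuming tags arrive as attribute-to-value dicts (the real engine's output format is not disclosed): it surfaces the inconsistencies the post names, missing tags and incorrectly tagged attributes, plus spurious extra tags.

```python
def ner_parity_diff(orig_tags, translated_tags):
    """Compare the NER tags extracted from the original query against those
    extracted from its translation; return a list of inconsistencies."""
    issues = []
    for attr, value in orig_tags.items():
        if attr not in translated_tags:
            # Tag present in the source language but lost in translation.
            issues.append(("missing_tag", attr, value))
        elif translated_tags[attr] != value:
            # Attribute survived but was tagged with a different value.
            issues.append(("incorrect_tag", attr, translated_tags[attr]))
    for attr in translated_tags.keys() - orig_tags.keys():
        # Attribute the target-language engine invented.
        issues.append(("extra_tag", attr, translated_tags[attr]))
    return issues
```

For example, `ner_parity_diff({"type": "jacket", "season": "winter"}, {"type": "jacket"})` flags the dropped `season` tag, which is exactly the kind of signal that would indicate a search issue in the new language.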

Known gaps

  • Clustering granularity and M — the source discloses N (1,500 segments) but not M, the number of paraphrase queries per scenario, nor the threshold that defines what counts as "the same" NER-tag set (stemming? synonym merging?).
  • Traffic-share denominator scope — whether shares are computed per market of origin or over the aggregated corpus isn't stated; for markets sharing a language the choice matters.
  • Translation-model identity not disclosed. "We can translate these queries to other languages using an LLM" — no model name.
