
Zalando Search Query Clustering

Identity

Search Query Clustering is the upstream test-query generator that feeds Zalando's Search Quality Framework. Its job is to produce, per target market, a representative NER-tag-segmented set of test queries sampled from existing-market production traffic, translated where necessary, and ranked by traffic share.

It is the first Airflow TaskGroup stage of the pipeline; everything downstream (result retrieval, LLM judging) operates on its output.

(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

Why it exists

"We need to draw sample test queries from search data which cover as wide range of search scenarios as possible. We should not only take most N frequent query terms, as different forms of queries may mean the same thing, e.g. 'Winter boot' and 'Boot for winter' are essentially the same search intent. They should belong together and be counted as the same search scenario. Therefore we need a good clustering approach."

Raw top-N frequency sampling over-counts paraphrases of the same search intent and under-covers rarer but distinct scenarios. The clustering axis has to be search intent, not lexical form — which is what the NER engine's extracted tag sets already provide.
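A toy sketch of the over-counting problem, using made-up queries and illustrative tag assignments (not Zalando's real NER output): naive top-N frequency sampling spends two of its N slots on spellings of the same intent, while grouping by tag set first counts each intent once.

```python
from collections import Counter

# Hypothetical query log: paraphrases of one intent dominate the raw counts.
log = (["winter boot"] * 40 + ["boot for winter"] * 35
       + ["rain boot"] * 30 + ["leather sandal"] * 25)

# Naive top-2 by raw frequency picks two spellings of the SAME intent...
top2_raw = [q for q, _ in Counter(log).most_common(2)]
print(top2_raw)  # ['winter boot', 'boot for winter']

# ...while grouping by a (hypothetical) NER-tag set counts each intent once.
ner = {  # illustrative tag assignments, not real NER-engine output
    "winter boot": ("type:boot", "season:winter"),
    "boot for winter": ("type:boot", "season:winter"),
    "rain boot": ("type:boot", "feature:rain"),
    "leather sandal": ("type:sandal", "material:leather"),
}
top2_intent = [t for t, _ in Counter(ner[q] for q in log).most_common(2)]
print(top2_intent)  # [('type:boot', 'season:winter'), ('type:boot', 'feature:rain')]
```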

Inputs

  • Production search events. Zalando's search infrastructure publishes every processed search request through Nakadi (its RESTful event bus built on Kafka). Continuously running Data-Lake pipelines "consume these event streams; process the data; and persist the data into a Data Lake for analysis, reporting, and archival". Search Query Clustering runs on top of this Data Lake.
  • NER-tag extractions already attached to each query from the production NER engine — no extra NER inference at clustering time for existing-market queries.

Pipeline stages

  1. Cluster by NER-tag set. Group queries sharing the same NER-tag attributes (e.g. {category: kids, type: jacket, season: winter}) into a single scenario. All three of "Kids Winter Jacket", "Winter Jackets for Kids", "Kids Jackets Winter" map to one scenario.
  2. Rank by traffic share. "Since search intent distribution differs across markets, we select the top N search groups by NER tags (representing search intent/topic) ranked by traffic share." Traffic-share weighting ensures the sampled segments reflect where users actually spend time, not just where the long-tail distribution lives.
  3. LLM translation (conditional). For any scenario whose source language does not match the target market's language, translate the paraphrase queries with an LLM. The NER-tag set is the stable identity across languages — translation is scenario-scoped, not per-query. "For some markets like Luxembourg, we can directly use the English and French queries from our existing markets without a translation."
  4. Emit test-query set. Per-market, per-run: N scenarios (1,500 in the disclosed production case), each with M paraphrase queries, tagged with their NER-tag set for downstream segment-level aggregation.
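The four stages above can be sketched roughly as follows. The event shape, field names, and the `translate` callable are assumptions for illustration, not the disclosed implementation; the post names no data structures.

```python
from collections import defaultdict

def build_test_set(events, top_n, target_lang, translate):
    """Sketch of the cluster / rank / translate / emit stages. `events` are
    assumed to carry the production NER tags and a source-language code;
    `translate` stands in for the (undisclosed) LLM translation call."""
    scenarios = defaultdict(lambda: {"queries": set(), "traffic": 0, "lang": None})
    for ev in events:
        # Stage 1: the NER-tag set IS the scenario identity.
        key = frozenset(ev["ner_tags"].items())
        s = scenarios[key]
        s["queries"].add(ev["query"])
        s["traffic"] += 1
        s["lang"] = ev["lang"]

    # Stage 2: rank scenarios by traffic share, keep the top N.
    total = sum(s["traffic"] for s in scenarios.values())
    ranked = sorted(scenarios.items(), key=lambda kv: kv[1]["traffic"],
                    reverse=True)[:top_n]

    out = []
    for tags, s in ranked:
        queries = sorted(s["queries"])
        # Stage 3: translate only when source and target languages differ
        # (e.g. Luxembourg reuses EN/FR queries untranslated).
        if s["lang"] != target_lang:
            queries = [translate(q, target_lang) for q in queries]
        # Stage 4: emit the scenario with its tag set for downstream
        # segment-level aggregation.
        out.append({"ner_tags": dict(tags),
                    "traffic_share": s["traffic"] / total,
                    "queries": queries})
    return out
```

With `top_n=1500` per market this would yield the disclosed 1,500-segment test sets; the example keeps translation pluggable because the model is not named.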

Disclosed scale

  • 1,500 search segments per market per run. "We tested 3 markets with the 1,500 most searched segments in each of them."
  • Three markets in parallel per run in the 2025 launches (Luxembourg, Portugal, Greece).

Example translation output (from the post)

One row per scenario (NER tags, then EN paraphrase queries and their PT translations):

  • {category: kids, type: jacket, season: winter}
    EN: Kids Winter Jacket · Winter Jackets for Kids · Kids Jackets Winter
    PT: Jaqueta de Inverno Infantil · Jaquetas de Inverno para Crianças · Jaquetas Infantis de Inverno
  • {category: shoes, brand: nike, type: sneakers}
    EN: Nike Sneakers · Nike Shoes · Nike Sneaker
    PT: Nike Sapatilhas · Nike Sapatos · Nike Sapatilha
  • {category: dress, occasion: party, color: black}
    EN: Black Party Dress · Party Dresses Black · Black Dress for Party
    PT: Vestido de Festa Preto · Vestidos de Festa Pretos · Vestido Preto para Festa

Output coupling to the NER-analyser

A sibling Airflow task — the NER analyser — consumes both the source-language queries and the translated queries and runs them through the NER engine for the target language. "This allows us to compare the NER tags of the original search query and the translated search query, and identify inconsistencies that can lead to search issues, such as missing tags or incorrectly tagged attributes in the new language."

The NER-parity comparison is its own diagnostic signal — not part of relevance scoring, but run in parallel against the same translated query set.
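A minimal sketch of such a parity check, assuming tags arrive as attribute-to-value dicts (the real engine's output format is not disclosed): it surfaces the inconsistencies the post names, missing tags and incorrectly tagged attributes, plus spurious extra tags.

```python
def ner_parity_diff(orig_tags, translated_tags):
    """Compare the NER tags extracted from the original query against those
    extracted from its translation; return a list of inconsistencies."""
    issues = []
    for attr, value in orig_tags.items():
        if attr not in translated_tags:
            # Tag present in the source language but lost in translation.
            issues.append(("missing_tag", attr, value))
        elif translated_tags[attr] != value:
            # Attribute survived but was tagged with a different value.
            issues.append(("incorrect_tag", attr, translated_tags[attr]))
    for attr in translated_tags.keys() - orig_tags.keys():
        # Attribute the target-language engine invented.
        issues.append(("extra_tag", attr, translated_tags[attr]))
    return issues
```

For example, `ner_parity_diff({"type": "jacket", "season": "winter"}, {"type": "jacket"})` flags the dropped `season` tag, which is exactly the kind of signal that would indicate a search issue in the new language.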

Known gaps

  • Clustering granularity and M — the source discloses N (1,500 segments) but not M, the number of paraphrase queries per scenario, nor the threshold that defines what counts as "the same" NER-tag set (stemming? synonym merging?).
  • Traffic-share denominator scope — whether shares are computed per market of origin or over the aggregated corpus isn't stated; for markets sharing a language the choice matters.
  • Translation-model identity not disclosed. "We can translate these queries to other languages using an LLM" — no model name.
