
PATTERN

NER-clustered Query Sampling from Production

Intent

Construct a representative search-QA test set by sampling production queries clustered by NER-tag set (intent), ranked by traffic share, and optionally LLM-translated into a target language. The output is a scenario-balanced test set whose coverage reflects real-user intent diversity without the author bias of handcrafted cases.

(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

The pipeline

Production search events
    → event-bus (Nakadi at Zalando)
    → Data Lake (persisted for OLAP processing)
  Cluster queries by NER-tag set
    ("Winter boot" / "Boot for winter"  →  same scenario)
  Rank clusters by traffic share
    (top-N most-trafficked intents first)
  For each selected cluster, for each target language:
    if source lang ≠ target lang:
        LLM-translate all member queries
    else:
        pass through
  Emit per-target-language test set, tagged with NER set
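
A minimal Python sketch of the whole pipeline, assuming query strings have already been read out of the data lake. tag_entities and llm_translate are invented stand-ins for the real NER model and LLM endpoint; the toy lexicon exists only to make the sketch self-contained.

    from collections import Counter, defaultdict

    def tag_entities(query: str) -> frozenset[str]:
        # Hypothetical NER tagger: "winter boot" and "boot for winter"
        # both map to {"SEASON", "PRODUCT_TYPE"}, i.e. the same scenario.
        lexicon = {"winter": "SEASON", "boot": "PRODUCT_TYPE", "boots": "PRODUCT_TYPE"}
        return frozenset(lexicon[w] for w in query.lower().split() if w in lexicon)

    def llm_translate(query: str, target_lang: str) -> str:
        # Hypothetical LLM translation call, stubbed for the sketch.
        return f"[{target_lang}] {query}"

    def build_test_set(queries, source_lang, target_lang, top_n=1500):
        # 1. Cluster production queries by NER-tag set (the scenario key).
        clusters: dict[frozenset, Counter] = defaultdict(Counter)
        for q in queries:
            clusters[tag_entities(q)][q] += 1

        # 2. Rank scenarios by summed traffic share, not per-query frequency.
        ranked = sorted(clusters.items(), key=lambda kv: -sum(kv[1].values()))

        # 3. Emit the top-N scenarios, translating only when languages differ.
        test_set = []
        for tag_set, members in ranked[:top_n]:
            for q in members:
                text = q if source_lang == target_lang else llm_translate(q, target_lang)
                test_set.append({"query": text, "scenario": sorted(tag_set)})
        return test_set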

Structural properties

  • Coverage is intent-shaped, not lexical. Paraphrases collapse; long-tail intents survive sampling.
  • Per-scenario traffic share outweighs per-query frequency. A popular scenario with fifty paraphrase variants gets its share summed, not its most-frequent paraphrase (toy example after this list).
  • Scenario identity is the NER tag set. This is what makes downstream cross-language work possible — translation preserves intent at the scenario level.
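
A toy illustration of the traffic-share point, with invented tag sets and counts: the paraphrase-heavy scenario outranks the scenario holding the single most frequent query.

    # Invented clusters: two paraphrases of one intent vs. one
    # very frequent single-phrasing intent.
    clusters = {
        frozenset({"SEASON", "PRODUCT_TYPE"}): {"winter boot": 600, "boot for winter": 500},
        frozenset({"BRAND"}): {"nike": 900},
    }
    shares = {tags: sum(counts.values()) for tags, counts in clusters.items()}
    # SEASON+PRODUCT_TYPE sums to 1100 and outranks BRAND at 900,
    # even though "nike" is the single most frequent query.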

Zalando's concrete knobs

  • N = 1,500 top-traffic-share segments per market.
  • Ranking basis = traffic share from existing markets (not target market — target market hasn't launched yet).
  • Translation policy = LLM translation where the target language is new; direct reuse of English/French queries for markets already served by those languages in existing Zalando markets (e.g. Luxembourg).
  • Override hook for handcrafted additions "to add or customise the test cases if we need to" — not described in detail.
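
The same knobs as an explicit config sketch. Field names are invented; N = 1,500 and the Luxembourg-style reuse rule come from the source, the remaining defaults are assumptions.

    from dataclasses import dataclass, field

    @dataclass
    class SamplingConfig:
        top_n: int = 1500       # top traffic-share segments per market
        ranking_markets: list = field(default_factory=list)  # existing markets whose traffic ranks the clusters
        translate: bool = True  # False where an existing market's language already covers the target (e.g. Luxembourg)
        handcrafted_overrides: list = field(default_factory=list)  # manual additions/customisations, per the override hook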

Benefits

  • Zero-author-bias test set. Test coverage is empirical.
  • Cheap to refresh. Re-runnable as production traffic distribution shifts.
  • Cross-market transferable. Same pipeline handles Luxembourg, Portugal, Greece without per-market hand-curation.
  • Clustering axis doubles as diagnostic axis. The NER-tag set that clusters queries is also the segment for downstream aggregation (patterns/segment-level-relevance-dashboard).

Risks

  • Target-language NER coverage cliff. Clusters for intents that don't tag cleanly in the target language won't form — which masks some low-priority intents entirely. The sidecar patterns/translated-query-ner-parity-check is what makes this observable.
  • Traffic-share bias. Popular intents dominate; rare-but-important intents under-covered. Can be corrected with long-tail-explicit supplementary sampling.
  • Scenario granularity is a tuning knob. Too-coarse NER tagging merges distinct intents; too-fine splits paraphrases. No standard threshold.

Variations

  • Error-biased sampling. Sample more from scenarios with historical low CTR / high abandonment (for markets where that data exists).
  • Long-tail-supplemented sampling. Top-N by traffic share plus a uniform sample from the long tail (sketch after this list).
  • Per-category stratified sampling. Ensure every top-level category is represented regardless of traffic share.
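
A sketch of the long-tail-supplemented variation, reusing the traffic-share-ranked cluster list from the pipeline sketch above; tail_k and the fixed seed are invented parameters.

    import random

    def supplement_long_tail(ranked, top_n=1500, tail_k=100, seed=0):
        # Keep the head by traffic share, then uniform-sample the
        # remainder so rare-but-important intents are represented.
        head, tail = ranked[:top_n], ranked[top_n:]
        rng = random.Random(seed)  # fixed seed: refreshes stay reproducible
        return head + rng.sample(tail, min(tail_k, len(tail)))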
