
PATTERN

NER-clustered Query Sampling from Production

Intent

Construct a representative search-QA test set by sampling production queries clustered by NER-tag set (intent), ranked by traffic share, and optionally LLM-translated into a target language. The output is a scenario-balanced test set whose coverage reflects real-user intent diversity without the author bias of handcrafted cases.

(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

The pipeline

Production search events
    → event-bus (Nakadi at Zalando)
    → Data Lake (persisted for OLAP processing)
  Cluster queries by NER-tag set
    ("Winter boot" / "Boot for winter"  →  same scenario)
  Rank clusters by traffic share
    (top-N most-trafficked intents first)
  For each selected cluster, for each target language:
    if source lang ≠ target lang:
        LLM-translate all member queries
    else:
        pass through
  Emit per-target-language test set, tagged with NER set
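
A minimal Python sketch of the whole pipeline, assuming query strings have already been read out of the data lake. tag_entities and llm_translate are invented stand-ins for the real NER model and LLM endpoint; the toy lexicon exists only to make the sketch self-contained.

    from collections import Counter, defaultdict

    def tag_entities(query: str) -> frozenset[str]:
        # Hypothetical NER tagger: "winter boot" and "boot for winter"
        # both map to {"SEASON", "PRODUCT_TYPE"}, i.e. the same scenario.
        lexicon = {"winter": "SEASON", "boot": "PRODUCT_TYPE", "boots": "PRODUCT_TYPE"}
        return frozenset(lexicon[w] for w in query.lower().split() if w in lexicon)

    def llm_translate(query: str, target_lang: str) -> str:
        # Hypothetical LLM translation call, stubbed for the sketch.
        return f"[{target_lang}] {query}"

    def build_test_set(queries, source_lang, target_lang, top_n=1500):
        # 1. Cluster production queries by NER-tag set (the scenario key).
        clusters: dict[frozenset, Counter] = defaultdict(Counter)
        for q in queries:
            clusters[tag_entities(q)][q] += 1

        # 2. Rank scenarios by summed traffic share, not per-query frequency.
        ranked = sorted(clusters.items(), key=lambda kv: -sum(kv[1].values()))

        # 3. Emit the top-N scenarios, translating only when languages differ.
        test_set = []
        for tag_set, members in ranked[:top_n]:
            for q in members:
                text = q if source_lang == target_lang else llm_translate(q, target_lang)
                test_set.append({"query": text, "scenario": sorted(tag_set)})
        return test_set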

Structural properties

  • Coverage is intent-shaped, not lexical. Paraphrases collapse; long-tail intents survive sampling.
  • Per-scenario traffic share outweighs per-query frequency. A popular scenario with fifty paraphrase variants gets its share summed, not its most-frequent paraphrase (toy example after this list).
  • Scenario identity is the NER tag set. This is what makes downstream cross-language work possible — translation preserves intent at the scenario level.
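
A toy illustration of the traffic-share point, with invented tag sets and counts: the paraphrase-heavy scenario outranks the scenario holding the single most frequent query.

    # Invented clusters: two paraphrases of one intent vs. one
    # very frequent single-phrasing intent.
    clusters = {
        frozenset({"SEASON", "PRODUCT_TYPE"}): {"winter boot": 600, "boot for winter": 500},
        frozenset({"BRAND"}): {"nike": 900},
    }
    shares = {tags: sum(counts.values()) for tags, counts in clusters.items()}
    # SEASON+PRODUCT_TYPE sums to 1100 and outranks BRAND at 900,
    # even though "nike" is the single most frequent query.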

Zalando's concrete knobs

  • N = 1,500 top-traffic-share segments per market.
  • Ranking basis = traffic share from existing markets (not target market — target market hasn't launched yet).
  • Translation policy = LLM translation where the target language is new; direct reuse of English/French queries for markets already served by those languages in existing Zalando markets (e.g. Luxembourg).
  • Override hook for handcrafted additions "to add or customise the test cases if we need to" — not described in detail.
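
The same knobs as an explicit config sketch. Field names are invented; N = 1,500 and the Luxembourg-style reuse rule come from the source, the remaining defaults are assumptions.

    from dataclasses import dataclass, field

    @dataclass
    class SamplingConfig:
        top_n: int = 1500       # top traffic-share segments per market
        ranking_markets: list = field(default_factory=list)  # existing markets whose traffic ranks the clusters
        translate: bool = True  # False where an existing market's language already covers the target (e.g. Luxembourg)
        handcrafted_overrides: list = field(default_factory=list)  # manual additions/customisations, per the override hook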

Benefits

  • Zero-author-bias test set. Test coverage is empirical.
  • Cheap to refresh. Re-runnable as production traffic distribution shifts.
  • Cross-market transferable. Same pipeline handles Luxembourg, Portugal, Greece without per-market hand-curation.
  • Clustering axis doubles as diagnostic axis. The NER-tag set that clusters queries is also the segment for downstream aggregation (patterns/segment-level-relevance-dashboard).

Risks

  • Target-language NER coverage cliff. Clusters for intents that don't tag cleanly in the target language won't form — which masks some low-priority intents entirely. The sidecar patterns/translated-query-ner-parity-check is what makes this observable.
  • Traffic-share bias. Popular intents dominate; rare-but-important intents under-covered. Can be corrected with long-tail-explicit supplementary sampling.
  • Scenario granularity is a tuning knob. Too-coarse NER tagging merges distinct intents; too-fine splits paraphrases. No standard threshold.

Variations

  • Error-biased sampling. Sample more from scenarios with historical low CTR / high abandonment (for markets where that data exists).
  • Long-tail-supplemented sampling. Top-N by traffic share plus a uniform sample from the long tail (sketch after this list).
  • Per-category stratified sampling. Ensure every top-level category is represented regardless of traffic share.
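
A sketch of the long-tail-supplemented variation, reusing the traffic-share-ranked cluster list from the pipeline sketch above; tail_k and the fixed seed are invented parameters.

    import random

    def supplement_long_tail(ranked, top_n=1500, tail_k=100, seed=0):
        # Keep the head by traffic share, then uniform-sample the
        # remainder so rare-but-important intents are represented.
        head, tail = ranked[:top_n], ranked[top_n:]
        rng = random.Random(seed)  # fixed seed: refreshes stay reproducible
        return head + rng.sample(tail, min(tail_k, len(tail)))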
