# NER-clustered Query Sampling from Production

## Intent
Construct a representative search-QA test set by sampling production queries clustered by NER-tag set (intent), ranked by traffic share, and optionally LLM-translated into a target language. The output is a scenario-balanced test set whose coverage reflects real-user intent diversity without the author bias of handcrafted cases.
(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
## The pipeline
Production search events
→ event-bus (Nakadi at Zalando)
→ Data Lake (persisted for OLAP processing)
│
▼
Cluster queries by NER-tag set
("Winter boot" / "Boot for winter" → same scenario)
│
▼
Rank clusters by traffic share
(top-N most-trafficked intents first)
│
▼
For each selected cluster, for each target language:
if source lang ≠ target lang:
LLM-translate all member queries
else:
pass through
│
▼
Emit a per-target-language test set, each case tagged with its NER-tag set
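A minimal Python sketch of this pipeline, under stated assumptions: `ner_tags` and `llm_translate` are hypothetical helpers standing in for the NER tagger and translation step (neither name comes from the source), and production events arrive pre-aggregated as `(query, source_lang, count)` tuples:

```python
from collections import defaultdict

def build_test_set(events, target_lang, ner_tags, llm_translate, top_n=1500):
    """Cluster production queries by NER-tag set, rank the clusters by
    summed traffic share, and emit a test set for target_lang.

    events: iterable of (query_text, source_lang, count) tuples.
    ner_tags(query) -> frozenset of tags, e.g. {"CATEGORY:boot", "SEASON:winter"}.
    llm_translate(query, lang) -> translated query string.
    Both helpers are hypothetical stand-ins, not Zalando APIs.
    """
    # 1. Cluster: queries with the same NER-tag set belong to one scenario,
    #    so "Winter boot" and "Boot for winter" collapse into one cluster.
    clusters = defaultdict(lambda: {"share": 0, "queries": []})
    for query, lang, count in events:
        scenario = ner_tags(query)            # frozenset -> usable as dict key
        clusters[scenario]["share"] += count  # share is summed per scenario
        clusters[scenario]["queries"].append((query, lang))

    # 2. Rank scenarios by total traffic share and keep the top N.
    ranked = sorted(clusters.items(),
                    key=lambda kv: kv[1]["share"], reverse=True)[:top_n]

    # 3. Translate member queries only when source and target languages differ.
    test_set = []
    for scenario, data in ranked:
        for query, lang in data["queries"]:
            text = query if lang == target_lang else llm_translate(query, target_lang)
            test_set.append({
                "query": text,
                "scenario": sorted(scenario),  # NER-tag set, kept as the segment key
                "source_lang": lang,
                "target_lang": target_lang,
            })
    return test_set
```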
## Structural properties

- Coverage is intent-shaped, not lexical. Paraphrases collapse into a single scenario, so a long-tail intent expressed many different ways can accumulate enough share to survive sampling.
- Per-scenario traffic share outweighs per-query frequency. A popular scenario with fifty paraphrase variants is ranked by the sum of its variants' traffic, not by its single most-frequent paraphrase.
- Scenario identity is the NER-tag set. This is what makes downstream cross-language work possible: translation preserves intent at the scenario level.
## Zalando's concrete knobs
- N = 1,500 top-traffic-share segments per market.
- Ranking basis = traffic share from existing markets, not the target market (the target market hasn't launched yet, so it has no traffic to sample).
- Translation policy = LLM translation when the target language is new to the system; direct reuse of English/French queries for markets those languages already cover in existing Zalando markets (e.g. Luxembourg).
- Override hook for handcrafted additions "to add or customise the test cases if we need to" — not described in detail.
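Expressed as configuration, the knobs above might look like the following sketch; the key names and market codes are illustrative assumptions, not Zalando's actual schema:

```python
# Illustrative only: key names and market codes are assumptions,
# not Zalando's actual configuration schema.
SAMPLING_CONFIG = {
    "top_n_segments": 1500,               # top traffic-share segments per market
    "ranking_basis": "existing_markets",  # target market has no traffic yet
    "translation_policy": {
        "LU": "reuse",           # Luxembourg: reuse existing French/English queries
        "PT": "llm_translate",   # new language for the system
        "GR": "llm_translate",   # new language for the system
    },
    "allow_handcrafted_overrides": True,  # the "add or customise" hook
}
```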
## Benefits
- Zero-author-bias test set. Test coverage is empirical.
- Cheap to refresh. Re-runnable as production traffic distribution shifts.
- Cross-market transferable. Same pipeline handles Luxembourg, Portugal, Greece without per-market hand-curation.
- Clustering axis doubles as diagnostic axis. The NER-tag set that clusters queries is also the segment for downstream aggregation (patterns/segment-level-relevance-dashboard).
## Risks

- Target-language NER coverage cliff. Clusters won't form for intents that don't tag cleanly in the target language, which can mask some low-priority intents entirely. The sidecar patterns/translated-query-ner-parity-check is what makes this observable (sketched after this list).
- Traffic-share bias. Popular intents dominate, while rare-but-important intents are under-covered. This can be corrected with explicit long-tail supplementary sampling (see Variations).
- Scenario granularity is a tuning knob. Too-coarse NER tagging merges distinct intents; too-fine tagging splits paraphrases into separate scenarios. There is no standard threshold.
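A sketch of the parity-check idea, assuming the same hypothetical `ner_tags` helper used in the pipeline sketch above:

```python
def ner_parity_check(original_query, translated_query, ner_tags):
    """Flag translated queries whose NER-tag set diverges from the
    original's: exactly the intents that would fail to cluster
    (or cluster wrongly) in the target language."""
    src_tags = ner_tags(original_query)
    tgt_tags = ner_tags(translated_query)
    return {
        "parity": src_tags == tgt_tags,
        "dropped": sorted(src_tags - tgt_tags),  # tags lost in translation
        "gained": sorted(tgt_tags - src_tags),   # tags introduced by translation
    }
```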
## Variations

- Error-biased sampling. Sample more from scenarios with historically low CTR or high abandonment (for markets where that data exists).
- Long-tail-supplemented sampling. Top-N by traffic share plus a uniform sample from the long tail (see the sketch after this list).
- Per-category stratified sampling. Ensure every top-level category is represented regardless of traffic share.
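A minimal sketch of the long-tail-supplemented variation; the function name and `tail_sample_size` are illustrative, and `ranked_scenarios` is assumed to be sorted by descending traffic share (e.g. the `ranked` list from the pipeline sketch above):

```python
import random

def long_tail_supplemented(ranked_scenarios, top_n=1500,
                           tail_sample_size=200, seed=0):
    """Top-N scenarios by traffic share, plus a uniform sample from the
    remaining long tail so rare-but-important intents get some coverage."""
    head = ranked_scenarios[:top_n]
    tail = ranked_scenarios[top_n:]
    rng = random.Random(seed)  # seeded for a reproducible test set
    supplement = rng.sample(tail, min(tail_sample_size, len(tail)))
    return head + supplement
```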
## Seen in
- sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge — canonical wiki instance; Zalando's search-quality framework samples 1,500 NER-tag segments per market per run.