CONCEPT Cited by 1 source
NER-clustered Query Sampling¶
Definition¶
NER-clustered query sampling is the query-sampling discipline of grouping production search queries by their NER-extracted attribute set (category / brand / colour / size / season / …) rather than by surface text, so that paraphrases of the same search intent are counted as one scenario.
The clustering axis is search intent. The input is raw query text + its NER-tagged attributes; the output is a set of scenario groups, each representing one intent the sampling policy can then rank, translate, or filter over.
(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
The failure mode it corrects¶
Naive top-N frequency sampling treats lexical surface form as the clustering axis:
"We should not only take most N frequent query terms, as different forms of queries may mean the same thing, e.g. 'Winter boot' and 'Boot for winter' are essentially the same search intent. They should belong together and be counted as the same search scenario." (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
Frequency-top-N over-counts paraphrase variants of popular intents and under-counts rarer but distinct intents. For a QA test set this is actively harmful: coverage collapses onto a handful of dominant surface forms while long-tail scenarios (where defects actually hide) are excluded.
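The over-counting can be sketched with a toy query log. The query strings and search counts below are invented for illustration, and `top_n_surface_forms` is a hypothetical helper, not something described in the source.

```python
from collections import Counter

# Hypothetical query log: (query text, number of searches). Invented data.
query_log = [
    ("winter boot", 500),
    ("boot for winter", 450),
    ("winter boots", 400),
    ("linen blazer", 120),
    ("rain jacket kids", 90),
]


def top_n_surface_forms(log, n):
    """Naive sampling: rank queries by frequency of their surface text."""
    counts = Counter()
    for text, searches in log:
        counts[text.strip().lower()] += searches
    return [text for text, _ in counts.most_common(n)]


top3 = top_n_surface_forms(query_log, 3)
print(top3)  # ['winter boot', 'boot for winter', 'winter boots']
```

All three sampled slots go to paraphrases of one intent, while the two distinct intents further down the log never enter the test set.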
How it works¶
Zalando's production NER engine extracts attributes from each query: "product name, brand, colour, size, season, occasion, material, etc.". Queries with the same NER-tag set are placed in one cluster:
| NER-tag set | Member queries |
|---|---|
| {category: kids, type: jacket, season: winter} | Kids Winter Jacket, Winter Jackets for Kids, Kids Jackets Winter |
| {category: shoes, brand: nike, type: sneakers} | Nike Sneakers, Nike Shoes, Nike Sneaker |
| {category: dress, occasion: party, color: black} | Black Party Dress, Party Dresses Black, Black Dress for Party |
Sampling then operates on scenario groups, not individual queries — typically ranked by traffic share (the sum of traffic across all members of the group) so the sampled set reflects where real users spend time.
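A minimal sketch of the grouping and ranking step, assuming each query arrives as a `(text, tags, traffic)` record. The record shape, the traffic numbers, and the function names are assumptions for illustration; the source does not describe Zalando's actual data model.

```python
from collections import defaultdict

# Hypothetical records: (query text, NER tags, traffic). Tags mirror the table
# above; traffic numbers are invented.
records = [
    ("Kids Winter Jacket", {"category": "kids", "type": "jacket", "season": "winter"}, 300),
    ("Winter Jackets for Kids", {"category": "kids", "type": "jacket", "season": "winter"}, 200),
    ("Nike Sneakers", {"category": "shoes", "brand": "nike", "type": "sneakers"}, 400),
    ("Nike Shoes", {"category": "shoes", "brand": "nike", "type": "sneakers"}, 50),
    ("Black Party Dress", {"category": "dress", "occasion": "party", "color": "black"}, 50),
]


def cluster_by_tag_set(records):
    """Group queries whose NER tags are identical, summing their traffic."""
    groups = defaultdict(lambda: {"queries": [], "traffic": 0})
    for query, tags, traffic in records:
        key = frozenset(tags.items())  # order-independent, hashable identity
        groups[key]["queries"].append(query)
        groups[key]["traffic"] += traffic
    return groups


def rank_by_traffic_share(groups):
    """Return (tags, member queries, traffic share), largest share first."""
    total = sum(g["traffic"] for g in groups.values())
    if total == 0:
        return []
    ranked = sorted(groups.items(), key=lambda kv: kv[1]["traffic"], reverse=True)
    return [(dict(key), g["queries"], g["traffic"] / total) for key, g in ranked]


ranked = rank_by_traffic_share(cluster_by_tag_set(records))
for tags, queries, share in ranked:
    print(f"{share:.0%}  {tags}  {queries}")
```

Using a `frozenset` of tag pairs as the key makes the group identity independent of both word order in the query and the order in which the tagger emits attributes.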
Properties that make this the right substrate¶
- Intent-stable across paraphrase. Wording variation collapses into one group; the test set doesn't double-count aliases.
- Intent-stable across languages. The tag set is the identity. Translating the members of the group preserves the scenario; this is what makes concepts/translated-query-parity possible.
- Inherits NER engine coverage limits. If the NER engine doesn't recognise an attribute in the target language, the group it should create for that intent won't exist — which is itself a diagnostic signal.
- Scales with tag vocabulary, not query volume. The number of unique NER-tag sets is bounded by the cross-product of the attribute vocabularies, not by query volume, so sampling from tag sets is qualitatively different from sampling raw queries.
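Two of the properties above can be made concrete in a short sketch. The vocabulary, the group keys, and both function names are hypothetical. The bound also assumes each attribute is either absent or carries exactly one value, which the source does not state.

```python
from math import prod


def max_tag_sets(vocabulary):
    """Upper bound on distinct tag sets.

    Assumption: every attribute is either absent or takes exactly one value,
    so each attribute contributes (number of values + 1) choices.
    The count includes the empty tag set.
    """
    return prod(len(values) + 1 for values in vocabulary.values())


def missing_groups(source_groups, target_groups):
    """Tag sets present in the source-language corpus but absent in the target."""
    return set(source_groups) - set(target_groups)


# Hypothetical attribute vocabulary.
vocabulary = {
    "category": {"kids", "shoes", "dress"},
    "season": {"winter", "summer"},
    "colour": {"black", "navy"},
}
bound = max_tag_sets(vocabulary)  # (3 + 1) * (2 + 1) * (2 + 1) = 36

# Hypothetical group keys before and after translating the member queries.
jacket = frozenset({("type", "jacket"), ("season", "winter")})
dress = frozenset({("type", "dress"), ("colour", "black")})
gaps = missing_groups({jacket, dress}, {jacket})
print(bound, [dict(g) for g in gaps])
```

The bound stays fixed however many queries are logged, and a non-empty `gaps` set points at intents the target-language NER failed to reproduce.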
Limitations¶
- NER coverage is the ceiling. Unrecognised entities fail silently — queries with partial or empty NER tag sets cluster into a degenerate group that doesn't capture their intent. Target-language NER gaps surface as missing groups in the translated corpus, not as anomalous group members.
- Granularity is a tuning knob. Too-coarse tag extraction merges distinct intents (e.g. losing `season`); too-fine extraction splits paraphrases unnecessarily (e.g. distinguishing `colour=navy` from `colour=blue`). The post doesn't quote the threshold.
- Traffic-share ranking can still miss long-tail defects. Ranking by traffic share biases sampling toward popular scenarios and under-samples rare-but-important ones. This is the same trade-off any sampling policy faces.
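The first two limitations can be illustrated by treating granularity as the set of attributes kept in the group key. This is a sketch under that assumption; `project`, `group_queries`, and the sample queries are hypothetical, and the source does not say how granularity is actually controlled.

```python
from collections import defaultdict


def project(tags, keep):
    """Keep only the attributes that take part in the group identity."""
    return frozenset((k, v) for k, v in tags.items() if k in keep)


def group_queries(tagged_queries, keep):
    """Group query texts by their tag set, restricted to the `keep` attributes."""
    groups = defaultdict(list)
    for query, tags in tagged_queries:
        groups[project(tags, keep)].append(query)
    return dict(groups)


# Hypothetical tagged queries; the last one has no recognised entities.
tagged = [
    ("winter boots", {"type": "boots", "season": "winter"}),
    ("summer boots", {"type": "boots", "season": "summer"}),
    ("asdf qwerty", {}),
]

fine = group_queries(tagged, keep={"type", "season"})
coarse = group_queries(tagged, keep={"type"})

# The empty key is the degenerate group for queries the NER could not tag.
untagged = fine.get(frozenset(), [])
print(len(fine), len(coarse), untagged)
```

Dropping `season` from the key merges the winter and summer intents into one group, and the empty key collects whatever the tagger could not recognise, so its size is a rough measure of the NER coverage gap.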
Seen in¶
- sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge — canonical wiki instance; Zalando clusters production queries by NER-tag set before translating and sampling for its pre-launch market-validation framework.
Related¶
- systems/zalando-ner-query-builder — the NER engine producing the clustering axis.
- systems/zalando-search-query-clustering — the pipeline implementing this concept in production.
- systems/zalando-search-quality-framework — downstream consumer.
- concepts/automated-test-generation-from-production-traffic
- concepts/translated-query-parity — complementary concept; tag-set stability is what makes cross-language parity tractable.
- patterns/ner-clustered-query-sampling-from-production — the production realisation.
- companies/zalando