CONCEPT
Stratified topic sampling¶
Definition¶
Stratified topic sampling is a production-monitoring sampling strategy that partitions incoming traffic by topic, intent, or category and draws a proportional (or deliberately over-sampled) batch from each stratum for evaluation, rather than sampling uniformly at random across all traffic.
The goal is to guarantee coverage of rare but high-impact topics, which uniform random sampling would under-represent because most traffic concentrates on a head of common topics.
Why stratify by topic¶
In customer-support chatbot traffic (and many other applications), topic distribution is long-tailed:
- A handful of topics ("Where's my order?", "Refund request") account for 50-70% of volume.
- Dozens of tail topics ("Alcohol verification failed", "Family-account billing", "EBT payment edge cases") each account for <1% of volume — but failing on any one of them can be catastrophic (compliance violation, SEV ticket, regulatory attention).
Uniform random sampling sends most of your evaluation budget to the head, leaving tail topics with 0 or 1 sample per run — too few to detect regressions. Stratified sampling fixes the per-topic minimum so every topic gets meaningful coverage.
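To make the under-representation concrete, here is a small worked calculation under assumed (illustrative) numbers: a tail topic at 0.5% of traffic and an evaluation budget of 500 chats per run. Under uniform sampling, the per-topic sample count is binomial, and the chance of drawing at most one chat from that topic is substantial:

```python
from math import comb

# Illustrative numbers, not from the source: a tail topic at 0.5% of
# traffic, with 500 chats sampled uniformly per evaluation run.
p, n = 0.005, 500

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k draws from the topic in n uniform samples."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability the run sees 0 or 1 chats from this topic.
p_at_most_one = binom_pmf(0, n, p) + binom_pmf(1, n, p)
print(f"P(<=1 sample from the tail topic): {p_at_most_one:.2f}")
```

Roughly a quarter to a third of runs would see zero or one example of the topic, which is far too few to detect a regression; a stratified floor removes that randomness entirely.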
Instacart's LACE production pipeline uses "stratified sampling based on topic distribution" to feed its evaluation dashboards and experimentation-platform integration (Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm).
Mechanism¶
- Classify every chat (or every input) by topic. The classifier can be the one the chatbot already uses internally, a separate classifier, or a post-hoc LLM labeller.
- Define strata. The topic taxonomy is the stratum list; typical taxonomies have 10–100 topics.
- Sample per stratum. Either:
  - Proportional stratified: draw `N × p_i` from topic `i` (`p_i` = topic's traffic share); still favours head topics but guarantees minimum coverage when `N` is large.
  - Equal stratified: draw `N / K` from each of `K` topics regardless of share; maximises tail coverage but biases the aggregate score toward tail quality.
  - Floor + proportional: floor of `m` per topic + remainder proportional; balances head weighting with tail coverage.
- Aggregate. Reweight if reporting an overall quality score. Otherwise report per-topic quality, which is usually the more actionable view.
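The floor + proportional variant can be sketched as follows. This is not Instacart's implementation; the function name, `floor` default, and data shapes are all assumptions for illustration:

```python
import random

def stratified_sample(chats_by_topic, n_total, floor=5, seed=0):
    """Floor + proportional stratified sample (a sketch; names illustrative).

    Every topic first gets up to `floor` samples; the remaining budget is
    split in proportion to each topic's traffic share. Rounding can make
    the final total differ from `n_total` by a few samples.
    """
    rng = random.Random(seed)
    total = sum(len(chats) for chats in chats_by_topic.values())
    # Floor allocation: guaranteed minimum coverage per topic.
    alloc = {t: min(floor, len(chats)) for t, chats in chats_by_topic.items()}
    # Split what's left of the budget proportionally to traffic share.
    remainder = max(0, n_total - sum(alloc.values()))
    for t, chats in chats_by_topic.items():
        extra = round(remainder * len(chats) / total)
        alloc[t] = min(len(chats), alloc[t] + extra)
    return {t: rng.sample(chats_by_topic[t], alloc[t]) for t in chats_by_topic}
```

Equal stratified sampling is the degenerate case where every topic's allocation is simply `n_total // K`; proportional stratified is the case where `floor=0`.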
Contrast with alternatives¶
| Strategy | Tail coverage | Bias on aggregate | When |
|---|---|---|---|
| Uniform random | poor | unbiased | head-dominated traffic where tail risk is low |
| Proportional stratified | fair | unbiased when reweighted | want topic-level visibility + unbiased aggregate |
| Equal stratified | best | aggregate biased toward tail quality | tail risk is the dominant concern |
| Floor + proportional | good | nearly unbiased | most production systems — LACE's likely shape |
| Adversarial / failure-mined | best on known failures | heavily biased | regression hunting for specific bugs |
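The "unbiased when reweighted" entries above refer to weighting each stratum's mean score by its true traffic share when rolling up to an overall number. A minimal sketch (names and score scale are assumed for illustration):

```python
def reweighted_mean(scores_by_topic, traffic_share):
    """Roll per-topic quality scores up to an unbiased overall score by
    weighting each stratum by its true traffic share, not its sample share."""
    return sum(
        traffic_share[t] * sum(scores) / len(scores)
        for t, scores in scores_by_topic.items()
    )
```

Without this step, any over-sampling of the tail drags the aggregate toward tail quality, which is exactly the bias the table flags for equal stratified sampling.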
Related concepts¶
- concepts/long-tail-query — the traffic shape that motivates the need for stratification (the Instacart Intent Engine post quantifies this: 98% of queries cache-served, 2% tail model-served; LACE on support traffic has the same shape).
- concepts/production-data-diversity — stratification is one mechanism for ensuring production-data diversity is reflected in evaluation samples.
- patterns/human-in-the-loop-quality-sampling — the parent pattern; sampling strategy (random vs. stratified vs. low-confidence) is the load-bearing design choice.
Tradeoffs¶
- Requires a reliable topic classifier. If classification is noisy, strata are noisy, and coverage guarantees are noisy. Topic drift over time means the classifier itself needs monitoring.
- New topics aren't in the stratum list yet. A user asking about a feature launched last week lands in an "unknown" bucket; that bucket needs explicit handling so genuinely novel topics aren't dropped on the floor.
- Stratum proliferation. Crossing 100 topics with even one more 5-value dimension (say, language or platform) yields 500 cells; you need to decide which slices to actually inspect in dashboards and which to aggregate away.
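One way to handle the classifier-noise and new-topic tradeoffs above is to route low-confidence or out-of-taxonomy labels to an explicit "unknown" stratum that gets sampled like any other. A sketch, assuming a classifier that returns a `(topic, confidence)` pair (the threshold and names are illustrative):

```python
def classify_with_fallback(classify, chat, known_topics, min_confidence=0.7):
    """Assign a chat to a stratum, routing anything the classifier is
    unsure about, or that falls outside the taxonomy, to 'unknown'.
    The 'unknown' stratum is then sampled and triaged by hand."""
    topic, confidence = classify(chat)
    if topic not in known_topics or confidence < min_confidence:
        return "unknown"
    return topic
```

Monitoring the size of the "unknown" stratum over time doubles as a drift signal: growth usually means the taxonomy or the classifier needs updating.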
Seen in¶
- sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm — canonical wiki instance at Instacart LACE: "Using stratified sampling based on topic distribution, LACE feeds into dashboards that let us: monitor performance trends over time, analyze specific interaction details to pinpoint issues, integrate feedback directly into experimentation platforms for real-time improvements."