CONCEPT
Stratified topic sampling¶
Definition¶
Stratified topic sampling is a production-monitoring sampling strategy that partitions incoming traffic by topic, intent, or category and draws a proportional (or deliberately over-sampled) batch from each stratum for evaluation, rather than sampling uniformly at random across all traffic.
The goal is to guarantee coverage of rare but high-impact topics, which uniform random sampling would under-represent because most traffic concentrates on a head of common topics.
Why stratify by topic¶
In customer-support chatbot traffic (and many other applications), topic distribution is long-tailed:
- A handful of topics ("Where's my order?", "Refund request") account for 50-70% of volume.
- Dozens of tail topics ("Alcohol verification failed", "Family-account billing", "EBT payment edge cases") each account for <1% of volume — but failing on any one of them can be catastrophic (compliance violation, SEV ticket, regulatory attention).
Uniform random sampling sends most of your evaluation budget to the head, leaving tail topics with 0 or 1 sample per run — too few to detect regressions. Stratified sampling fixes the per-topic minimum so every topic gets meaningful coverage.
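To make the under-representation concrete, here is a small worked calculation under assumed (illustrative) numbers: a tail topic at 0.5% of traffic and an evaluation budget of 500 chats per run. Under uniform sampling, the per-topic sample count is binomial, and the chance of drawing at most one chat from that topic is substantial:

```python
from math import comb

# Illustrative numbers, not from the source: a tail topic at 0.5% of
# traffic, with 500 chats sampled uniformly per evaluation run.
p, n = 0.005, 500

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k draws from the topic in n uniform samples."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability the run sees 0 or 1 chats from this topic.
p_at_most_one = binom_pmf(0, n, p) + binom_pmf(1, n, p)
print(f"P(<=1 sample from the tail topic): {p_at_most_one:.2f}")
```

Roughly a quarter to a third of runs would see zero or one example of the topic, which is far too few to detect a regression; a stratified floor removes that randomness entirely.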
Instacart's LACE production pipeline uses "stratified sampling based on topic distribution" to feed its evaluation dashboards and experimentation-platform integration (Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm).
Mechanism¶
- Classify every chat (or every input) by topic. The classifier can be the one the chatbot already uses internally, a separate classifier, or a post-hoc LLM labeller.
- Define strata. The topic taxonomy is the stratum list; typical taxonomies have 10–100 topics.
- Sample per stratum. Either:
  - Proportional stratified: draw `N × p_i` from topic `i` (`p_i` = topic's traffic share); still favours head topics but guarantees minimum coverage when `N` is large.
  - Equal stratified: draw `N / K` from each of `K` topics regardless of share; maximises tail coverage but biases the aggregate score toward tail quality.
  - Floor + proportional: floor of `m` per topic + remainder proportional; balances head weighting with tail coverage.
- Aggregate. Reweight if reporting an overall quality score. Otherwise report per-topic quality, which is usually the more actionable view.
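The floor + proportional variant can be sketched as follows. This is not Instacart's implementation; the function name, `floor` default, and data shapes are all assumptions for illustration:

```python
import random

def stratified_sample(chats_by_topic, n_total, floor=5, seed=0):
    """Floor + proportional stratified sample (a sketch; names illustrative).

    Every topic first gets up to `floor` samples; the remaining budget is
    split in proportion to each topic's traffic share. Rounding can make
    the final total differ from `n_total` by a few samples.
    """
    rng = random.Random(seed)
    total = sum(len(chats) for chats in chats_by_topic.values())
    # Floor allocation: guaranteed minimum coverage per topic.
    alloc = {t: min(floor, len(chats)) for t, chats in chats_by_topic.items()}
    # Split what's left of the budget proportionally to traffic share.
    remainder = max(0, n_total - sum(alloc.values()))
    for t, chats in chats_by_topic.items():
        extra = round(remainder * len(chats) / total)
        alloc[t] = min(len(chats), alloc[t] + extra)
    return {t: rng.sample(chats_by_topic[t], alloc[t]) for t in chats_by_topic}
```

Equal stratified sampling is the degenerate case where every topic's allocation is simply `n_total // K`; proportional stratified is the case where `floor=0`.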
Contrast with alternatives¶
| Strategy | Tail coverage | Bias on aggregate | When |
|---|---|---|---|
| Uniform random | poor | unbiased | head-dominated traffic where tail risk is low |
| Proportional stratified | fair | unbiased when reweighted | want topic-level visibility + unbiased aggregate |
| Equal stratified | best | aggregate biased toward tail quality | tail risk is the dominant concern |
| Floor + proportional | good | nearly unbiased | most production systems — LACE's likely shape |
| Adversarial / failure-mined | best on known failures | heavily biased | regression hunting for specific bugs |
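The "unbiased when reweighted" entries above refer to weighting each stratum's mean score by its true traffic share when rolling up to an overall number. A minimal sketch (names and score scale are assumed for illustration):

```python
def reweighted_mean(scores_by_topic, traffic_share):
    """Roll per-topic quality scores up to an unbiased overall score by
    weighting each stratum by its true traffic share, not its sample share."""
    return sum(
        traffic_share[t] * sum(scores) / len(scores)
        for t, scores in scores_by_topic.items()
    )
```

Without this step, any over-sampling of the tail drags the aggregate toward tail quality, which is exactly the bias the table flags for equal stratified sampling.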
Related concepts¶
- concepts/long-tail-query — the traffic shape that motivates the need for stratification (the Instacart Intent Engine post quantifies this: 98% of queries cache-served, 2% tail model-served; LACE on support traffic has the same shape).
- concepts/production-data-diversity — stratification is one mechanism for ensuring production-data diversity is reflected in evaluation samples.
- patterns/human-in-the-loop-quality-sampling — the parent pattern; sampling strategy (random vs. stratified vs. low-confidence) is the load-bearing design choice.
Tradeoffs¶
- Requires a reliable topic classifier. If classification is noisy, strata are noisy, and coverage guarantees are noisy. Topic drift over time means the classifier itself needs monitoring.
- New topics aren't in the stratum list yet. A user asking about a feature launched last week lands in an "unknown" bucket; that bucket needs explicit handling so genuinely novel topics aren't dropped on the floor.
- Stratum proliferation. Crossing 100 topics with even one more 5-value dimension (say, language or platform) yields 500 cells; you need to decide which slices to actually inspect in dashboards and which to aggregate away.
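One way to handle the classifier-noise and new-topic tradeoffs above is to route low-confidence or out-of-taxonomy labels to an explicit "unknown" stratum that gets sampled like any other. A sketch, assuming a classifier that returns a `(topic, confidence)` pair (the threshold and names are illustrative):

```python
def classify_with_fallback(classify, chat, known_topics, min_confidence=0.7):
    """Assign a chat to a stratum, routing anything the classifier is
    unsure about, or that falls outside the taxonomy, to 'unknown'.
    The 'unknown' stratum is then sampled and triaged by hand."""
    topic, confidence = classify(chat)
    if topic not in known_topics or confidence < min_confidence:
        return "unknown"
    return topic
```

Monitoring the size of the "unknown" stratum over time doubles as a drift signal: growth usually means the taxonomy or the classifier needs updating.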
Seen in¶
- sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm — canonical wiki instance at Instacart LACE: "Using stratified sampling based on topic distribution, LACE feeds into dashboards that let us: monitor performance trends over time, analyze specific interaction details to pinpoint issues, integrate feedback directly into experimentation platforms for real-time improvements."