

Stratified topic sampling

Definition

Stratified topic sampling is a production-monitoring sampling strategy that partitions incoming traffic by topic, intent, or category and draws a proportional (or deliberately over-sampled) batch from each stratum for evaluation, rather than sampling uniformly at random across all traffic.

The goal is to guarantee coverage of rare but high-impact topics, which uniform random sampling would under-represent because most traffic concentrates on a head of common topics.

Why stratify by topic

In customer-support chatbot traffic (and many other applications), topic distribution is long-tailed:

  • A handful of topics ("Where's my order?", "Refund request") account for 50-70% of volume.
  • Dozens of tail topics ("Alcohol verification failed", "Family-account billing", "EBT payment edge cases") each account for <1% of volume — but failing on any one of them can be catastrophic (compliance violation, SEV ticket, regulatory attention).

Uniform random sampling sends most of your evaluation budget to the head, leaving tail topics with 0 or 1 samples per run — too few to detect regressions. Stratified sampling sets a per-topic minimum so every topic gets meaningful coverage.
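The under-coverage is easy to quantify. A minimal sketch, assuming an illustrative 0.5%-share tail topic and a 200-chat evaluation batch (both numbers are assumptions, not from the source):

```python
# Illustrative figures: a tail topic with 0.5% traffic share,
# and an evaluation batch of 200 chats drawn uniformly at random.
tail_share = 0.005
batch_size = 200

# Expected number of samples hitting this topic.
expected = batch_size * tail_share

# Probability the batch contains zero samples of the topic (binomial).
p_zero = (1 - tail_share) ** batch_size

print(f"expected samples: {expected:.1f}")   # 1.0
print(f"P(zero samples):  {p_zero:.2f}")     # ~0.37
```

Roughly one expected sample per run, and better than a one-in-three chance of none at all; hence the per-topic floor.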

Instacart's LACE production pipeline uses "stratified sampling based on topic distribution" to feed its evaluation dashboards and experimentation-platform integration (Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm).

Mechanism

  1. Classify every chat (or every input) by topic. The topic classifier can be the same one the chatbot uses internally, a separate classifier, or a post-hoc LLM labeller.
  2. Define strata. The topic taxonomy is the stratum list; typical taxonomies have 10–100 topics.
  3. Sample per stratum. Either:
     • Proportional stratified: draw N × p_i from topic i (p_i = topic i's traffic share); still favours head topics but guarantees minimum coverage when N is large.
     • Equal stratified: draw N / K from each of K topics regardless of share; maximises tail coverage but biases the aggregate score toward tail quality.
     • Floor + proportional: a floor of m samples per topic, with the remainder allocated proportionally; balances head weighting with tail coverage.
  4. Aggregate. Reweight if reporting an overall quality score. Otherwise report per-topic quality, which is usually the more actionable view.
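The floor + proportional variant in step 3 can be sketched as follows (the function name, floor of 5, and rounding scheme are illustrative assumptions, not LACE's implementation):

```python
import random
from collections import defaultdict

def stratified_sample(chats, topic_of, batch_size, floor=5, seed=0):
    """Floor + proportional stratified sample.

    chats: list of chat records; topic_of: function mapping a chat to its topic.
    Every topic gets at least `floor` samples (or all of its chats, if fewer);
    the remaining budget is allocated proportionally to traffic share.
    """
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for chat in chats:
        by_topic[topic_of(chat)].append(chat)

    total = len(chats)
    # Floor allocation first.
    alloc = {t: min(floor, len(group)) for t, group in by_topic.items()}
    remainder = batch_size - sum(alloc.values())
    # Proportional share of the remainder, capped by stratum size.
    # (Rounding can over/undershoot batch_size by a few; fine for a sketch.)
    for t, group in by_topic.items():
        extra = round(remainder * len(group) / total)
        alloc[t] = min(len(group), alloc[t] + max(0, extra))

    sample = []
    for t, group in by_topic.items():
        sample.extend(rng.sample(group, alloc[t]))
    return sample, alloc
```

For step 4, an unbiased aggregate then weights each sampled chat by its stratum's traffic share divided by its share of the sample.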

Contrast with alternatives

Strategy | Tail coverage | Bias on aggregate | When
--- | --- | --- | ---
Uniform random | poor | unbiased | head-dominated traffic where tail risk is low
Proportional stratified | fair | unbiased when reweighted | want topic-level visibility + unbiased aggregate
Equal stratified | best | biased toward tail quality | tail risk is the dominant concern
Floor + proportional | good | nearly unbiased | most production systems (LACE's likely shape)
Adversarial / failure-mined | best on known failures | heavily biased | regression hunting for specific bugs

Related

  • concepts/long-tail-query — the traffic shape that motivates stratification (the Instacart Intent Engine post quantifies this: 98% of queries cache-served, 2% tail model-served; LACE on support traffic has the same shape).
  • concepts/production-data-diversity — stratification is one mechanism for ensuring production-data diversity is reflected in evaluation samples.
  • patterns/human-in-the-loop-quality-sampling — the parent pattern; sampling strategy (random vs. stratified vs. low-confidence) is the load-bearing design choice.

Tradeoffs

  • Requires a reliable topic classifier. If classification is noisy, strata are noisy, and coverage guarantees are noisy. Topic drift over time means the classifier itself needs monitoring.
  • New topics aren't in the stratum list yet. A user asking about a feature launched last week lands in an "unknown" bucket; that bucket needs explicit handling so genuinely novel topics aren't dropped on the floor.
  • Stratum proliferation. 100 topics × 5 dimensions = 500 cells; need to think about which slices to actually inspect in dashboards vs. aggregate away.
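The "unknown bucket" handling above can be as simple as a confidence-gated catch-all stratum. A hedged sketch; the function name, threshold, and bucket name are all assumptions:

```python
def assign_stratum(topic, confidence, known_topics, threshold=0.6):
    """Route a classified chat to a stratum, with an explicit catch-all.

    Low-confidence or out-of-taxonomy classifications land in "unknown",
    which is then sampled like any other stratum, so novel topics keep
    getting evaluation coverage until the taxonomy catches up.
    """
    if topic not in known_topics or confidence < threshold:
        return "unknown"
    return topic
```

The "unknown" stratum doubles as a monitoring signal: a growing share of unknowns suggests topic drift or a stale taxonomy.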
