PATTERN
Human-in-the-loop quality sampling¶
Intent¶
Continuously estimate production quality of an ML / LLM pipeline by periodically drawing a random sample of its outputs and sending them for human review (or LLM-as-judge auto-review), so that systematic drift — the model's accuracy degrading on new input distributions — is detected even when the model's own confidence scores stay high.
The pattern is structurally orthogonal to [[patterns/low-confidence-to-human-review]]: where low-confidence routing samples "outputs the model itself thinks may be wrong", random sampling samples "any output, including ones the model is confident about". They catch different failure modes and both belong in a production quality pipeline.
When to use¶
- Long-lived production ML / LLM pipeline where inputs evolve (new products, new content, new user cohorts).
- Model outputs are not otherwise labeled in production (no user feedback signal, no downstream oracle).
- Output quality affects a customer-facing surface whose SLA requires ongoing quality evidence, not just launch-time eval.
Mechanism¶
- Periodic sample generation. On a schedule (daily, weekly), randomly sample N outputs from production extractions / predictions.
- Dual-path evaluation.
- Human auditors label the ground truth attribute value. This is slow but is the calibration anchor.
- LLM-as-judge auto-evaluates the same sample for speed and volume. Use this to multiply coverage.
- Quality metrics: per-attribute accuracy, precision, recall, F1 on the sample.
- Drift alarm: if metrics degrade by more than a threshold vs. the last N samples, open a prompt-iteration ticket, roll back the latest prompt version, or page the on-call.
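The mechanism above can be sketched in a few lines. This is a minimal illustration, not an API from the source; `sample_outputs`, `attribute_accuracy`, and `drift_alarm` are hypothetical names.

```python
import random
import statistics

def sample_outputs(production_outputs, n):
    """Draw a uniform random sample of production outputs for review."""
    return random.sample(production_outputs, min(n, len(production_outputs)))

def attribute_accuracy(reviewed):
    """Fraction of sampled outputs whose predicted attribute value
    matches the reviewer-supplied ground truth."""
    return sum(r["predicted"] == r["truth"] for r in reviewed) / len(reviewed)

def drift_alarm(recent_accuracies, current, threshold=0.05):
    """Alarm if the current sample's accuracy falls more than `threshold`
    below the mean of the last N sample accuracies."""
    return statistics.mean(recent_accuracies) - current > threshold
```

In practice the reviewed sample would come back from the human-auditor or LLM-judge path, and the alarm would feed whatever ticketing or paging system the platform already uses.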
Why random beats confidence-gated sampling for drift¶
The confidence score is a function of the current model on current inputs. When the input distribution shifts (a new brand family, a new nutrition-label format), the model can be systematically, confidently wrong — outputs with high confidence that are nonetheless incorrect. Confidence-routed HITL review never surfaces these.
A random sample is distribution-representative and will include high-confidence-wrong outputs in proportion to their prevalence. It's the only HITL mechanism that sees this failure mode.
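The argument can be made concrete with a toy simulation over a synthetic population; the 10% high-confidence-wrong rate is an assumption for illustration, not a figure from the source.

```python
import random

random.seed(0)

# Synthetic population after an input-distribution shift: 10% of outputs
# are high-confidence yet wrong (the drift failure mode), 10% are
# low-confidence, and the rest are high-confidence and correct.
population = (
    [{"conf": 0.95, "correct": False}] * 100
    + [{"conf": 0.95, "correct": True}] * 800
    + [{"conf": 0.40, "correct": False}] * 100
)

# Confidence-gated routing only ever reviews the low-confidence slice,
# so it surfaces zero high-confidence-wrong outputs.
gated = [o for o in population if o["conf"] < 0.5]
missed = sum(o["conf"] >= 0.5 and not o["correct"] for o in gated)

# A random sample includes high-confidence-wrong outputs roughly in
# proportion to their prevalence in production (~10% here).
sample = random.sample(population, 200)
caught = sum(o["conf"] >= 0.5 and not o["correct"] for o in sample)
```

`missed` is zero by construction, which is the point: no threshold on the model's own confidence can route these outputs to review.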
Development mode vs. production mode¶
The same sampling mechanism appears in two places:
- Development mode (pre-deploy): run on a small sample the prompt author uploaded, to compute quality metrics while iterating on the prompt. Faster feedback loop than waiting on full-production evaluation. LLM-as-judge is especially valuable here, since human review is otherwise the bottleneck in the iteration loop.
- Production mode (post-deploy): run on a periodic random sample of live extractions to detect drift and alarm.
Same pipeline, two different populations.
Tradeoffs / gotchas¶
- Sample size × review budget is the operating point. Smaller samples → lower review cost but wider confidence intervals on drift metrics. Tune against historical drift magnitude and alerting tolerance.
- LLM-judge agreement with humans must itself be monitored. If the judge model drifts or is updated, its auto-evaluations may diverge from human truth — recalibrate periodically.
- Random isn't enough for long-tail categories. A purely random sample may under-represent rare categories you care about. Consider stratified sampling across category / brand / language buckets.
- Reviewer backlog coupling. Both this pattern and patterns/low-confidence-to-human-review draw from the same reviewer pool. Budget accordingly or starve one of the two.
- Alarm thresholds are a learning curve. Early in a new platform, quality metrics are noisy and drift alarms fire frequent false positives. Expect to re-tune.
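The sample-size tradeoff can be quantified with the standard normal approximation to a binomial proportion; this is a back-of-the-envelope sketch assuming a true accuracy around 0.9, not a prescription from the source.

```python
import math

def accuracy_ci_halfwidth(n, p=0.9, z=1.96):
    """Approximate 95% confidence-interval half-width for an accuracy
    estimate from a sample of size n, assuming true accuracy p
    (normal approximation to the binomial)."""
    return z * math.sqrt(p * (1 - p) / n)

# A 5-point drift is invisible at n = 50 but resolvable at n = 1000.
for n in (50, 200, 1000):
    print(n, round(accuracy_ci_halfwidth(n), 3))
# 50   -> 0.083
# 200  -> 0.042
# 1000 -> 0.019
```

Picking N is therefore a direct trade between review budget and the smallest drift the alarm can reliably see.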
Seen in¶
- sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms — canonical wiki instance. PARSE's quality-screening component implements this pattern in both modes: development-mode sample-based eval (human + LLM-judge) to unblock prompt iteration, and production-mode periodic sampling to monitor live extraction quality for drift. Runs in parallel with the proactive low-confidence HITL loop. "The component creates a sample set periodically from the attribute extraction results of new products, and has it evaluated by either human auditors or LLM evaluation. This can help monitor if there is a quality drop that requires attention."
Related¶
- patterns/low-confidence-to-human-review — the orthogonal HITL loop; random sampling catches drift, low-confidence catches uncertain outputs. Run both.
- patterns/llm-attribute-extraction-platform — the broader platform pattern this sampling lives inside.
- concepts/llm-as-judge — the auto-eval sibling that multiplies human reviewer coverage.
- concepts/llm-self-verification — the confidence source for the other HITL loop.
- systems/instacart-parse — canonical production instance.