PATTERN Cited by 1 source

Low-confidence to human review

Intent

Route only the outputs the model is least certain about to human review, rather than sampling uniformly or reviewing everything, so that scarce human reviewer time is spent on the errors most likely to actually be errors.

The pattern consumes a per-output confidence score — typically from LLM self-verification or a classifier's softmax head — and routes outputs with confidence < threshold to a human correction queue. Outputs above the threshold ship directly downstream.

When to use

  • An ML / LLM pipeline that emits a calibrated (or calibratable) confidence score per output.
  • Production quality matters enough to warrant human review, but reviewing 100% of outputs is not budgeted.
  • The cost of a bad output reaching downstream is non-trivial (customer-facing catalog attribute, compliance-relevant extraction, safety-filtered content).

Mechanism

model(x) → (ŷ, confidence) → if confidence < τ:
                                 enqueue(x, ŷ) to human_review
                              else:
                                 emit ŷ
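
The dispatch above can be sketched as runnable Python (the `Prediction` shape, queue names, and τ = 0.7 are illustrative assumptions, not from the source):

```python
from dataclasses import dataclass

TAU = 0.7  # confidence threshold; tuned per task, not a universal constant

@dataclass
class Prediction:
    x: str              # model input
    y_hat: str          # model output
    confidence: float   # per-output confidence score

def route(pred: Prediction, human_review: list, downstream: list) -> None:
    """Send low-confidence outputs to the human correction queue; ship the rest."""
    if pred.confidence < TAU:
        human_review.append(pred)       # reviewer will confirm, correct, or reject
    else:
        downstream.append(pred.y_hat)   # ships directly
```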

Human reviewers see the input and the model's output, and either confirm, correct, or reject it. Corrections feed back into:

  • The authoritative output (replaces ŷ in the catalog).
  • The eval set for future prompt iterations.
  • (Optionally) a fine-tuning set if the model is being distilled or fine-tuned.

Why it beats uniform-random sampling

Uniform sampling for HITL review catches errors at the base error rate — if the model is 95% accurate, you review 100 outputs to catch ~5 errors.

Low-confidence routing catches errors at the conditional error rate given low confidence — if calibrated, a confidence threshold of 0.7 might isolate a population whose error rate is 30-50%, meaning you review 100 outputs to catch 30-50 errors: 6-10× the error-catching throughput per reviewer hour.
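
The arithmetic behind that claim, using the rates quoted above (the 40% conditional rate is one assumed point inside the 30-50% range):

```python
uniform_rate = 0.05  # base error rate of a 95%-accurate model
routed_rate = 0.40   # assumed conditional error rate below the threshold

# Errors caught per 100 reviewed outputs under each sampling scheme:
assert round(100 * uniform_rate) == 5    # uniform-random review
assert round(100 * routed_rate) == 40    # low-confidence-routed review
assert round(routed_rate / uniform_rate) == 8  # ~8x throughput at this point
```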

Pair with orthogonal sampling-based HITL

Low-confidence routing alone has a systematic blind spot: the model can be confidently wrong. Calibration failures on a new input distribution (new brand, new category, new packaging style) can produce high-confidence bad outputs that the routing never surfaces.

Ship both:

  • Low-confidence → human (this pattern) — catches known-uncertain errors.
  • patterns/human-in-the-loop-quality-sampling — periodic random sample regardless of confidence; catches drift + calibration failure on new input distributions.

Two loops, two failure modes covered.
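
Combined, the two loops can be sketched as one dispatch function (the 2% audit rate and all names are illustrative assumptions):

```python
import random

TAU = 0.7           # confidence threshold (per-task tuning)
AUDIT_RATE = 0.02   # fraction randomly audited regardless of confidence

def dispatch(confidence: float, rng=random.random) -> str:
    """Route one output to a single destination. A production system might
    instead ship the randomly audited output and review it asynchronously."""
    if confidence < TAU:
        return "human_review"   # loop 1: known-uncertain errors
    if rng() < AUDIT_RATE:
        return "random_audit"   # loop 2: catches confident-but-wrong drift
    return "downstream"
```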

Tradeoffs / gotchas

  • Threshold tuning is a per-task exercise. Too low → too many outputs sent to humans (reviewer backlog, latency). Too high → too few outputs reviewed (quality leaks). Tune against a labeled holdout to set the operating point.
  • Calibration is load-bearing. Raw yes-logits from LLMs may not be well-calibrated — a confidence of 0.7 might correspond to a 40% error rate in one task and 5% in another. Layer isotonic / Platt recalibration on a labeled set per-task.
  • Review latency may gate downstream. Some use cases (live customer-facing) can't wait on human review — for those, route to a stronger LLM instead and only defer to humans asynchronously.
  • Adverse feedback loop on training data. If you fine-tune on only the human-corrected outputs (which are skewed toward low-confidence inputs), the fine-tuned model may become worse on the majority of easy inputs. Mix in high-confidence-confirmed examples too.
  • Reviewer agreement is itself noisy. Catalog attribute review tasks often have annotator disagreement rates of 10-20%; don't treat "human answer" as ground truth — use majority-of-N or adjudicated-by-senior for disagreement cases.
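
The recalibration and threshold-tuning bullets above can be sketched together — here with a simple per-bin accuracy map as a crude stand-in for isotonic / Platt recalibration, fit on a labeled holdout (function names, the bin count, and the review budget are all illustrative assumptions):

```python
import numpy as np

def fit_binned_calibrator(raw_conf, is_correct, n_bins=10):
    """Map raw confidence to empirical P(correct) per bin — a crude
    stand-in for isotonic / Platt recalibration on a labeled holdout."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(raw_conf, edges) - 1, 0, n_bins - 1)
    acc = np.array([is_correct[idx == b].mean() if np.any(idx == b) else np.nan
                    for b in range(n_bins)])

    def calibrate(conf):
        b = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)
        return acc[b]

    return calibrate

def pick_threshold(calibrated_conf, review_budget=0.2):
    """Choose tau so that roughly `review_budget` of traffic goes to review."""
    return float(np.quantile(calibrated_conf, review_budget))
```

The operating point then falls out of the review budget rather than a guessed confidence value; a real deployment would re-fit the calibrator per task and re-check it as the input distribution drifts.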

Seen in

  • sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms — canonical wiki instance. PARSE uses the self-verification confidence score to proactively route extractions with low confidence to human auditors for correction before catalog ingestion — one of two production HITL loops (the other being periodic random sampling for drift detection). "This process considers the extracted values of products with a low confidence score as potentially incorrect values, and has them reviewed and corrected by human auditors."