
CONCEPT

LLM self-verification

Definition

LLM self-verification is the technique of obtaining a confidence score for an LLM's output by asking the same (or a second) LLM a follow-up entailment question ("given these inputs and this output, is the output correct, yes or no?") and reading the probability of the yes token, derived from its logit, as the confidence score.

The key design detail is that the verification output is constrained so that the first generated token is yes or no; the logit of that token is then a direct read-off of the model's P(correct | inputs, output). That probability is the confidence score, not a free-form "how confident are you from 0 to 10" response, which is known to correlate poorly with actual correctness.

Mechanism

  1. Run the extraction / reasoning prompt as usual. Get output ŷ.
  2. Construct a second prompt: "Given features X and task definition T, is ŷ the correct output? Answer yes or no."
  3. Query the LLM in a mode where the first generated token is forced (or very likely) to be yes or no.
  4. Read logit(yes) (and logit(no)) from the output distribution.
  5. Normalise: confidence = exp(logit(yes)) / (exp(logit(yes)) + exp(logit(no))).

That scalar is the self-verification confidence score.
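The five steps above can be sketched as follows. The `llm.first_token_logits` call is a hypothetical client method (stand-in for whatever your provider exposes, e.g. a logprobs option); the softmax in step 5 is exactly the formula given above.

```python
import math

def confidence_from_logits(logit_yes: float, logit_no: float) -> float:
    """Step 5: two-way softmax over the yes/no logits."""
    e_yes = math.exp(logit_yes)
    e_no = math.exp(logit_no)
    return e_yes / (e_yes + e_no)

def verify(llm, features: str, task: str, output: str) -> float:
    """Steps 2-4: build the entailment prompt and read first-token logits.

    `llm` is assumed to expose first_token_logits(prompt) -> {token: logit};
    this is an illustrative interface, not a real provider API.
    """
    prompt = (
        f"Given features {features} and task definition {task}, "
        f"is {output!r} the correct output? Answer yes or no."
    )
    logits = llm.first_token_logits(prompt)
    return confidence_from_logits(logits["yes"], logits["no"])
```

Note that equal logits give 0.5, and the score depends only on the difference `logit_yes - logit_no`, since the softmax is shift-invariant.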

Why logit-based beats free-form self-confidence

  • No verbalised-confidence miscalibration. LLMs asked "how confident are you?" tend to either always say "very confident" or produce numbers uncorrelated with correctness. Literature finds the direct logit read-off outperforms elicited verbal confidence (citations: [4] "Can LLMs Express Their Uncertainty?", [5] "Self-Evaluation Improves Selective Generation" in the PARSE post references).
  • Continuous signal, not discrete label. A scalar in [0, 1] allows threshold-tuning for different downstream policies (route to human at < 0.7, route to stronger LLM at < 0.85, ship at ≥ 0.85) — see concepts/llm-cascade and patterns/low-confidence-to-human-review.
  • No additional model needed. The verifier is usually the same LLM running a different prompt. No separate calibration model, no fine-tune required.
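The threshold-tuning point can be made concrete with a tiny routing policy; the threshold values here are the illustrative ones from the bullet above, not recommendations.

```python
def route(confidence: float,
          human_threshold: float = 0.7,
          ship_threshold: float = 0.85) -> str:
    """Map a self-verification confidence score to a downstream policy.

    Thresholds are example values; in practice they are tuned per
    attribute/task against a labeled set and downstream cost tolerance.
    """
    if confidence < human_threshold:
        return "human_review"
    if confidence < ship_threshold:
        return "stronger_llm"
    return "ship"
```

Because the score is a continuous scalar in [0, 1], changing policy is just moving the thresholds; no re-prompting or retraining is needed.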

What self-verification buys you

  • Quality routing. Low-confidence outputs can be sent to a stronger LLM or human review rather than all outputs getting uniform downstream processing.
  • Proactive error detection in production. Instead of waiting for a random-sample eval to catch bad outputs, the confidence score flags suspect outputs at emission time.
  • Per-attribute / per-task tuning. Different thresholds per downstream cost tolerance.

Tradeoffs / gotchas

  • Calibration is not free. The raw yes-logit may not be well-calibrated (Expected Calibration Error not necessarily small). Production systems usually layer a monotonic recalibration (isotonic / Platt scaling) on top of the raw logit using a small labeled set.
  • Self-verification agrees with itself on systematic errors. If the model is systematically wrong on some class of inputs (e.g. always misclassifies a new brand), it may also systematically verify itself as correct. This is why self-verification must be paired with an orthogonal sampling check — see patterns/human-in-the-loop-quality-sampling.
  • Extra inference cost. Each extraction now needs two LLM calls (extract + verify). The verify prompt is typically shorter and needs only a single output token, so it is cheaper, but the cost is non-zero.
  • Not all providers expose logits. If you're calling a hosted LLM API that only returns sampled text, you can't read the yes-token logit directly — you fall back to sampling multiple verifications and using the sample fraction, which needs more samples for the same resolution.
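The sampling fallback in the last bullet can be sketched as follows; `sample_verification` stands in for one temperature>0 call to the verification prompt that returns the sampled "yes" or "no" text.

```python
def sampled_confidence(sample_verification, k: int = 20) -> float:
    """Fallback confidence estimate when token logits are unavailable.

    Run the yes/no verification prompt k times at nonzero temperature
    and use the fraction of "yes" answers. Resolution is roughly 1/k,
    so k must grow to approach what a single logit read-off gives for
    free (and cost grows linearly with k).

    `sample_verification` is any zero-argument callable returning
    "yes" or "no" (a hypothetical wrapper around the hosted API).
    """
    yes = sum(1 for _ in range(k) if sample_verification() == "yes")
    return yes / k
```

For example, with k=10 samples of which 7 come back "yes", the estimate is 0.7, but the standard error of that estimate is far larger than the logit-based score's.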

Seen in

  • sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms — canonical wiki instance. Instacart's PARSE uses the entailment-prompt yes-logit technique to produce a per-extraction confidence score, which then drives low-confidence HITL routing in production. "We query the LLM with a second scoring prompt. The prompt will ask LLM to do an entailment task: asking LLM if the extracted attribute value by the extraction prompt is correct based on the product features and attribute definition… we specifically ask LLM to output 'yes' or 'no' first. Then we can get the logit of the first generated token, and compute the token probability of 'yes' as the confidence score." Cites AutoMix (2023) as the literature basis.