CONCEPT Cited by 2 sources
Token-probability as ranking signal¶
Definition¶
Token-probability as ranking signal is the technique of retaining the LLM's per-token output probability (a logit or a normalised probability) past the discrete decision and passing it to a downstream ranker as a continuous feature.
Instead of the typical flow LLM output → discrete label → downstream system, the pattern inserts one more source of information: the downstream system treats "the LLM thinks this segment is `name` with probability 0.92" and "the LLM thinks this segment is `name` with probability 0.55" differently, even though both produce the same discrete `name` label.
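The pattern can be sketched in a few lines (the record type and names here are hypothetical, not from either source): the only change from the typical flow is that the probability survives alongside the discrete tag.

```python
import math
from dataclasses import dataclass

@dataclass
class TaggedSegment:
    text: str
    tag: str         # discrete decision, e.g. "name"
    tag_prob: float  # probability of that tag's token, kept past sampling

def from_logprob(text: str, tag: str, logprob: float) -> TaggedSegment:
    # Inference APIs usually return log-probabilities; exponentiate to
    # recover the normalised probability for use as a feature.
    return TaggedSegment(text, tag, math.exp(logprob))

confident = from_logprob("joe's pizza", "name", math.log(0.92))
hesitant = from_logprob("the corner spot", "name", math.log(0.55))

# Both carry the same discrete label, but a downstream consumer
# can now treat them differently.
assert confident.tag == hesitant.tag == "name"
assert confident.tag_prob > hesitant.tag_prob
```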
Canonical wiki instances¶
Yelp query segmentation (2025-02-04)¶
"Among the different applications of this segmentation signal, we were able to (a) leverage token probabilities for (name) tags to improve our query to business name matching and ranking system."
Source: sources/2025-02-04-yelp-search-query-understanding-with-llms.
The segmentation model outputs both a tag (`{name}`) and the token-probability of that tag. Downstream, Yelp's business-name matching + ranking system treats the probability as a continuous feature: a high-probability `name` tag tells the ranker "weight exact-match on business name heavily"; a low-probability `name` tag tells the ranker "keep business-name matching as one signal among many".
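One way a ranker could consume the signal (this linear blend and its weights are illustrative, not Yelp's disclosed formula): the tag probability interpolates between "trust exact-match on business name" and "fall back to the other signals".

```python
def name_match_score(exact_match: bool, name_tag_prob: float,
                     other_signals: float) -> float:
    """Illustrative blend: high name-tag probability shifts weight onto
    the exact-match signal; low probability keeps business-name matching
    as one signal among many."""
    exact = 1.0 if exact_match else 0.0
    return name_tag_prob * exact + (1.0 - name_tag_prob) * other_signals

# Same exact match, different tagger confidence, different ranking score.
high_conf = name_match_score(True, 0.92, other_signals=0.4)
low_conf = name_match_score(True, 0.55, other_signals=0.4)
```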
Instacart PARSE LLM self-verification (2025-08-01)¶
"we specifically ask LLM to output 'yes' or 'no' first. Then we can get the logit of the first generated token, and compute the token probability of 'yes' as the confidence score."
Source: sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms.
Instacart's self-verification routes the entailment prompt through a second LLM call whose output-token probability is kept as the confidence score. Unlike Yelp (which feeds the probability into ranking), Instacart feeds the probability into human-review routing (low confidence → human auditor).
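The yes-token confidence can be reproduced with a softmax over the first generated token's logits (the logit values and review threshold here are made up; Instacart's aren't disclosed):

```python
import math

def yes_confidence(first_token_logits: dict) -> float:
    """Softmax the first output token's logits and return P('yes').
    The prompt is framed so the first generated token is 'yes' or 'no'."""
    exps = {tok: math.exp(v) for tok, v in first_token_logits.items()}
    return exps["yes"] / sum(exps.values())

REVIEW_THRESHOLD = 0.8  # illustrative cutoff

def route(first_token_logits: dict) -> str:
    # Low confidence -> human auditor, per the self-verification design.
    if yes_confidence(first_token_logits) >= REVIEW_THRESHOLD:
        return "auto-accept"
    return "human-review"
```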
The trick and its generality¶
The trick — don't discard the model's uncertainty at the sampling boundary — is generic across any LLM-driven component:
- Ranking (Yelp `name` tag).
- Routing (Instacart yes-token logit → human review).
- Thresholded rewriting (Yelp location rewrite is gated on "high confidence in the location intent" — likely a threshold on the `location` tag's token-probability, though the exact mechanism isn't disclosed).
- Cascade escalation (concepts/llm-cascade escalates low-confidence cheap-LLM outputs to the expensive LLM).
The unifying framing: the LLM's output distribution is strictly more information than the discrete sample from it; throwing it away is a lossy step you may not want to take.
Why this works at all¶
Two separate conditions:
- The LLM's output distribution has to be accessible. This requires API access to token log-probabilities (the OpenAI API's `logprobs` parameter; similar across Anthropic / Google). Some hosted LLM endpoints don't expose them; some open-weights inference stacks do by default.
- The probability has to be at least weakly calibrated. If the LLM's 0.9 probability is no more reliable than its 0.6 probability, the continuous feature carries no signal. Empirically, frontier LLMs' token probabilities on classification-style tasks are reasonably calibrated — but not perfectly; see concepts/llm-self-verification calibration caveats.
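Weak calibration is checkable with a reliability table: bucket past predictions by stated probability and compare each bucket's mean confidence to its empirical accuracy. A minimal sketch on synthetic data (not from either source):

```python
from collections import defaultdict

def reliability_buckets(preds, n_bins=10):
    """preds: iterable of (probability, was_correct) pairs.
    Returns {bin: (mean_confidence, empirical_accuracy)}. If accuracy
    rises with confidence, the probability carries signal; a flat
    accuracy curve means it does not."""
    bins = defaultdict(list)
    for prob, correct in preds:
        bins[min(int(prob * n_bins), n_bins - 1)].append((prob, correct))
    report = {}
    for b, items in sorted(bins.items()):
        mean_conf = sum(p for p, _ in items) / len(items)
        accuracy = sum(c for _, c in items) / len(items)
        report[b] = (round(mean_conf, 3), round(accuracy, 3))
    return report

table = reliability_buckets(
    [(0.95, 1), (0.92, 0), (0.90, 1), (0.58, 1), (0.55, 0), (0.65, 0)]
)
```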
Tradeoffs / gotchas¶
- Token-probability is not necessarily task-probability. The probability of "yes" being the next token is not exactly the probability that the extracted value is correct — it's a proxy. Literature on self-verification discusses calibration; Yelp and Instacart both treat the approximation as good-enough.
- Prompt engineering affects the probability. Framing that produces a single-token first output (`yes`/`no`, `name`/`location`/…) is necessary for clean probability extraction; multi-token outputs require deeper aggregation.
- Latency / cost. Requesting `logprobs` is typically free at the API level but requires you to log and persist the probabilities alongside the discrete labels.
- Distributional drift invalidates stored probabilities. If the LLM is updated by the provider, cached probabilities from the old model are no longer comparable to new probabilities. Version-lock the LLM to keep the continuous feature stable.
- Threshold-tuning is per-downstream-consumer. The same token probability on the same tag might correspond to a different optimal threshold in a ranker vs. a human-review gate vs. a rewrite gate. Each consumer tunes its own threshold.
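When the output can't be framed as a single token, the usual aggregation is to sum per-token logprobs (equivalently, multiply per-token probabilities). A minimal sketch:

```python
import math

def sequence_prob(token_logprobs):
    """Probability of a multi-token output: product of per-token
    probabilities, i.e. exp of the summed logprobs. Longer outputs push
    this toward 0, which is one reason single-token framings ('yes'/'no')
    give the cleanest confidence signal."""
    return math.exp(sum(token_logprobs))

# A two-token answer at 0.9 and 0.8 per-token probability:
p = sequence_prob([math.log(0.9), math.log(0.8)])
```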
Seen in¶
- sources/2025-02-04-yelp-search-query-understanding-with-llms — Yelp's `{name}`-tag token probability used as a continuous feature in business-name matching + ranking.
- sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms — Instacart's yes-token logit used as a confidence score for human-review routing (second-LLM-call variant).
Related¶
- concepts/llm-self-verification — the entailment-prompt variant that produces a token-probability confidence score.
- concepts/query-understanding — the task family where this pattern originates on the wiki.
- concepts/implicit-query-location-rewrite — the confidence-gated rewrite consumer of this signal.
- systems/yelp-query-understanding — canonical wiki instance for ranking use-case.
- systems/instacart-parse — canonical wiki instance for human-review-routing use-case.
- companies/yelp / companies/instacart