
Token-probability as ranking signal

Definition

Token-probability as ranking signal is the technique of retaining the LLM's per-token output probability (a logit or a normalised probability) past the discrete sampling decision and exposing it to a downstream ranker as a continuous feature.

Instead of the typical flow LLM output → discrete label → downstream system, the pattern inserts one more source of information:

LLM output → discrete label + token probability
                                     ↓
                            downstream ranker
                            (as continuous feature)

The downstream system treats "the LLM thinks this segment is name with probability 0.92" and "the LLM thinks this segment is name with probability 0.55" differently, even though both produce the same discrete name label.
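The pattern can be sketched in a few lines. This is an illustration, not either company's implementation: it assumes the inference API hands back logprobs for a set of candidate tokens, softmaxes them, and keeps both the argmax label and its probability instead of discarding the latter.

```python
import math

def label_with_probability(token_logprobs: dict[str, float]) -> tuple[str, float]:
    """Softmax the candidate tokens' logprobs, then return both the
    discrete label (argmax) and its normalised probability — the
    continuous feature a downstream consumer can use."""
    exps = {tok: math.exp(lp) for tok, lp in token_logprobs.items()}
    total = sum(exps.values())
    label = max(exps, key=exps.get)
    return label, exps[label] / total

# Two calls yielding the same discrete label but different confidence
# (the logprob values here are made up):
confident = label_with_probability({"name": -0.1, "location": -3.0})
uncertain = label_with_probability({"name": -0.7, "location": -0.9})
```

Both calls label the segment `name`; only the retained probability distinguishes the ~0.95 case from the ~0.55 case.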

Canonical wiki instances

Yelp query segmentation (2025-02-04)

"Among the different applications of this segmentation signal, we were able to (a) leverage token probabilities for (name) tags to improve our query to business name matching and ranking system."

Source: sources/2025-02-04-yelp-search-query-understanding-with-llms.

The segmentation model outputs both a tag ({name}) and the token-probability of that tag. Downstream, Yelp's business-name matching + ranking system treats the probability as a continuous feature: a high-probability name tag tells the ranker "weight exact-match on business name heavily"; a low-probability name tag tells the ranker "keep business-name matching as one signal among many".
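One way this could look inside a ranker — emphatically not Yelp's actual scoring function, just a toy illustration of probability-as-feature, with every field name (`name_tag_prob`, `base_relevance`, etc.) invented here:

```python
def score_business(query_features: dict, business: dict) -> float:
    """Toy ranking score: the name-tag probability scales how much an
    exact business-name match contributes relative to other relevance
    signals, instead of a hard name/not-name switch."""
    name_prob = query_features["name_tag_prob"]  # continuous feature from the LLM
    exact = 1.0 if query_features["name_span"] == business["name"].lower() else 0.0
    return name_prob * exact + (1.0 - name_prob) * business["base_relevance"]

biz = {"name": "Joe's Diner", "base_relevance": 0.4}
hi = score_business({"name_tag_prob": 0.92, "name_span": "joe's diner"}, biz)
lo = score_business({"name_tag_prob": 0.55, "name_span": "joe's diner"}, biz)
```

The same exact-match fires in both calls; the probability decides how dominant it is.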

Instacart PARSE LLM self-verification (2025-08-01)

"we specifically ask LLM to output 'yes' or 'no' first. Then we can get the logit of the first generated token, and compute the token probability of 'yes' as the confidence score."

Source: sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms.

Instacart's self-verification routes the entailment prompt through a second LLM call whose output-token probability is kept as the confidence score. Unlike Yelp (which feeds the probability into ranking), Instacart feeds the probability into human-review routing (low confidence → human auditor).
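A minimal sketch of that routing step, under stated assumptions: the API returns logprobs for the first generated token's candidates, and the review threshold (0.8 here) is invented — Instacart doesn't disclose theirs.

```python
import math

REVIEW_THRESHOLD = 0.8  # hypothetical cutoff; the real value isn't disclosed

def yes_confidence(first_token_logprobs: dict[str, float]) -> float:
    """Probability mass on 'yes' as the first generated token,
    softmaxed over the candidates the API returned logprobs for."""
    exps = {tok: math.exp(lp) for tok, lp in first_token_logprobs.items()}
    return exps.get("yes", 0.0) / sum(exps.values())

def route(first_token_logprobs: dict[str, float]) -> str:
    """Low-confidence extractions go to a human auditor."""
    conf = yes_confidence(first_token_logprobs)
    return "auto-accept" if conf >= REVIEW_THRESHOLD else "human-review"
```

A near-certain "yes" (logprob close to 0) auto-accepts; a near-coin-flip between "yes" and "no" is escalated.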

The trick and its generality

The trick — don't discard the model's uncertainty at the sampling boundary — is generic across any LLM-driven component:

  • Ranking (Yelp name tag).
  • Routing (Instacart yes-token logit → human review).
  • Thresholded rewriting (Yelp location rewrite is gated on "high confidence in the location intent" — likely a threshold on the location tag's token-probability, though the exact mechanism isn't disclosed).
  • Cascade escalation (concepts/llm-cascade escalates low-confidence cheap-LLM outputs to the expensive LLM).

The unifying framing: the LLM's output distribution is strictly more information than the discrete sample from it; throwing it away is a lossy step you may not want to take.

Why this works at all

Two separate conditions:

  1. The LLM's output distribution has to be accessible. This requires the inference API to expose logprobs (e.g. the OpenAI API's logprobs parameter); some hosted LLM endpoints don't expose them at all, while many open-weights inference stacks do by default.
  2. The probability has to be at least weakly calibrated. If the LLM's 0.9 probability is no more reliable than its 0.6 probability, the continuous feature carries no signal. Empirically, frontier LLMs' token probabilities on classification-style tasks are reasonably calibrated — but not perfectly; see concepts/llm-self-verification calibration caveats.
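Condition 2 is checkable offline. A minimal reliability check (not from either source): bucket predictions by stated probability and compare mean confidence against empirical accuracy per bucket — for a calibrated model the two roughly match.

```python
from collections import defaultdict

def reliability_table(preds: list[tuple[float, bool]], n_bins: int = 5) -> dict:
    """preds: (predicted_probability, was_correct) pairs.
    Returns {bin: (mean confidence, empirical accuracy)}; a calibrated
    model has the two values roughly matching in every bin."""
    bins = defaultdict(list)
    for p, ok in preds:
        bins[min(int(p * n_bins), n_bins - 1)].append((p, ok))
    return {
        b: (sum(p for p, _ in rows) / len(rows),    # mean stated confidence
            sum(ok for _, ok in rows) / len(rows))  # empirical accuracy
        for b, rows in sorted(bins.items())
    }

# Synthetic data: 0.95-confidence predictions right 9/10 times,
# 0.55-confidence predictions right 6/10 times — weakly calibrated.
preds = [(0.95, True)] * 9 + [(0.95, False)] \
      + [(0.55, True)] * 6 + [(0.55, False)] * 4
table = reliability_table(preds)
```

If the high-confidence bucket isn't more accurate than the low-confidence one, the continuous feature is noise and should be dropped.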

Tradeoffs / gotchas

  • Token-probability is not necessarily task-probability. The probability of "yes" being the next token is not exactly the probability that the extracted value is correct — it's a proxy. Literature on self-verification discusses calibration; Yelp and Instacart both treat the approximation as good-enough.
  • Prompt engineering affects the probability. Framing the task so the first output token is a single decisive token (yes/no, name/location/…) is necessary for clean probability extraction; multi-token outputs require aggregating per-token probabilities (e.g. summing logprobs across the span), which is noisier.
  • Latency / cost. Requesting logprobs typically adds no API cost or latency, but it does require you to log and persist the probabilities alongside the discrete labels.
  • Distributional drift invalidates stored probabilities. If the LLM is updated by the provider, cached probabilities from the old model are no longer comparable to new probabilities. Version-lock the LLM to keep the continuous feature stable.
  • Threshold-tuning is per-downstream-consumer. The same token probability on the same tag might correspond to a different optimal threshold in a ranker vs. a human-review gate vs. a rewrite gate. Each consumer tunes its own threshold.
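The per-consumer point can be made concrete with a toy threshold sweep (illustrative only — the data and cost ratios below are made up): two consumers with different false-positive/false-negative costs land on different optimal cutoffs over the exact same probabilities.

```python
def best_threshold(labeled: list[tuple[float, bool]],
                   cost_fp: float, cost_fn: float) -> float:
    """labeled: (token_probability, truly_positive) pairs.
    Pick the cutoff minimising this consumer's expected cost; different
    cost ratios (ranker vs. human-review gate) yield different cutoffs."""
    candidates = sorted({p for p, _ in labeled})
    def cost(t):
        return sum(cost_fp if (p >= t and not y)       # accepted a negative
                   else cost_fn if (p < t and y)       # rejected a positive
                   else 0
                   for p, y in labeled)
    return min(candidates, key=cost)

data = [(0.2, True), (0.4, False), (0.6, True), (0.8, True)]
lenient = best_threshold(data, cost_fp=1, cost_fn=1)  # e.g. ranker feature gate
strict = best_threshold(data, cost_fp=5, cost_fn=1)   # e.g. auto-accept gate
```

Same probabilities, two consumers, two thresholds — which is why a single global cutoff is the wrong design.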
