
CONCEPT

pass@k

pass@k asks: for a given scenario, over k independent attempts, does the agent succeed on at least one of them?

It is a simple, widely-used metric in LLM / code-generation / agent evaluation that acknowledges the non-determinism of the system under test. pass@1 is the classic deterministic pass/fail. pass@k with k > 1 separates capability ("the agent can do this sometimes") from reliability ("the agent does this consistently").
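With n ≥ k recorded runs per scenario, the standard way to compute this is the unbiased estimator popularised by the Codex/HumanEval evaluation: pass@k = 1 − C(n−c, k)/C(n, k), where c is the number of successful runs. A minimal sketch in Python:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n recorded runs of one scenario, c of them successful,
    k hypothetical independent attempts. Equals the probability that at least
    one of k draws without replacement from the n runs is a success."""
    if n - c < k:
        return 1.0  # fewer failures than attempts: some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to c / n, the empirical pass rate, so pass@1 and pass@k come out of the same run records.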

Why it matters for agent evaluation

LLM agents sample from a distribution: the same inputs can produce different tool-call sequences and different final answers across runs. With a single attempt per scenario:

  • A scenario the agent solves 80% of the time and one it solved once out of five tries both register as a single pass or fail. Averaged over a label set that yields a useful pass rate, but per scenario it collapses "probably reliable" and "got lucky" into the same-looking result.

pass@k lets you disentangle:

  • High pass@1, high pass@k — reliable.
  • Low pass@1, high pass@k — capable but sampling-unstable; a temperature / prompt / retry-policy problem more than a capability problem.
  • Low pass@k — capability gap; no amount of retry saves this scenario.

Segmenting a label set by (pass@1, pass@k) surfaces which scenarios need better sampling and which need model / prompt / tool-set upgrades.
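This segmentation can be sketched as a small triage function. The bucket names and the 0.7 threshold below are illustrative assumptions, not fixed conventions:

```python
from math import comb

def pass_at_k(results: list[bool], k: int) -> float:
    """Unbiased pass@k from the recorded pass/fail runs of one scenario."""
    n, c = len(results), sum(results)
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def triage(results: list[bool], k: int = 5, hi: float = 0.7) -> str:
    """Hypothetical (pass@1, pass@k) segmentation of a single scenario."""
    p1, pk = pass_at_k(results, 1), pass_at_k(results, k)
    if p1 >= hi and pk >= hi:
        return "reliable"
    if pk >= hi:
        return "sampling-unstable: tune temperature / prompt / retry policy"
    return "capability gap: needs model / prompt / tool-set work"
```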

Typical uses

  • Capability tracking across agent versions: pass@k trending up without pass@1 trending up = wider capability envelope but same reliability.
  • Sampling-strategy tuning: ensemble / best-of-k / self-consistency strategies raise pass@1 toward pass@k if the delta is large.
  • Cost/quality trade-off selection: deploy k independent attempts plus judge selection when the pass@k–pass@1 delta for the use case justifies the k× inference cost.

Composition with trajectory scoring

pass@k composes with concepts/trajectory-evaluation: a scenario with low pass@k but high trajectory scores on failing runs is one where the agent is investigating well but synthesising wrong; a scenario with low pass@k and low trajectory scores points at a tool / context / data-access gap earlier in the pipeline.
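The two-axis diagnosis above can be written down directly. A minimal sketch, assuming trajectory scores are normalised to [0, 1] and using an illustrative 0.7 threshold:

```python
def diagnose(pk: float, traj: float, hi: float = 0.7) -> str:
    """Hypothetical triage combining outcome pass@k with the mean
    trajectory score of the failing runs."""
    if pk >= hi:
        return "outcome ok"
    if traj >= hi:
        return "investigating well, synthesising wrong"
    return "tool / context / data-access gap earlier in the pipeline"
```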
