pass@k¶
pass@k asks: for a given scenario, over k independent attempts, does the agent succeed on at least one of them?
It is a simple, widely used metric in LLM / code-generation / agent evaluation that acknowledges the non-determinism of the system under test. pass@1 is the classic single-attempt pass rate. pass@k with k > 1 separates capability ("the agent can do this sometimes") from reliability ("the agent does this consistently").
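Given n recorded runs of a scenario, of which c passed, pass@k is usually computed with the standard unbiased estimator from the code-generation literature rather than by literally sampling k runs. A minimal sketch:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    Probability that at least one of k attempts, drawn without
    replacement from n recorded runs of which c succeeded, passes:
    1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failures recorded: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 2 runs and c = 1 pass, `pass_at_k(2, 1, 1)` is 0.5, and `pass_at_k(2, 1, 2)` is 1.0: two attempts are guaranteed to include the passing run.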
Why it matters for agent evaluation¶
LLM agents sample from a distribution. The same inputs can produce different tool-call sequences and different final answers across runs. With a single attempt:
- A scenario the agent solves 80% of the time and one it solves 20% of the time can each show a single pass or a single fail; one run collapses "probably reliable" and "lucky once out of five tries" into similar-looking results.
pass@k lets you disentangle:
- High pass@1, high pass@k — reliable.
- Low pass@1, high pass@k — capable but sampling-unstable; a temperature / prompt / retry-policy problem more than a capability problem.
- Low pass@k — capability gap; no amount of retry saves this scenario.
Segmenting a label set by (pass@1, pass@k) surfaces which scenarios need better sampling and which need model / prompt / tool-set upgrades.
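The segmentation described above can be sketched as a small triage function; the bucket names follow the list above, and the thresholds are illustrative assumptions, not values from the source:

```python
def segment(pass_1: float, pass_k: float,
            hi: float = 0.8, lo: float = 0.4) -> str:
    """Bucket a scenario by (pass@1, pass@k). Thresholds are illustrative."""
    if pass_1 >= hi and pass_k >= hi:
        return "reliable"
    if pass_1 < hi and pass_k >= hi:
        # Capable but sampling-unstable: tune temperature / prompt / retries.
        return "capable-but-unstable"
    # Low pass@k: capability gap; retries alone will not save this scenario.
    return "capability-gap"
```

A scenario with pass@1 = 0.2 and pass@k = 0.9 lands in the sampling bucket; one with pass@k = 0.2 lands in the capability bucket regardless of pass@1.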
Typical uses¶
- Capability tracking across agent versions: pass@k trending up without pass@1 trending up = wider capability envelope but same reliability.
- Sampling-strategy tuning: ensemble / best-of-k / self-consistency strategies can raise effective pass@1 toward pass@k when the delta is large.
- Cost/quality trade-off selection: deploy k independent attempts plus judge-selection when the pass@k–pass@1 delta for the use case justifies the k× inference cost.
Composition with trajectory scoring¶
pass@k composes with concepts/trajectory-evaluation: a scenario with low pass@k but high trajectory scores on failing runs is one where the agent investigates well but synthesises the wrong answer; a scenario with low pass@k and low trajectory scores points at a tool / context / data-access gap earlier in the pipeline.
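That composition can be sketched as a triage rule; the labels and thresholds below are illustrative assumptions, not part of the source:

```python
def diagnose(pass_k: float, traj_score_failing: float,
             pass_thresh: float = 0.5, traj_thresh: float = 0.7) -> str:
    """Combine pass@k with the mean trajectory score of failing runs.

    Thresholds and labels are illustrative, not prescribed by the source.
    """
    if pass_k >= pass_thresh:
        return "ok"
    if traj_score_failing >= traj_thresh:
        # Agent investigates well but synthesises the wrong answer.
        return "good-investigation-wrong-synthesis"
    # Low scores everywhere: tool / context / data-access gap upstream.
    return "tool-context-data-access-gap"
```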
Seen in¶
- sources/2026-04-07-datadog-bits-ai-sre-eval-platform — Datadog names pass@k as one of the label attributes the Bits AI SRE evaluation platform tracks over time, alongside "consistently passing" and "consistently failing" classifications. Used to understand agent success evolution, strong and weak domains, and to prioritise label-set expansion into scenarios where the agent currently fails.