

Trajectory evaluation

Trajectory evaluation scores an agent on how it investigated, not only whether the final answer was correct. For a given evaluation label, it asks:

  • How close did the agent get to the correct answer?
  • Did it investigate deeply enough?
  • Did it surface valuable telemetry along the way, even if it didn't reach the right conclusion?
  • What tools did it call, in what order, with what parameters?

Contrast this with terminal-answer scoring: a 1/0 on "was the final output correct?" That is all you can score when the label is just a question plus an expected answer.
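The distinction can be sketched minimally. This is an illustrative data model, not any particular framework's API; the record fields and helper names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str      # e.g. "query_metrics"
    params: dict   # parameters the agent passed

@dataclass
class Trajectory:
    tool_calls: list      # ordered ToolCall records
    final_answer: str

def terminal_score(traj: Trajectory, expected: str) -> int:
    """Terminal-answer scoring: 1/0 on the final output only."""
    return int(expected.lower() in traj.final_answer.lower())

# A trajectory evaluator also gets to look at how the agent got there:
def called_tool(traj: Trajectory, tool_name: str) -> bool:
    return any(c.name == tool_name for c in traj.tool_calls)
```

Everything `terminal_score` ignores (which tools ran, in what order, with what parameters) is exactly the signal trajectory evaluation recovers.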

Why it matters

LLM agents often produce the right answer for the wrong reason, or the wrong answer via a promising investigation. Terminal-answer scoring collapses both cases into a binary outcome with no middle ground, which makes it a weak regression signal during iteration. Trajectory scoring exposes:

  • Partial credit — an agent that identified the right service but not the right line of code is better than one that pointed at an unrelated component.
  • Latent capability — an agent that surfaced the right telemetry and then failed to synthesise it is one prompt tweak away from working; one that never even queried the right signal is farther.
  • Reasoning-shape regressions — a prompt change that quietly makes the agent stop using a valuable tool shows up here even when the final-answer score is stable.

Dependency on label quality

Trajectory scoring is only as good as the label. If the label's ground-truth RCA is shallow ("service X was slow"), the best it can score is "did the agent identify service X". A ground-truth RCA that survives a "5 Whys" postmortem supports scoring "did the agent correctly walk the causal chain." Datadog reports that raising label quality to postmortem grade (RCAs roughly 30% higher in quality) was what unlocked meaningful trajectory evaluation in practice (Source: sources/2026-04-07-datadog-bits-ai-sre-eval-platform).
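One way to see how label depth caps score resolution: if the label is a causal chain (as from a "5 Whys" postmortem), partial credit can be the fraction of chain steps the agent surfaced. This is an illustrative sketch with naive string matching; the function name and matching rule are assumptions:

```python
def causal_chain_coverage(chain: list[str], agent_findings: list[str]) -> float:
    """Fraction of ground-truth causal-chain steps the agent surfaced.

    A shallow label (chain of length 1) can only score "did the agent
    identify service X"; a postmortem-grade chain lets partial credit
    distinguish how far down the causal chain the agent got.
    """
    if not chain:
        return 0.0
    findings = {f.lower() for f in agent_findings}
    hits = sum(1 for step in chain if step.lower() in findings)
    return hits / len(chain)
```

A real implementation would use semantic matching (or an LLM judge) rather than exact strings, but the resolution argument is the same: the score's ceiling is the chain's depth.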

Implementation

In practice, trajectory scoring is done by an LLM judge given the full trace (tool calls, tool outputs, intermediate reasoning, and the final answer) plus the ground-truth RCA and a rubric. The judge emits component scores (correctness, depth, telemetry-surfacing) rather than a single pass/fail.
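A minimal sketch of the judge plumbing, assuming the judge model is asked to return JSON. The rubric wording, field names, and prompt layout are illustrative; the actual model call is omitted:

```python
import json

# Illustrative rubric: ask for component scores, not pass/fail.
RUBRIC = (
    'Score the agent\'s investigation against the ground-truth RCA. '
    'Return JSON: {"correctness": 0-1, "depth": 0-1, '
    '"telemetry_surfacing": 0-1, "rationale": "..."}'
)

def build_judge_prompt(trace: str, ground_truth_rca: str) -> str:
    # The judge sees the rubric, the ground truth, and the full trace.
    return f"{RUBRIC}\n\nGround-truth RCA:\n{ground_truth_rca}\n\nAgent trace:\n{trace}"

def parse_judge_output(raw: str) -> dict:
    scores = json.loads(raw)
    return {k: scores[k] for k in ("correctness", "depth", "telemetry_surfacing")}
```

The component breakdown is what makes the earlier distinctions (partial credit, latent capability) visible: a run can score high on telemetry-surfacing and low on correctness.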

Composition with pass@k

Trajectory scoring composes naturally with concepts/pass-at-k: report trajectory metrics across k independent runs, not just final-answer success. A scenario where pass@1 is low but pass@k is high and trajectory scores are high usually means the agent has the capability but is sampling-unstable; that calls for a different fix than a scenario where trajectory scores are uniformly low.
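Assuming each scenario is run k times and each run yields a boolean final-answer verdict plus a scalar trajectory score, the aggregation might look like this (the run-record shape is an assumption):

```python
from statistics import mean

def summarize(runs: list[dict]) -> dict:
    """Aggregate k independent runs of one scenario.

    runs: one dict per run, e.g. {"correct": bool, "trajectory": float}.
    """
    k = len(runs)
    return {
        "pass@1": mean(r["correct"] for r in runs),          # per-run success rate
        f"pass@{k}": float(any(r["correct"] for r in runs)), # any run succeeded
        "mean_trajectory": mean(r["trajectory"] for r in runs),
    }
```

The diagnostic pattern from above falls out directly: low `pass@1`, `pass@k == 1.0`, and a high `mean_trajectory` points at sampling instability rather than a capability gap.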
