

Trajectory evaluation

Trajectory evaluation scores an agent on how it investigated, not only whether the final answer was correct. For a given evaluation label, it asks:

  • How close did the agent get to the correct answer?
  • Did it investigate deeply enough?
  • Did it surface valuable telemetry along the way, even if it didn't reach the right conclusion?
  • What tools did it call, in what order, with what parameters?

Contrast this with terminal-answer scoring: a 1/0 on "was the final output correct?" That is all you can score when the label is just a question plus an expected answer.
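The distinction can be sketched minimally. This is an illustrative data model, not any particular framework's API; the record fields and helper names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str      # e.g. "query_metrics"
    params: dict   # parameters the agent passed

@dataclass
class Trajectory:
    tool_calls: list      # ordered ToolCall records
    final_answer: str

def terminal_score(traj: Trajectory, expected: str) -> int:
    """Terminal-answer scoring: 1/0 on the final output only."""
    return int(expected.lower() in traj.final_answer.lower())

# A trajectory evaluator also gets to look at how the agent got there:
def called_tool(traj: Trajectory, tool_name: str) -> bool:
    return any(c.name == tool_name for c in traj.tool_calls)
```

Everything `terminal_score` ignores (which tools ran, in what order, with what parameters) is exactly the signal trajectory evaluation recovers.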

Why it matters

LLM agents often produce the right answer for the wrong reason, or the wrong answer via a promising investigation. Terminal-answer scoring collapses both cases into a binary outcome with no middle ground, which makes it a weak regression signal during iteration. Trajectory scoring exposes:

  • Partial credit — an agent that identified the right service but not the right line of code is better than one that pointed at an unrelated component.
  • Latent capability — an agent that surfaced the right telemetry and then failed to synthesise it is one prompt tweak away from working; one that never even queried the right signal is farther.
  • Reasoning-shape regressions — a prompt change that quietly makes the agent stop using a valuable tool shows up here even when the final-answer score is stable.

Dependency on label quality

Trajectory scoring is only as good as the label. If the label's ground-truth RCA is shallow ("service X was slow"), the best it can score is "did the agent identify service X". A ground-truth RCA that survives a "5 Whys" postmortem supports scoring "did the agent correctly walk the causal chain." Datadog reports that raising label quality to postmortem grade (RCAs roughly 30% higher in quality) was what unlocked meaningful trajectory evaluation in practice (Source: sources/2026-04-07-datadog-bits-ai-sre-eval-platform).
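One way to see how label depth caps score resolution: if the label is a causal chain (as from a "5 Whys" postmortem), partial credit can be the fraction of chain steps the agent surfaced. This is an illustrative sketch with naive string matching; the function name and matching rule are assumptions:

```python
def causal_chain_coverage(chain: list[str], agent_findings: list[str]) -> float:
    """Fraction of ground-truth causal-chain steps the agent surfaced.

    A shallow label (chain of length 1) can only score "did the agent
    identify service X"; a postmortem-grade chain lets partial credit
    distinguish how far down the causal chain the agent got.
    """
    if not chain:
        return 0.0
    findings = {f.lower() for f in agent_findings}
    hits = sum(1 for step in chain if step.lower() in findings)
    return hits / len(chain)
```

A real implementation would use semantic matching (or an LLM judge) rather than exact strings, but the resolution argument is the same: the score's ceiling is the chain's depth.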

Implementation

In practice, trajectory scoring is done by an LLM judge given the full trace (tool calls, tool outputs, intermediate reasoning, and the final answer) plus the ground-truth RCA and a rubric. The judge emits component scores (correctness, depth, telemetry-surfacing) rather than a single pass/fail.
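A minimal sketch of the judge plumbing, assuming the judge model is asked to return JSON. The rubric wording, field names, and prompt layout are illustrative; the actual model call is omitted:

```python
import json

# Illustrative rubric: ask for component scores, not pass/fail.
RUBRIC = (
    'Score the agent\'s investigation against the ground-truth RCA. '
    'Return JSON: {"correctness": 0-1, "depth": 0-1, '
    '"telemetry_surfacing": 0-1, "rationale": "..."}'
)

def build_judge_prompt(trace: str, ground_truth_rca: str) -> str:
    # The judge sees the rubric, the ground truth, and the full trace.
    return f"{RUBRIC}\n\nGround-truth RCA:\n{ground_truth_rca}\n\nAgent trace:\n{trace}"

def parse_judge_output(raw: str) -> dict:
    scores = json.loads(raw)
    return {k: scores[k] for k in ("correctness", "depth", "telemetry_surfacing")}
```

The component breakdown is what makes the earlier distinctions (partial credit, latent capability) visible: a run can score high on telemetry-surfacing and low on correctness.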

Composition with pass@k

Trajectory scoring composes naturally with concepts/pass-at-k: report trajectory metrics across k independent runs, not just final-answer success. A scenario where pass@1 is low but pass@k is high and trajectory scores are high usually means the agent has the capability but is sampling-unstable; that calls for a different fix than a scenario where trajectory scores are uniformly low.
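Assuming each scenario is run k times and each run yields a boolean final-answer verdict plus a scalar trajectory score, the aggregation might look like this (the run-record shape is an assumption):

```python
from statistics import mean

def summarize(runs: list[dict]) -> dict:
    """Aggregate k independent runs of one scenario.

    runs: one dict per run, e.g. {"correct": bool, "trajectory": float}.
    """
    k = len(runs)
    return {
        "pass@1": mean(r["correct"] for r in runs),          # per-run success rate
        f"pass@{k}": float(any(r["correct"] for r in runs)), # any run succeeded
        "mean_trajectory": mean(r["trajectory"] for r in runs),
    }
```

The diagnostic pattern from above falls out directly: low `pass@1`, `pass@k == 1.0`, and a high `mean_trajectory` points at sampling instability rather than a capability gap.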
