Evaluation label¶
An evaluation label is the unit of offline agent evaluation. It has two parts, and the key design move is that only one of them is visible to the agent under test:
| Part | Content | Agent sees? |
|---|---|---|
| Ground truth | The actual resolution / root cause of the real incident the label was derived from | No |
| World snapshot | The signals that existed at the moment the issue occurred — the queries, metadata, and relationships the agent would need to investigate (but not raw expired telemetry) | Yes |
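The two-part split can be sketched as a data structure. A minimal Python sketch (all names hypothetical, not from the source):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldSnapshot:
    """What the agent may see: queries and metadata, never raw telemetry."""
    telemetry_queries: dict[str, str]  # signal name -> query that locates it
    metadata: dict[str, str]           # e.g. service names, deploy versions


@dataclass(frozen=True)
class EvaluationLabel:
    ground_truth: str              # hidden from the agent under test
    world_snapshot: WorldSnapshot  # the only part exposed to the agent

    def agent_view(self) -> WorldSnapshot:
        # The answer never crosses this boundary.
        return self.world_snapshot
```

The point of the sketch is the single exposed accessor: evaluation harness code calls `agent_view()` and nothing else, so the ground truth stays on the scoring side.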
The split reflects the production constraint. In a live investigation the agent never sees the answer, only the evidence the operator would see. Evaluation that gives the agent privileged access (the answer, or a perfectly scoped evidence bundle) will overstate production quality.
Why the world snapshot is queries, not telemetry¶
Raw telemetry has a TTL: metrics, logs, and traces expire on vendor-defined retention windows. A label whose world snapshot is raw bytes becomes unreplayable the moment its source data expires.
Datadog's design instead stores how to find the signals the agent would need — where memory metrics live, how to query container logs, what deployment events are in scope — and reconstructs the investigation environment from those queries. The schema of the world outlives the data. See concepts/telemetry-ttl-one-way-door for what goes wrong when the reconstruction step itself is deferred past expiry.
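A minimal sketch of that idea, assuming a hypothetical `execute` client; the query strings are illustrative stand-ins, not Datadog's actual query syntax:

```python
# Hypothetical sketch: the snapshot records *how to find* each signal,
# not the signal bytes themselves. Query strings are illustrative.
snapshot = {
    "memory_metrics": "avg:container.memory.usage{pod:checkout-*}",
    "container_logs": 'service:checkout "OOMKilled"',
    "deploy_events": "tags:(deploy AND service:checkout)",
}


def reconstruct_environment(snapshot: dict, execute) -> dict:
    # `execute` stands in for whatever client runs a query against the
    # live backend. Results are fetched at evaluation time, so the label
    # itself never stores raw bytes that a retention window could expire.
    return {name: execute(query) for name, query in snapshot.items()}
```

Because only query strings are persisted, the label stays replayable as long as the backends that answer those queries exist, which is the sense in which the schema of the world outlives the data.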
Why the agent must not see the ground truth¶
A label contaminated with the answer degenerates into pattern-matching: the agent learns "when I see label.ground_truth == 'OOM', say OOM". That isn't investigation, and the scores it produces don't predict production performance. The separation of ground truth from world snapshot is load-bearing.
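One way to enforce the separation at the code boundary (a hypothetical sketch, not the platform's actual interface):

```python
def build_agent_input(label: dict) -> dict:
    # Allowlist, not blocklist: only the world snapshot crosses the
    # boundary, so a field added to the label schema later cannot
    # leak the answer by default.
    return {"world_snapshot": label["world_snapshot"]}
```

The allowlist direction is the design choice worth noting: a blocklist that strips `ground_truth` would silently pass through any new answer-bearing field added to the label later.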
Composition with other evaluation decisions¶
An evaluation label is only one layer; real platforms need:
- Noise injection — the world snapshot must include unrelated components, not only the signals directly tied to the root cause. See concepts/noise-injection-in-evaluation and patterns/noisy-simulated-evaluation-environment.
- Trajectory scoring — a label is more useful when its ground truth can support depth-of-investigation scoring (concepts/trajectory-evaluation), not just a final-answer check.
- Confidence scoring of the label itself — not all labels are equally good; a label whose ground-truth RCA wouldn't survive a "5 Whys" postmortem should be flagged for human review or dropped.
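The last point can be sketched as a triage step. A hypothetical heuristic, using the recorded causal-chain depth (a field assumed here, not from the source) as a crude proxy for whether the RCA would survive a "5 Whys" postmortem:

```python
def triage_label(label: dict, min_causal_depth: int = 3) -> str:
    # Crude proxy for "would this RCA survive a 5 Whys postmortem":
    # count the causal steps recorded behind the ground truth.
    depth = len(label.get("causal_chain", []))
    if depth >= min_causal_depth:
        return "keep"
    if depth >= 1:
        return "human_review"
    return "drop"
```

The threshold and the routing to human review are placeholders; the real signal is that label quality is itself scored, not assumed.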
Seen in¶
- sources/2026-04-07-datadog-bits-ai-sre-eval-platform — canonical articulation in the Bits AI SRE evaluation platform. Worked example: "a label might define the root cause as a Kubernetes pod being OOM killed, with a world snapshot that preserves the telemetry queries the agent would need — such as where to find memory metrics, container logs, and deployment events — rather than raw data."