
Noise injection in evaluation

Noise injection in evaluation is the counter-intuitive discipline of making the simulated environment an agent is evaluated in messier, not cleaner — by including signals, components, and context that are unrelated to the scenario under test.

The naive mistake

The intuitive version of a replayable evaluation environment for an agent contains exactly the signals relevant to the root cause:

  • the failing pod,
  • the memory metrics,
  • the deployment events near the failure.

Anything else is noise, so it's omitted. This is wrong for exactly the reason it feels right: the agent is effectively given an open-book exam with only the relevant pages.

In production, a real incident looks like:

  • one failing pod, amid hundreds of healthy pods,
  • memory metrics for the failing service, alongside metrics for dozens of other services on the same platform, team, or monitor that happen to be fluctuating for unrelated reasons,
  • the deployment event that actually caused the regression, alongside unrelated deployment events in nearby services,
  • services with similar names that an inattentive agent can confuse,
  • red herrings from other teams' in-flight incidents or capacity tests.

An SRE investigator's real skill is sifting. Evaluating the agent on a scenario where the sifting has already been done overstates its quality.

The discipline

Expand the world snapshot (see concepts/evaluation-label) to include related components even when they are not directly involved in the failure. Datadog's rubric:

A component might be included because it belongs to the same platform, team, or monitor, or even just similarly named.

(Source: sources/2026-04-07-datadog-bits-ai-sre-eval-platform)

This provides a cheap, semi-principled way to inject real-world noise: the adjacency relationships are already present in the telemetry graph, so expanding the snapshot is a traversal, not a generation problem.
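A minimal sketch of that traversal, assuming a hypothetical component model in which each node records its platform, team, and monitors (the class and field names here are illustrative, not Datadog's):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Component:
    """A node in the telemetry graph (hypothetical shape)."""
    name: str
    platform: str
    team: str
    monitors: frozenset = frozenset()

def related(a: Component, b: Component) -> bool:
    """The rubric as adjacency: same platform, same team, a shared
    monitor, or a similar name (crudely, a shared name prefix)."""
    return (
        a.platform == b.platform
        or a.team == b.team
        or bool(a.monitors & b.monitors)
        or a.name.split("-")[0] == b.name.split("-")[0]
    )

def expand_snapshot(seeds, universe):
    """One-hop expansion: any component related to a seed joins the
    snapshot, even if it played no role in the failure."""
    return set(seeds) | {c for c in universe if any(related(c, s) for s in seeds)}
```

The point of the sketch is that `expand_snapshot` only reads relationships already recorded in telemetry; no noise has to be generated, only included.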

Why it works

  • Forces discrimination. The agent must choose what to investigate — that's the job in production.
  • Surfaces silent-context failures. E.g. the Datadog regression where auto-extracting the monitor's service name pulled in irrelevant signals: this only regresses investigations where irrelevant signals exist in context. Clean snapshots mask it; noisy snapshots catch it.
  • Closes the eval-vs-prod gap. Without noise injection, labels pass while live traffic quietly degrades, and the eval platform systematically lies about quality.

Cost

  • More signal reconstruction work per label — snapshotting the broader neighbourhood, not just the failing path.
  • Harder scenarios → lower absolute pass rate. Datadog reports an ~11% pass-rate drop and ~35% label-count drop when early narrow labels were discarded and regenerated with broader scope. Accepted because "in the long term it made our evaluations predictive of production behaviour."
  • Judges / trajectory scorers need to be robust to "agent noticed but correctly ignored" irrelevant signals. A naive scorer can punish an agent that did the right thing by not following a red herring.
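One way to avoid that failure mode is to score the signals the agent *acted on* rather than the signals that merely appeared in its context. A hedged sketch, with an assumed two-action trajectory format that is not from the source:

```python
from collections import namedtuple

# Hypothetical trajectory step: the agent either passively observes a
# signal in context or actively investigates it.
Step = namedtuple("Step", ["action", "signal"])

def score_trajectory(steps, relevant, irrelevant):
    """Credit investigating relevant signals; penalise only the active
    pursuit of red herrings. Observing an irrelevant signal and moving
    on costs nothing."""
    pursued = {s.signal for s in steps if s.action == "investigate"}
    hits = len(pursued & relevant)
    wasted = len(pursued & irrelevant)
    denom = len(relevant) + wasted
    return hits / denom if denom else 0.0
```

Under this scoring, an agent that notices a red herring but correctly declines to follow it keeps a perfect score, while one that chases it is penalised.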
