
Noise injection in evaluation

Noise injection in evaluation is the counter-intuitive discipline of making the simulated environment an agent is evaluated in messier, not cleaner — by including signals, components, and context that are unrelated to the scenario under test.

The naive mistake

The intuitive version of a replayable evaluation environment for an agent contains exactly the signals relevant to the root cause:

  • the failing pod,
  • the memory metrics,
  • the deployment events near the failure.

Anything else is noise, so it's omitted. This is wrong for exactly the reason it feels right: the agent is effectively given an open-book exam with only the relevant pages.

In production, a real incident looks like:

  • one failing pod, amid hundreds of healthy pods,
  • memory metrics for the failing service, alongside metrics for dozens of other services on the same platform, team, or monitor that happen to be fluctuating for unrelated reasons,
  • the deployment event that actually caused the regression, alongside unrelated deployment events in nearby services,
  • services with similar names that an inattentive agent can confuse,
  • red herrings from other teams' in-flight incidents or capacity tests.

An SRE investigator's real skill is sifting. Evaluating the agent on a scenario where the sifting has already been done overstates its quality.

The discipline

Expand the world snapshot (see concepts/evaluation-label) to include related components even when they are not directly involved in the failure. Datadog's rubric:

A component might be included because it belongs to the same platform, team, or monitor, or even just similarly named.

(Source: sources/2026-04-07-datadog-bits-ai-sre-eval-platform)

This provides a cheap, semi-principled way to inject real-world noise: the adjacency relationships are already present in the telemetry graph, so expanding the snapshot is a traversal, not a generation problem.
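A minimal sketch of that traversal, assuming a hypothetical component model in which each node records its platform, team, and monitors (the class and field names here are illustrative, not Datadog's):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Component:
    """A node in the telemetry graph (hypothetical shape)."""
    name: str
    platform: str
    team: str
    monitors: frozenset = frozenset()

def related(a: Component, b: Component) -> bool:
    """The rubric as adjacency: same platform, same team, a shared
    monitor, or a similar name (crudely, a shared name prefix)."""
    return (
        a.platform == b.platform
        or a.team == b.team
        or bool(a.monitors & b.monitors)
        or a.name.split("-")[0] == b.name.split("-")[0]
    )

def expand_snapshot(seeds, universe):
    """One-hop expansion: any component related to a seed joins the
    snapshot, even if it played no role in the failure."""
    return set(seeds) | {c for c in universe if any(related(c, s) for s in seeds)}
```

The point of the sketch is that `expand_snapshot` only reads relationships already recorded in telemetry; no noise has to be generated, only included.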

Why it works

  • Forces discrimination. The agent must choose what to investigate — that's the job in production.
  • Surfaces silent-context failures. E.g. the Datadog regression where auto-extracting the monitor's service name pulled in irrelevant signals: this only regresses investigations where irrelevant signals exist in context. Clean snapshots mask it; noisy snapshots catch it.
  • Closes the eval-vs-prod gap. Without noise injection, labels pass while live traffic quietly degrades, and the eval platform systematically lies about quality.

Cost

  • More signal reconstruction work per label — snapshotting the broader neighbourhood, not just the failing path.
  • Harder scenarios → lower absolute pass rate. Datadog reports an ~11% pass-rate drop and ~35% label-count drop when early narrow labels were discarded and regenerated with broader scope. Accepted because "in the long term it made our evaluations predictive of production behaviour."
  • Judges / trajectory scorers need to be robust to "agent noticed but correctly ignored" irrelevant signals. A naive scorer can punish an agent that did the right thing by not following a red herring.
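One way to avoid that failure mode is to score the signals the agent *acted on* rather than the signals that merely appeared in its context. A hedged sketch, with an assumed two-action trajectory format that is not from the source:

```python
from collections import namedtuple

# Hypothetical trajectory step: the agent either passively observes a
# signal in context or actively investigates it.
Step = namedtuple("Step", ["action", "signal"])

def score_trajectory(steps, relevant, irrelevant):
    """Credit investigating relevant signals; penalise only the active
    pursuit of red herrings. Observing an irrelevant signal and moving
    on costs nothing."""
    pursued = {s.signal for s in steps if s.action == "investigate"}
    hits = len(pursued & relevant)
    wasted = len(pursued & irrelevant)
    denom = len(relevant) + wasted
    return hits / denom if denom else 0.0
```

Under this scoring, an agent that notices a red herring but correctly declines to follow it keeps a perfect score, while one that chases it is penalised.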
