PATTERN Cited by 1 source

Noisy simulated evaluation environment

When replaying an evaluation label against a candidate agent, reconstruct a simulated world that is deliberately noisy — populated with signals, components, metrics, and context that are unrelated to the scenario's root cause, but adjacent in the real telemetry graph.

Intent

A snapshot-replay harness whose simulated environment contains only the signals directly tied to the root cause is the equivalent of "an open-book exam with only the relevant pages." The agent scores well because the signal-to-noise ratio is artificially inflated, so production regressions stay invisible. This is the standard failure mode of naïve offline eval.

Mechanism

  1. Start from the world-snapshot queries recorded on the label.
  2. Expand the snapshot by traversing the telemetry graph to pull in adjacent components not directly involved in the failure:
     • same platform / same cluster
     • same team / same monitor
     • similar name (name-similarity bait)
     • temporally co-occurring unrelated incidents, deployments, and capacity events
  3. Reconstruct the signals for the expanded set inside the simulated environment, at the same fidelity as the directly relevant signals.
  4. Isolate per-label: one simulated environment per label replay, with a data-layer guarantee that context from one label can't bleed into another.
  5. Replay the candidate agent against the noisy environment; score as usual (concepts/trajectory-evaluation + pass@k + final-answer).
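The expansion step above can be sketched in a few lines of Python. Everything here is illustrative: `Component`, its adjacency fields, `similar_name`, and the 0.6 similarity threshold are assumptions for the sketch, not the actual telemetry-graph schema.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass(frozen=True)
class Component:
    name: str
    platform: str
    team: str

def similar_name(a: str, b: str, threshold: float = 0.6) -> bool:
    # Name-similarity bait: components whose names merely look alike.
    return SequenceMatcher(None, a, b).ratio() >= threshold

def expand_snapshot(seed: set, universe: set) -> set:
    """Widen a label's seed components with adjacent-but-unrelated ones."""
    noisy = set(seed)
    for comp in seed:
        for other in universe - seed:
            if (other.platform == comp.platform      # same platform / cluster
                    or other.team == comp.team       # same team
                    or similar_name(other.name, comp.name)):
                noisy.add(other)
    return noisy

# One failing component, plus adjacency bait from the wider telemetry graph.
seed = {Component("checkout-api", "k8s-east", "payments")}
universe = seed | {
    Component("checkout-worker", "gcp",      "platform"),  # similar name
    Component("search-api",      "k8s-east", "search"),    # same platform
    Component("billing-db",      "aws",      "payments"),  # same team
    Component("unrelated-cron",  "azure",    "data"),      # no adjacency
}
noisy = expand_snapshot(seed, universe)  # 4 of 5 components make it in
```

Per-label isolation (step 4) would then build one such `noisy` set per label, with nothing shared between replays.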

Why it works

  • Forces discrimination. Sifting relevant signal from noise is the agent's actual job; the eval must test sifting.
  • Exposes silent-context regressions. Feature changes that pull in irrelevant context (e.g. auto-extracting a monitor's service name into the agent's initial context) only regress when there's context to pull in. Clean snapshots mask these; noisy snapshots catch them — this is the canonical Datadog regression this platform was built to catch.
  • Closes the eval↔prod gap. Without noise injection, eval scores are systematically optimistic relative to production; with it, scores track production.
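The silent-context regression described above is easy to demonstrate with a toy. The feature and snapshot shapes here are hypothetical; the point is that the regression only manifests when the snapshot contains noise to pull in.

```python
def build_initial_context(snapshot_monitors):
    # Hypothetical feature under test: auto-extract every monitor's
    # service name into the agent's initial context.
    return sorted({m["service"] for m in snapshot_monitors})

clean_snapshot = [{"service": "checkout-api"}]                 # root cause only
noisy_snapshot = clean_snapshot + [{"service": "search-api"},  # adjacent,
                                   {"service": "billing-db"}]  # irrelevant

# Against a clean snapshot the feature looks harmless; the noisy snapshot
# reveals that it front-loads irrelevant services into the agent's context.
clean_ctx = build_initial_context(clean_snapshot)
noisy_ctx = build_initial_context(noisy_snapshot)
irrelevant = set(noisy_ctx) - set(clean_ctx)
```

An eval suite replaying only clean snapshots would pass this feature; one with noisy snapshots flags the extra context immediately.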

Costs

  • Short-term regression in headline numbers. Datadog saw an ~11% drop in pass rate and a ~35% drop in label count when they regenerated early narrow labels with wider scope. Narrow labels whose source telemetry had already expired were unrecoverable, a one-way door.
  • Harder trajectory scoring. An agent that correctly ignored a red herring must be rewarded, not punished. Scorers / judges need to be rich enough to distinguish "acted on an irrelevant signal" (bad) from "surfaced a signal worth noting but correctly declined to act on it" (good).
  • More expensive per replay. More signals means more data to reconstruct in the simulated env, larger agent context during replay, higher per-eval inference cost.
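One way to make trajectory scoring reward correct ignoring is to treat "acted on a red herring" and "declined to act on a red herring" as separate rubric terms. A minimal sketch; the term names and weights are illustrative, not the platform's actual rubric:

```python
def score_trajectory(acted_on: set, relevant: set, red_herrings: set) -> float:
    """Score a replay so that ignoring bait is rewarded, not punished."""
    recall  = len(relevant & acted_on) / max(len(relevant), 1)       # found & used
    chased  = len(red_herrings & acted_on) / max(len(red_herrings), 1)
    ignored = len(red_herrings - acted_on) / max(len(red_herrings), 1)
    return recall - 0.5 * chased + 0.25 * ignored

relevant = {"db-cpu-saturation"}
red_herrings = {"search-api-errors", "unrelated-deploy"}

disciplined = score_trajectory({"db-cpu-saturation"},
                               relevant, red_herrings)
distracted  = score_trajectory({"db-cpu-saturation", "unrelated-deploy"},
                               relevant, red_herrings)
# Both agents found the root cause, but the disciplined one outscores
# the one that also chased bait.
```

Note the asymmetry: chasing bait costs more than ignoring it earns, so the scorer pressures agents toward discrimination rather than toward surfacing everything.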

Pre-requisites

  • The underlying telemetry graph must expose adjacency relationships (platform / team / monitor / name) cheaply enough to traverse at label-generation time.
  • Label storage must accommodate larger snapshots.
  • Judges / scorers need rubrics that reward correct ignoring, not only correct attention.

Seen in

  • sources/2026-04-07-datadog-bits-ai-sre-eval-platform — canonical case in the Bits AI SRE evaluation platform. "The most counterintuitive thing we learned was that our simulated worlds need to be messy… in production, Bits operates in environments full of unrelated services, background errors, and tangential signals. To reflect that reality, we capture more than the minimal signal needed to explain the issue."