# Bits AI SRE evaluation platform
Datadog's retrospective on building the offline, replayable evaluation platform for Bits AI SRE, its autonomous incident-investigation agent. Not the agent itself — the infrastructure that makes the agent's behaviour observable, measurable, and repeatable across changes.
## Summary
Bits AI SRE reasons across metrics / logs / traces / infra metadata / network telemetry / monitor config to triage and remediate production incidents. Early in development the team shipped features that helped on hand-picked internal test cases and quietly regressed many other scenarios — e.g. auto-extracting the monitor's service name into Bits' initial context pulled in irrelevant signals and subtly confused reasoning on unrelated investigations. With no representative evaluation set, the regressions only surfaced as widespread internal investigation misses.
Two unsuccessful prior attempts bracket the problem:
- Per-tool isolated testing — assumed compositional correctness (each tool correct → agent correct). False. Most real failures come from how Bits chains tools and reasons across their outputs — e.g. valid signals from multiple tools combined incorrectly into a wrong-component attribution.
- Re-running live production investigations — doesn't scale: results not aggregated, environments drift, telemetry expires, no replay.
The answer is an offline system that replays realistic scenarios across Datadog's signals and scores agent behaviour in a controlled, repeatable way. Off-the-shelf eval frameworks assume clean inputs and static test sets; they break against agents that reason over live production telemetry.
Two components, built in tandem:
- A curated label set defining representative investigation scenarios.
- An orchestration platform that executes Bits against the labels and scores the results.
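A minimal sketch of how one such label might be represented (all names and fields are hypothetical — the post describes the structure, ground truth plus world snapshot, but not a schema):

```python
from dataclasses import dataclass, field

@dataclass
class TelemetryQuery:
    # A query the agent could run against a signal source,
    # not the raw telemetry itself.
    source: str  # e.g. "metrics", "logs", "k8s_events"
    query: str   # e.g. "avg:container.memory.usage{service:checkout}"

@dataclass
class EvaluationLabel:
    # Ground truth: the actual root cause; never exposed to the agent.
    ground_truth: str
    # World snapshot: the signals that existed when the issue occurred,
    # including noisy, unrelated components (red herrings).
    world_snapshot: list[TelemetryQuery] = field(default_factory=list)
    # Segmentation axes used by the orchestration platform.
    segments: dict[str, str] = field(default_factory=dict)

label = EvaluationLabel(
    ground_truth="Kubernetes pod OOM killed",
    world_snapshot=[
        TelemetryQuery("metrics", "avg:container.memory.usage{service:checkout}"),
        TelemetryQuery("logs", "service:checkout status:error"),
        # Red herring: unrelated component on the same platform.
        TelemetryQuery("metrics", "avg:container.cpu.usage{service:payments}"),
    ],
    segments={"technology": "kubernetes", "problem_type": "oom", "difficulty": "medium"},
)
```

The key invariant is that only `world_snapshot` (and `segments`, for routing) is visible to the agent under evaluation; `ground_truth` exists solely for the scorer.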
## Key takeaways
- An evaluation label has two parts: ground truth + world snapshot. Ground truth = the actual root cause (e.g. "Kubernetes pod OOM killed"). World snapshot = the signals that existed at the moment the issue occurred — the telemetry queries the agent would need (where to find memory metrics, container logs, deployment events), not the raw telemetry itself. The agent never sees the ground truth; it only has access to the world snapshot, mirroring the production constraint (Source: article body).
- Simulated worlds must be noisy, not clean. Snapshotting only signals directly tied to the root cause gives the agent an "open-book exam with only the relevant pages" — it aces evaluation and underperforms in production. Snapshots therefore expand to include unrelated components on the same platform / team / monitor / naming cluster even though they're not part of the failure, injecting real-world red herrings into the evaluation. Without this, eval results overstate agent quality vs. live investigations.
- Label creation scaled by embedding it in the product. Hand-crafted labels burnt engineering hours faster than they produced coverage. The evolution:
  - Manual internal labelling from Datadog's own alerts.
  - Customer feedback as label source — every user interaction with Bits (thumbs-up / thumbs-down / free-text feedback) becomes a candidate label: the feedback + investigation telemetry → ground-truth root cause + world-snapshot queries. Label creation rate increased by an order of magnitude.
- Bits validates its own labels — the same agent that investigates prod now assists the label pipeline: it aggregates signals, derives causal relationships, turns ambiguous feedback ("it was slow") into precise statements ("elevated latency in service X"), and builds a full root-cause chain from problem statement to underlying cause. Alignment studies with human judges gated the trust transition. Validation time per label dropped >95% in a single week. The human role shifted from assembling RCAs from raw signals to validating/refining agent output.
- Label quality is itself scored. Each generated label gets a confidence score across thoroughness / specificity / accuracy dimensions; anything below threshold is flagged for human review. Labels that hold up under a "5 Whys" postmortem-style RCA are ~30% higher quality than earlier labels. Higher-quality labels enable trajectory evaluation rather than just final-answer scoring: how close did the agent get, did it investigate deeply, did it surface valuable telemetry?
- Telemetry expiry is a one-way door. Once the underlying telemetry's TTL passes, the structure and signal relationships can't be reconstructed. When early labels turned out to be too narrow (missing noise), regenerating them with broader scope was only possible for labels whose source telemetry hadn't expired yet. Short-term cost: pass rate dropped ~11% and label count dropped ~35% as narrow labels were discarded. Long-term gain: evaluations became predictive of production behaviour. See concepts/telemetry-ttl-one-way-door.
- The orchestration platform segments labels across multiple axes — technology, problem type, monitor type, investigation difficulty — so engineers evaluate changes against the scenarios that matter for their workstream without interfering with others. Results are stored per-scenario per-run and tracked in Datadog dashboards and Datadog LLM Observability so performance can be compared across agent versions over time. The full evaluation set runs weekly; targeted runs happen during feature iteration. Regressions alert to Slack on significant deviation.
- pass@k as a quality dimension. For a given scenario, over k independent attempts, does the agent succeed on at least one? The non-determinism of LLM agents makes single-attempt pass/fail noisy; pass@k separates capability from luck.
- Frontier-driven label expansion. "The labels that matter most aren't the ones Bits passes. They're the ones it fails." Segmented failure analysis identifies weak domains → expand the label set in those domains, specifically mining negative feedback and hard scenarios. Labels are sometimes created for capabilities the agent doesn't yet support, so the eval suite is built alongside the feature rather than retrofitted.
- New models are evaluated upfront, not discovered in production. When Claude Opus 4.5 became available, Datadog ran it against the full label set within days and identified which investigation types improved and which regressed. Without the platform, those shifts would have been discovered after rollout.
- The platform generalises beyond one agent. Agentic label collection now extends into everyday SWE workflows at Datadog — internal incidents / issues / alerts become evaluation labels — bootstrapping other teams (APM, Database Monitoring) with representative label sets and eval infrastructure from day one.
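The noise-injection idea above — expanding a snapshot beyond root-cause signals — can be sketched as a selection pass over a component catalogue. The matching rules (shared platform, team, or naming prefix) and all names here are assumptions for illustration, not the post's implementation:

```python
def expand_snapshot(root_components: list[str], catalogue: dict[str, dict]) -> set[str]:
    """Start from components tied to the root cause, then add unrelated
    components sharing a platform, team, or naming prefix: the red
    herrings that keep the eval from being an open-book exam."""
    selected = set(root_components)
    # Anchor metadata for each root-cause component, plus its name prefix.
    anchors = [catalogue[c] | {"prefix": c.split("-")[0]} for c in root_components]
    for name, meta in catalogue.items():
        if name in selected:
            continue
        for a in anchors:
            if (meta["platform"] == a["platform"]
                    or meta["team"] == a["team"]
                    or name.split("-")[0] == a["prefix"]):
                selected.add(name)
                break
    return selected

catalogue = {
    "checkout-api": {"platform": "k8s", "team": "payments"},
    "checkout-db":  {"platform": "k8s", "team": "storage"},   # shared naming prefix
    "billing-svc":  {"platform": "k8s", "team": "payments"},  # shared team + platform
    "legacy-batch": {"platform": "vm",  "team": "data"},      # unrelated: stays out
}
print(sorted(expand_snapshot(["checkout-api"], catalogue)))
# ['billing-svc', 'checkout-api', 'checkout-db']
```

In a real pipeline the expansion would also cap the number of noisy components per label, so snapshots stay realistic rather than exhaustive — the post does not say how broad the expansion goes.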
## Architecture (as described)
- Shared label set: ground-truth RCA + world-snapshot telemetry queries; segmented by dimension (tech / problem / monitor / difficulty).
- Orchestration layer: runs agent configs against labels at scale, in parallel, across model/config variants.
- Simulated environment per label: reconstructs investigation context from world-snapshot queries, isolates at the data layer (one env can't leak into another), expanded to include adjacent noisy components not directly involved in the failure.
- Scoring: final-answer correctness + trajectory (depth, telemetry surfaced, distance to correct answer) + pass@k.
- Reporting: Datadog dashboards + LLM Observability + an internal labelling app for centralised label metadata + Slack alerts.
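The pass@k score in the scoring bullet above is conventionally computed with the unbiased estimator: record n ≥ k attempts per scenario, count c successes, and estimate the probability that at least one of k sampled attempts succeeds. A sketch (the post names pass@k but not this formula, which is the standard one):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate over n recorded attempts with c successes:
    1 - C(n-c, k) / C(n, k), i.e. one minus the probability that all k
    sampled attempts land on failures."""
    if n - c < k:
        return 1.0  # too few failures to fill k attempts: guaranteed success
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 recorded attempts, 3 successes: the single-attempt pass rate is 0.3,
# but pass@5 is far higher -- separating capability from luck.
print(pass_at_k(10, 3, 1))  # 0.3
print(round(pass_at_k(10, 3, 5), 3))  # 0.917
```

Averaging this estimate over all scenarios in a segment gives a per-segment pass@k that is stable under agent non-determinism, unlike a single pass/fail run.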
```
┌──────────────┐ feedback + investigation telemetry ┌──────────────┐
│ Bits (prod) │ ──────────────────────────────────────▶│ Label pipe- │
│ investigates │ │ line: Bits- │
│ real incident│ │ assisted RCA │
└──────────────┘ │ + scoring + │
│ human review │
└──────┬───────┘
│ labels
▼
┌─────────────────────┐ reconstruct world-snapshot ┌──────────────┐
│ Simulated env │ ◀──────────────────────────────│ Label set │
│ (noisy, isolated) │ │ (segmented) │
└─────────┬───────────┘ └──────────────┘
│ replay ▲
▼ │
┌─────────────────────┐ score (final + trajectory + pass@k) │
│ Candidate agent │────────────────────────────────────────┘
│ config / model │ → dashboards, LLM Observability, Slack
└─────────────────────┘
```
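The reporting edge of the diagram implies a per-segment regression check between a baseline run and a candidate run. A sketch under assumed shapes and an assumed alert threshold (the post says only that significant deviations alert to Slack):

```python
from collections import defaultdict

def segment_pass_rates(results: list[tuple[str, int]]) -> dict[str, float]:
    """results: (segment, passed) pairs from one evaluation run."""
    passed, total = defaultdict(int), defaultdict(int)
    for segment, ok in results:
        total[segment] += 1
        passed[segment] += ok
    return {s: passed[s] / total[s] for s in total}

def regressions(baseline: dict[str, float],
                candidate: dict[str, float],
                threshold: float = 0.05) -> dict[str, tuple[float, float]]:
    """Segments where the candidate's pass rate dropped by more than
    `threshold` vs. the baseline -- the ones worth a Slack alert."""
    return {
        s: (baseline[s], candidate[s])
        for s in baseline
        if s in candidate and baseline[s] - candidate[s] > threshold
    }

baseline = segment_pass_rates(
    [("kubernetes", 1), ("kubernetes", 1), ("database", 1), ("database", 0)])
candidate = segment_pass_rates(
    [("kubernetes", 1), ("kubernetes", 0), ("database", 1), ("database", 0)])
print(regressions(baseline, candidate))  # {'kubernetes': (1.0, 0.5)}
```

A production version would use pass@k per segment rather than raw pass rates, and a statistical test rather than a fixed threshold, but the comparison shape is the same.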
## Numbers reported
- Bits runs against tens of thousands of scenarios per weekly full-set run.
- Validation time per label ↓ >95% after agent-assisted validation came online.
- Label quality (5-Whys-passing RCAs) ↑ ~30%.
- Early too-narrow labels: regenerating caused −11% pass rate, −35% label count (short-term). Worth it: evals now predictive of prod.
- New model (Claude Opus 4.5) fully evaluated within days of availability; per-domain improvement/regression breakdown in hand before rollout decision.
## Architectural caveats
- No latency/throughput numbers for the orchestration tier itself.
- No ground-truth disclosure mechanism — how the agent is prevented from seeing the root cause in the snapshot is asserted but not mechanically detailed.
- Data-layer isolation between simulated environments claimed but not described.
- Model-as-judge bias not discussed explicitly in the post (though the alignment studies with human judges gate trust for the label pipeline, not the eval scorer).
- Vendor-blog framing noted; architecture substantive enough to ingest on Tier-3 bar (named scaling trade-offs, quantified outcomes, reusable patterns).
## Extracted
- New system: systems/bits-ai-sre-eval-platform — the platform itself.
- New concepts: concepts/evaluation-label, concepts/trajectory-evaluation, concepts/pass-at-k, concepts/noise-injection-in-evaluation, concepts/telemetry-ttl-one-way-door.
- New patterns: patterns/product-feedback-to-eval-labels, patterns/agent-assisted-label-validation, patterns/noisy-simulated-evaluation-environment.
- Updates: systems/bits-ai-sre gains the eval-platform companion; patterns/snapshot-replay-agent-evaluation gains the ground-truth-separation + world-snapshot refinement and the noise-injection + trajectory-evaluation extensions; concepts/llm-as-judge gains the trajectory-scoring usage mode and the label-pipeline usage (not just the eval-scorer role).