Bits AI SRE evaluation platform¶
The Bits AI SRE evaluation platform is the offline, replayable test harness for Datadog's Bits AI SRE agent: not the agent itself, but the infrastructure that makes agent behaviour observable, measurable, and repeatable across changes. It was built because, in the team's words, "we had no reliable way to detect" quality shifts when a feature improved one investigation class while quietly regressing another (Source: sources/2026-04-07-datadog-bits-ai-sre-eval-platform).
Why it had to exist¶
Two prior attempts failed:
- Per-tool isolated testing. Assumed compositional correctness: each tool correct → agent correct. False. Most real failures come from how Bits chains tools and reasons across their outputs — valid signals from multiple tools composed into a wrong-component attribution.
- Re-running live investigations. Didn't scale: results not aggregated, environments drifted, telemetry expired, no replay primitive.
The forcing function was the canonical regression story: an early feature that extracted the triggering monitor's service name into Bits' initial context worked on hand-picked test cases and quietly degraded unrelated investigations by pulling in irrelevant signals. No representative evaluation set → no way to detect.
Architecture¶
Two components in tandem:
- Curated label set — representative investigation scenarios.
- Orchestration platform — runs the agent against labels at scale, scores results, tracks history.
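A sketch of how these two components might compose, assuming each label pairs a hidden ground truth with a replayable world snapshot. All names here (`agent.investigate`, `judge`, the label attributes) are hypothetical, not Datadog's API:

```python
def evaluate(agent, labels, judge, k=3):
    """Minimal harness loop: replay each label's world snapshot k times
    (independent attempts) and score every attempt against ground truth."""
    history = []
    for label in labels:
        # The agent only ever receives the snapshot, never the answer.
        attempts = [agent.investigate(label.snapshot) for _ in range(k)]
        passed = any(judge(answer, label.ground_truth) for answer in attempts)
        history.append({"ground_truth": label.ground_truth, "passed": passed})
    return history
```

The per-label loop is what makes runs aggregable and comparable across changes, unlike re-running live investigations ad hoc.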
The evaluation label¶
An evaluation label has two parts:
| Part | Content | Agent sees it? |
|---|---|---|
| Ground truth | The actual root cause (e.g. "Kubernetes pod OOM killed") | No |
| World snapshot | The signals that existed at the moment the issue occurred: queries pointing to the relevant memory metrics, container logs, and deployment events | Yes |
The agent is evaluated under the same information constraint it faces in production: access to the world snapshot, not the answer.
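A minimal sketch of this two-part structure; the field names are my assumption, not Datadog's schema:

```python
from dataclasses import dataclass, field

@dataclass
class WorldSnapshot:
    """Signals recorded at the moment the issue occurred (visible to the agent)."""
    metric_queries: list = field(default_factory=list)
    log_queries: list = field(default_factory=list)
    deployment_events: list = field(default_factory=list)

@dataclass
class EvaluationLabel:
    """Ground truth stays hidden; only the snapshot is handed to the agent."""
    ground_truth: str        # e.g. "Kubernetes pod OOM killed" -- never shown
    snapshot: WorldSnapshot  # what the agent is allowed to see

def agent_input(label: EvaluationLabel) -> WorldSnapshot:
    # Enforce the production information constraint: expose the world, not the answer.
    return label.snapshot
```

Keeping the split at the type level makes the "agent sees it?" column above a property of the harness rather than a convention.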
Simulated environments are noisy, not clean¶
Per-label simulated env, isolated at the data layer (one env's context can't affect another). Critical design choice: snapshot more than the minimal signal needed to explain the root cause, including unrelated components on the same platform / team / monitor / naming cluster. This injects real-world red herrings (patterns/noisy-simulated-evaluation-environment). Without it, "we were giving the agent an open-book exam with only the relevant pages": eval scores overstated agent quality vs. live investigations.
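One way the red-herring selection could work, a sketch under stated assumptions (the component fields and the "shared naming prefix" heuristic are illustrative, not the documented mechanism):

```python
def build_noisy_snapshot(minimal_signals, all_components, incident_component):
    """Start from the minimal signals that explain the root cause, then add
    unrelated components that share a team, platform, or naming prefix with
    the incident component -- the real-world red herrings the agent must ignore."""
    def looks_related(c):
        return (
            c["team"] == incident_component["team"]
            or c["platform"] == incident_component["platform"]
            or c["name"].split("-")[0] == incident_component["name"].split("-")[0]
        )
    red_herrings = [
        c for c in all_components
        if c["name"] != incident_component["name"] and looks_related(c)
    ]
    return {"signals": list(minimal_signals), "red_herrings": red_herrings}
```

The point of the heuristic is that plausible-but-irrelevant neighbours are exactly what defeated the "open-book exam with only the relevant pages" setup.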
Label pipeline (evolution)¶
- Manual internal labelling from Datadog's own alerts.
- Product feedback → labels: every Bits user interaction (thumbs-up/-down plus free-text feedback) becomes a candidate label. Label creation rate ↑ an order of magnitude.
- Bits validates its own labels: the same agent that investigates in prod assists the label pipeline — aggregates signals, derives causal relationships, turns "it was slow" into "elevated latency in service X", builds a full root-cause chain. Alignment studies with human judges gated trust. Validation time per label ↓ >95% in one week. Humans shifted from assembling RCAs from raw signals to validating and refining agent output.
Label confidence is scored across thoroughness / specificity / accuracy; sub-threshold labels are flagged for human review. Result: labels that hold up under a "5 Whys" postmortem → ~30% higher quality than earlier generations. Enables concepts/trajectory-evaluation (scoring how the agent investigated, not just the final answer).
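The source names the three axes but not the aggregation or the threshold; a minimal sketch assuming a plain average and an arbitrary cut-off of 0.8:

```python
def label_confidence(scores: dict) -> float:
    """Aggregate the three reported axes. Equal weighting is an assumption;
    the actual scoring function is not disclosed."""
    return (scores["thoroughness"] + scores["specificity"] + scores["accuracy"]) / 3

def needs_human_review(scores: dict, threshold: float = 0.8) -> bool:
    """Sub-threshold labels are routed to a human instead of entering the set."""
    return label_confidence(scores) < threshold
```

Whatever the real function, the gate is what lets humans validate and refine agent-built labels rather than assemble every RCA from raw signals.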
Orchestration layer¶
- Segmentation axes: technology / problem type / monitor type / investigation difficulty. Engineers iterate on segments that matter for their workstream.
- Scoring: final-answer correctness + trajectory depth + pass@k (over k independent attempts, does the agent succeed on at least one?) — robust to LLM non-determinism.
- Reporting: Datadog dashboards + Datadog LLM Observability + internal labelling-app for centralised metadata. Full set runs weekly; targeted runs during feature iteration. Significant regressions alert to Slack.
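Empirical pass@k over k recorded attempts reduces to "did any attempt succeed"; a sketch of that plus a per-segment aggregate (function names are mine):

```python
def pass_at_k(attempt_results: list) -> bool:
    """Empirical pass@k: over k independent attempts on one scenario,
    does the agent succeed on at least one?"""
    return any(attempt_results)

def segment_pass_rate(runs: dict) -> float:
    """Fraction of scenarios in a segment (e.g. one technology or monitor
    type) that pass@k -- the number engineers iterate against."""
    passed = sum(pass_at_k(results) for results in runs.values())
    return passed / len(runs)
```

Averaging pass@k per segment, rather than per attempt, is what makes the metric robust to LLM non-determinism: a scenario isn't penalised for one flaky run.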
Numbers reported¶
- Tens of thousands of scenarios per weekly full run.
- Validation time per label ↓ >95% after agent-assisted validation came online.
- Label quality (5-Whys-passing RCAs) ↑ ~30%.
- Early narrow labels regenerated with wider scope: short-term −11% pass rate, −35% label count; long-term evaluations became predictive of production behaviour (see concepts/telemetry-ttl-one-way-door — once source telemetry TTL expired, regeneration was no longer possible, so narrow labels had to be discarded).
- New model (Claude Opus 4.5) fully evaluated within days of availability; per-domain improvement/regression breakdown before rollout decision.
Relationship to other Datadog agent surfaces¶
| | systems/bits-ai-sre | systems/datadog-mcp-server | This platform |
|---|---|---|---|
| Role | Specialized SRE agent | General-purpose MCP interface for customer agents | Offline test harness for Bits AI SRE |
| Inputs | Live production telemetry | Live production telemetry | Recorded world snapshots |
| Outputs | Investigation + remediation suggestion | Data in agent-friendly formats | Scores + per-scenario deltas over time |
| Non-determinism handled by… | n/a (runs once in prod) | n/a | pass@k + trajectory scoring + replay |
Generalisation¶
The platform now extends beyond one agent:
- Other Datadog teams (APM, Database Monitoring) use the same eval infra and label-collection pipeline from day one of their agentic features.
- Internal engineering incidents / issues / alerts become evaluation labels, not just customer-facing Bits interactions.
- Per-team personalisation of reasoning loops based on per-customer evaluation signal.
Seen in¶
- sources/2026-04-07-datadog-bits-ai-sre-eval-platform — architectural retrospective; the canonical source.
Related¶
- systems/bits-ai-sre — the agent under test.
- systems/datadog-mcp-server — sibling Datadog agent-infrastructure product; different surface, same organisation's philosophy.
- patterns/snapshot-replay-agent-evaluation — the pattern this platform operationalises; this platform adds the ground-truth / world-snapshot split, noise injection, and trajectory scoring.
- concepts/llm-as-judge — trust mechanism for the label pipeline (Bits-validates-labels) and a natural fit for scoring here.
- concepts/evaluation-label, concepts/trajectory-evaluation, concepts/pass-at-k, concepts/noise-injection-in-evaluation, concepts/telemetry-ttl-one-way-door — concepts extracted from this source.
- patterns/product-feedback-to-eval-labels, patterns/agent-assisted-label-validation, patterns/noisy-simulated-evaluation-environment — patterns extracted from this source.