PATTERN Cited by 2 sources
Snapshot-replay agent evaluation¶
Capture snapshots of production-state inputs (queries, tool responses, intermediate state) from real agent runs, then replay them through candidate agent configurations (new prompts, new tool sets, new model versions). Score the replayed outputs with an LLM judge against the reference. The result is a regression harness that works in spite of LLM non-determinism.
Intent¶
Classical unit tests assume deterministic outputs. LLM agents produce different outputs for the same inputs across runs, and often across model revisions or prompt tweaks. Before this harness exists:
- Teams rely on manual eyeballing of diff tables → slow, low coverage, subjective.
- Prompt tweaks that seem to help on sample queries silently regress on edge cases.
- Tool additions that look like pure wins introduce tool-selection errors on nearby queries.
Snapshot-replay gives quantitative signal: "does this change score better/worse than the last release candidate on N saved scenarios?"
Mechanism¶
- Instrument the agent to record inputs, each tool call + response, and the final output. In practice: the existing tracing (e.g. MLflow / OpenTelemetry spans) is the snapshot format.
- Curate a snapshot set. A mix of representative production traces — ideally including known-good resolutions, near-miss incidents, and edge cases that surfaced real regressions historically.
- Replay each snapshot through a candidate configuration. The agent's tool calls may execute for real (expensive) or be replayed from recorded responses (cheap, what the Databricks post implies).
- Score via concepts/llm-as-judge. The judge is given (a) the input, (b) the candidate's output, and (c) a rubric (accuracy, helpfulness, safety) plus optionally the reference output. It returns a numeric score + justification.
- Aggregate and compare scores across configurations. Regression = mean / percentile drop on a fixed set.
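The mechanism above can be sketched as a small harness. This is a minimal illustration, not any vendor's implementation: `Snapshot`, `replay`, `evaluate`, and `regressed` are hypothetical names, the agent and judge are passed in as plain callables, and tool calls are served from the recording (the cheap-replay variant).

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Snapshot:
    """One recorded production run: input, tool responses, final output."""
    query: str
    tool_responses: dict[str, str]   # tool name -> recorded response
    reference_output: str

def replay(snapshot: Snapshot, agent_fn):
    """Run the candidate agent, serving its tool calls from the recording
    instead of executing them live."""
    def stub_tool(name: str) -> str:
        return snapshot.tool_responses[name]
    return agent_fn(snapshot.query, stub_tool)

def evaluate(snapshots: list[Snapshot], agent_fn, judge_fn) -> float:
    """Replay every snapshot through the candidate and let the judge score
    each output against the reference; return the mean score."""
    scores = [
        judge_fn(s.query, replay(s, agent_fn), s.reference_output)
        for s in snapshots
    ]
    return mean(scores)

def regressed(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """Flag a regression when the candidate's mean score drops below the
    baseline by more than the tolerance."""
    return candidate < baseline - tolerance
```

In practice the judge would be an LLM call returning a rubric score plus justification, and the comparison would look at percentiles per scenario class, not just the mean.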
Why it works¶
- Deterministic inputs, probabilistic outputs, quantitative scoring. The harness doesn't eliminate non-determinism — it quantifies it.
- Fast iteration. Changes become cheap to evaluate, so engineers explore more alternatives.
- Catches "improvements" that regress somewhere. A prompt tweak tuned on one query type is automatically measured on the rest.
Tradeoffs¶
- Snapshot freshness. Production drifts; snapshots age. A stale corpus can look "fine" while real traffic quietly degrades. Continuous snapshot refresh is needed.
- Replay fidelity. If tools have side effects or time-sensitive outputs, recorded tool responses may no longer match reality. This is fine for a correctness harness, bad for a live-behavior harness.
- Judge-LLM bias. Any bias in the judge skews the leaderboard. Rubrics must be specific; periodic human spot-checks recalibrate.
- Cost. Replay + judge means at least 2× inference per scenario per eval run (the candidate run plus the judge call).
- Not a safety proof. Passes here do not prove the agent won't recommend a dangerous action in production — mutating-action safety needs separate guardrails.
Refinements from Datadog's Bits AI SRE evaluation platform¶
Datadog's systems/bits-ai-sre-eval-platform materially extends the basic snapshot-replay shape on five axes, all of which generalise (Source: sources/2026-04-07-datadog-bits-ai-sre-eval-platform):
- Ground-truth / world-snapshot split as load-bearing. The snapshot unit is an evaluation label with two explicit halves: a ground truth (root cause / expected resolution, never shown to the agent) and a world snapshot (the queries that reconstruct the signals the agent would have seen in production). Storing queries, not raw telemetry, lets the world survive vendor / data-store TTLs.
- Noise must be injected into the simulated environment. Include only the signals directly tied to the root cause and you give the agent an open-book exam opened to the relevant pages. See patterns/noisy-simulated-evaluation-environment. Datadog paid for this lesson explicitly: widening early narrow labels cost ~11% in pass rate and ~35% of labels short-term, in exchange for evals that are predictive of production behaviour.
- Trajectory scoring, not just final-answer scoring. concepts/trajectory-evaluation scores how the agent investigated — depth, telemetry surfaced, distance to correct answer — once label quality is postmortem-grade. This gives partial credit and surfaces reasoning-shape regressions that terminal scoring misses.
- pass@k, not pass@1. concepts/pass-at-k separates agent capability from sampling stability. Low pass@1 with high pass@k (the agent can solve it, but unreliably) is a different problem from low pass@k (the agent cannot solve it), and the remedies differ.
- Labels are generated from the product surface, not hand-crafted. patterns/product-feedback-to-eval-labels turns every user interaction into a candidate label; scale ties to adoption. patterns/agent-assisted-label-validation uses the agent itself to validate candidates once alignment with human judges clears a bar, shifting humans from RCA-assembly to RCA-refinement (validation time ↓ >95% in one week at Datadog).
- One-way-door telemetry. Telemetry you don't snapshot now is telemetry you can't snapshot later — see concepts/telemetry-ttl-one-way-door. This forces over-snapshotting as a design discipline.
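The pass@k point above can be made concrete with the standard unbiased estimator: given n sampled runs of which c passed, estimate the probability that at least one of k draws (without replacement) is a pass. A minimal sketch — the source does not say Datadog computes it exactly this way:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n runs with c passes: the probability
    that at least one of k samples drawn without replacement is a pass,
    computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws, so one must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Comparing `pass_at_k(n, c, 1)` against `pass_at_k(n, c, k)` per scenario is what separates a stability problem from a capability gap.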
Seen in¶
- sources/2025-12-03-databricks-ai-agent-debug-databases — Databricks' Storex team: "How do we prove the agent is getting better without introducing regressions? ... we created a validation framework that captures snapshots of the production state and replays them through the agent, using a separate 'judge' LLM to score the responses for accuracy and helpfulness as we modify the prompts and tools." Referenced against MLflow 3's judges primitive.
- sources/2026-04-07-datadog-bits-ai-sre-eval-platform — Datadog's retrospective on the offline evaluation platform for Bits AI SRE. Canonical source for the ground-truth / world-snapshot split, noise injection, trajectory scoring, pass@k, product-feedback-driven label creation, and agent-assisted label validation. Runs against tens of thousands of scenarios weekly; used to evaluate new models (e.g. Claude Opus 4.5) across domains within days of availability.
Related¶
- concepts/llm-as-judge — the scoring primitive inside the harness.
- systems/bits-ai-sre-eval-platform — Datadog's productionised instance of this pattern.
- systems/mlflow — Databricks-stack host for judges + tracing.
- systems/storex — Databricks production consumer of this pattern.
- concepts/evaluation-label, concepts/trajectory-evaluation, concepts/pass-at-k, concepts/noise-injection-in-evaluation, concepts/telemetry-ttl-one-way-door — refinements the Datadog platform adds to the basic pattern.
- patterns/product-feedback-to-eval-labels, patterns/agent-assisted-label-validation, patterns/noisy-simulated-evaluation-environment — companion patterns that make snapshot-replay scale.
- patterns/tool-decoupled-agent-framework — the iteration surface that creates the need for regression harnessing.