CONCEPT Cited by 1 source
Production traces as evaluation substrate¶
Production traces as evaluation substrate is the architectural property of using durable, queryable production traces as the source of truth for evaluation datasets, instead of (or in addition to) hand-curated synthetic test suites. The argument: prompts from real user traffic represent the scenarios the agent must handle more faithfully than synthetic test cases can, and durable storage of those traces makes them harvestable on demand.
Canonical statement (Databricks, 2026-05-22)¶
"MLflow allows us to run evaluations against an evaluation dataset, applying built-in or custom judges to score response quality. One effective approach is to bootstrap this dataset from real traces. Because these prompts originate from actual user interactions, they better represent the scenarios your agent must handle compared to purely synthetic test cases."
"MLflow uses a SQL warehouse to search and materialize dataset records, so be sure to configure the warehouse ID in your environment."
"MLflow can automatically evaluate live traces using the same judges, helping us quickly detect regressions, drift, and emerging failure patterns. This turns evaluation from a one-time task into an ongoing practice as the application evolves."
— Source: sources/2026-05-22-databricks-observability-any-agent-anywhere-otel-unity-catalog
Why production traces beat synthetic test cases¶
| Property | Synthetic test cases | Production traces |
|---|---|---|
| Distribution match | Author's mental model of usage | The actual usage distribution |
| Coverage of edge cases | Limited to what the author anticipates | Covers the long tail organically |
| Update cadence | Manual; lags real usage | Continuous; reflects current usage shape |
| Adversarial inputs | Hard to anticipate | Captured if they happen in production |
| Tool-call shape | Re-implemented in tests | Exact production tool-call sequences preserved |
The structural insight: evaluation dataset quality is bottlenecked by the curator's imagination unless you bootstrap from reality.
What enables the substrate property¶
For production traces to serve as evaluation substrate, three properties must hold:
- Durable storage: traces must persist beyond the APM-typical "hot for hours, gone after days" retention. Lakehouse-resident trace tables (systems/uc-otel-trace-tables) provide this — "Previous limits on traces per experiment are no longer applicable".
- Queryable schema: traces must be selectable / filterable to materialize into an evaluation dataset. UC OTel trace tables expose the `_otel_spans` and `_trace_unified` views via SQL.
- Stable instrumentation boundary: the trace shape must be portable enough that prod traces from yesterday remain comparable to today's instrumentation; concepts/instrumentation-storage-decoupling via OTel provides this.
Without all three, prod-traces-as-eval is impractical: ephemeral retention forces re-running scenarios; non-queryable storage forces manual scripting; unstable instrumentation makes historical traces incompatible with current judges.
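The queryable-schema requirement can be illustrated with a small sketch. It uses Python's built-in `sqlite3` as a stand-in for a SQL warehouse. The `spans` table, its columns, and the `materialize_eval_records` helper are hypothetical; they do not reflect the actual UC OTel table schema or any MLflow API.

```python
import sqlite3

# Hypothetical, simplified trace table; real UC OTel tables have a richer schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spans (trace_id TEXT, start_date TEXT, status TEXT, "
    "prompt TEXT, response TEXT)"
)
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?, ?, ?)",
    [
        ("t1", "2026-05-01", "OK", "Reset my password", "Here is how..."),
        ("t2", "2026-05-02", "ERROR", "Cancel my order", ""),
        ("t3", "2026-05-20", "OK", "Where is my invoice?", "You can find it..."),
    ],
)


def materialize_eval_records(conn, since, until):
    """Select traces in [since, until) and shape them as evaluation records."""
    rows = conn.execute(
        "SELECT trace_id, prompt, response FROM spans "
        "WHERE start_date >= ? AND start_date < ? ORDER BY start_date",
        (since, until),
    ).fetchall()
    return [
        {"trace_id": tid, "inputs": {"prompt": p}, "outputs": {"response": r}}
        for tid, p, r in rows
    ]


records = materialize_eval_records(conn, "2026-05-01", "2026-05-10")
print([r["trace_id"] for r in records])
```

The time window is an explicit parameter, so the same query can rebuild the dataset over any retained period. That only works if the traces are still stored and still share a schema.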
The closed loop¶
    production traces ──► UC OTel trace tables (durable, queryable)
                                              │
                                              ▼
                             SQL warehouse: search + materialize
                                              │
                                              ▼
                                 evaluation dataset records
                                              │
                                  ┌───────────┼───────────┐
                                  │           │           │
                                  ▼           ▼           ▼
                         dev evaluation  prod monitor  drift alerts
                          (pre-release)  (continuous)  (regression)
                                  │           │           │
                                  └───────────┴───────────┘
                                              │
                                              ▼
                                      MLflow LLM judges
                               (built-in + custom guidelines)
Two evaluation modes from the same substrate¶
| Mode | When | Input | Property |
|---|---|---|---|
| Development eval | Pre-release | Bootstrapped historical-prod-traces dataset | Validates behaviour before deployment; baseline for regression detection |
| Production monitoring | Continuous, post-release | Live trace stream | Detects regressions / drift / emerging failure patterns "as the application evolves" |
The structural payoff: the same judge code, the same trace schema, the same UC tables back both modes. There is no "dev eval pipeline" and "prod eval pipeline" to keep in sync — they are the same pipeline running over different time windows.
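The "same pipeline, different time windows" idea can be sketched as follows. This is a minimal illustration, not MLflow's API: the `Trace` record, the `day` field, `toy_judge` (a deterministic rule standing in for an LLM judge), and `evaluate_window` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Trace:
    trace_id: str
    day: int        # hypothetical: day index used only for windowing
    prompt: str
    response: str


def toy_judge(trace: Trace) -> bool:
    """Stand-in for an LLM judge: pass non-empty responses without 'error'."""
    return bool(trace.response) and "error" not in trace.response.lower()


def evaluate_window(traces: Iterable[Trace], judge: Callable[[Trace], bool],
                    start_day: int, end_day: int) -> dict:
    """Score every trace whose day falls in [start_day, end_day)."""
    verdicts = [judge(t) for t in traces if start_day <= t.day < end_day]
    rate: Optional[float] = sum(verdicts) / len(verdicts) if verdicts else None
    return {"n": len(verdicts), "pass_rate": rate}


traces = [
    Trace("t1", 3, "Reset my password", "Sure, here you go"),
    Trace("t2", 10, "Cancel my order", "Internal error"),
    Trace("t3", 12, "Where is my invoice?", "It is in your account"),
    Trace("t4", 30, "Change my email", "Done"),
    Trace("t5", 30, "Refund status?", ""),
]

# Same function, same judge; only the time window differs.
dev = evaluate_window(traces, toy_judge, 0, 30)    # historical window (dev eval)
prod = evaluate_window(traces, toy_judge, 30, 31)  # latest window (monitoring)
print(dev, prod)
```

The only difference between the two calls is the window bounds, which is the property the paragraph above describes.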
Why this is a non-trivial design choice¶
Many evaluation stacks treat dev and prod as separate concerns:
- Dev eval runs against a static curated test set, scored offline.
- Prod monitoring uses operational metrics (latency, error rate, token usage) without LLM-judge scoring.
The unified substrate collapses this gap. The cost: LLM-judge scoring runs continuously, so judge spend scales with production traffic. The benefit: regressions surface in production in the same vocabulary as dev evaluation. Because the judges and the trace schema are shared, a change in scores is much easier to attribute to the application rather than to a change in the evaluation itself.
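A regression check over shared judge verdicts might look like the sketch below. It is an assumption-laden illustration: `pass_rate`, `regression_alert`, and the `max_drop` threshold are hypothetical, and a real setup would also account for sample size and statistical noise.

```python
def pass_rate(verdicts):
    """Fraction of True verdicts; expects a non-empty list of booleans."""
    return sum(verdicts) / len(verdicts)


def regression_alert(baseline_verdicts, live_verdicts, max_drop=0.10):
    """Flag when the live pass rate falls more than max_drop below baseline."""
    base = pass_rate(baseline_verdicts)
    live = pass_rate(live_verdicts)
    return {"baseline": base, "live": live, "alert": (base - live) > max_drop}


# Verdicts from the same judge: dev baseline vs. a recent production window.
baseline = [True] * 9 + [False]       # 90% pass
degraded = [True] * 7 + [False] * 3   # 70% pass
healthy = [True] * 9 + [False]        # 90% pass

print(regression_alert(baseline, degraded)["alert"])
print(regression_alert(baseline, healthy)["alert"])
```

Both lists are produced by the same judge, so the comparison is between two measurements on one scale rather than between two different evaluation pipelines.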
Caveats¶
- Privacy. Production traces contain user-supplied prompts, which often include PII. Using them as evaluation inputs requires the same column-masking / row-filtering posture that governs the trace tables generally; see systems/unity-catalog.
- Sampling vs full retention. "Bootstrap" implies sampling from prod traces — the post does not specify how to choose the sample, and biased sampling (e.g. only sampling errored traces) produces a biased eval dataset.
- Distribution drift between bootstrap and use. A dataset bootstrapped six months ago may no longer reflect current usage. Continuous monitoring with the same judges addresses this on the prod side, but the dev eval side needs explicit refresh discipline.
- Judge quality is the limiting factor. The post offers "built-in or custom judges", but if the judge is wrong, the eval is wrong. The 2026-05-13 Claroty CSAF post (separate ingest) explicitly mandates pass/fail/unknown ternary judges to mitigate judge error; this 2026-05-22 post does not specify any such discipline.
- Prod-traces-as-substrate doesn't replace adversarial testing. Some categories of failure (jailbreaks, prompt-injection, social-engineering) are rarely surfaced in normal user traffic — explicit adversarial test sets remain necessary as a complement.
- Cost. Continuous LLM-judge scoring on every prod trace can be expensive; teams typically sample. The post does not address sampling strategy.
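The sampling caveats above can be made concrete with a sketch of seeded, stratified sampling. It draws the same fraction from every stratum, so the sample keeps the production mix instead of over-representing, say, errored traces. The post does not prescribe this approach; `stratified_sample`, the `status` field, and the 10% fraction are hypothetical choices.

```python
import random
from collections import defaultdict


def stratified_sample(traces, key, fraction, seed=0):
    """Sample the same fraction from every stratum so the mix is preserved."""
    rng = random.Random(seed)  # seeded so the dataset can be rebuilt identically
    strata = defaultdict(list)
    for t in traces:
        strata[key(t)].append(t)
    sample = []
    for name in sorted(strata):
        group = strata[name]
        k = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical production window: 90 successful traces, 10 errored ones.
traces = [{"trace_id": f"ok-{i}", "status": "OK"} for i in range(90)]
traces += [{"trace_id": f"err-{i}", "status": "ERROR"} for i in range(10)]

sample = stratified_sample(traces, key=lambda t: t["status"], fraction=0.1)
print(len(sample), sum(t["status"] == "ERROR" for t in sample))
```

The same helper could limit how many live traces are sent to continuous judge scoring, which bounds cost while keeping the scored subset representative.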
Relationship to other concepts¶
- Substrate for patterns/bootstrap-eval-dataset-from-production-traces — the canonical pattern that operationalises this concept.
- Composes with concepts/llm-as-judge — judges are the scoring primitive that consumes the substrate.
- Enabled by concepts/lakehouse-native-observability — durable, queryable trace storage is the prerequisite.
- Sibling to patterns/snapshot-replay-agent-evaluation at Databricks: that pattern uses synthetic-or-curated snapshots; this concept uses prod traces directly.
Seen in¶
- sources/2026-05-22-databricks-observability-any-agent-anywhere-otel-unity-catalog — first wiki disclosure; the "bootstrap this dataset from real traces" claim, the "better represent the scenarios" argument, and the "automatically evaluate live traces using the same judges" continuous monitoring framing are all named here.
Related¶
- concepts/llm-as-judge — scoring primitive.
- concepts/observability — parent concept.
- concepts/single-sink-telemetry-architecture — ingest-side enabler.
- concepts/instrumentation-storage-decoupling — stable-instrumentation enabler.
- concepts/lakehouse-native-observability — durable-storage enabler.
- systems/mlflow — evaluation surface.
- systems/mlflow-otel-tracing — instrumentation companion.
- systems/uc-otel-trace-tables — substrate-storage system.
- systems/zerobus-ingest — substrate-ingest engine.
- patterns/bootstrap-eval-dataset-from-production-traces — canonical operationalising pattern.
- patterns/llm-judge-as-inline-pipeline-stage — sibling pattern (judge applied during ETL).
- patterns/telemetry-to-lakehouse — broader pattern that makes the substrate property tractable.
- companies/databricks