CONCEPT Cited by 1 source

Production traces as evaluation substrate

Production traces as evaluation substrate is the architectural property of using durable, queryable production traces as the source of truth for evaluation datasets, instead of (or in addition to) hand-curated synthetic test suites. The argument: prompts from real user traffic represent the scenarios the agent must actually handle more faithfully than synthetic test cases can, and durable storage of those traces makes them harvestable on demand.

Canonical statement (Databricks, 2026-05-22)

"MLflow allows us to run evaluations against an evaluation dataset, applying built-in or custom judges to score response quality. One effective approach is to bootstrap this dataset from real traces. Because these prompts originate from actual user interactions, they better represent the scenarios your agent must handle compared to purely synthetic test cases."

"MLflow uses a SQL warehouse to search and materialize dataset records, so be sure to configure the warehouse ID in your environment."

"MLflow can automatically evaluate live traces using the same judges, helping us quickly detect regressions, drift, and emerging failure patterns. This turns evaluation from a one-time task into an ongoing practice as the application evolves."

— Source: sources/2026-05-22-databricks-observability-any-agent-anywhere-otel-unity-catalog

Why production traces beat synthetic test cases

| Property | Synthetic test cases | Production traces |
|---|---|---|
| Distribution match | Author's mental model of usage | The actual usage distribution |
| Coverage of edge cases | Limited to what the author anticipates | Covers the long tail organically |
| Update cadence | Manual; lags real usage | Continuous; reflects current usage shape |
| Adversarial inputs | Hard to anticipate | Captured if they happen in production |
| Tool-call shape | Re-implemented in tests | Exact production tool-call sequences preserved |

The structural insight: evaluation dataset quality is bottlenecked by the curator's imagination unless you bootstrap from reality.

What enables the substrate property

For production traces to serve as evaluation substrate, three properties must hold:

  1. Durable storage: traces must persist beyond the APM-typical "hot for hours, gone after days" retention. Lakehouse-resident trace tables (systems/uc-otel-trace-tables) provide this — "Previous limits on traces per experiment are no longer applicable".
  2. Queryable schema: traces must be selectable / filterable to materialize into an evaluation dataset. UC OTel trace tables expose the _otel_spans and _trace_unified views via SQL.
  3. Stable instrumentation boundary: the trace shape must be portable enough that prod traces from yesterday remain comparable to today's instrumentation — concepts/instrumentation-storage-decoupling via OTel provides this.

Without all three, prod-traces-as-eval is impractical: ephemeral retention forces re-running scenarios; non-queryable storage forces manual scripting; unstable instrumentation makes historical traces incompatible with current judges.
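
The "durable and queryable" requirement can be made concrete with a minimal sketch. It uses an in-memory SQLite database as a stand-in for the SQL warehouse. The table name `trace_spans`, its columns, and the helper `materialize_eval_records` are hypothetical; they are not the actual UC OTel trace table schema or an MLflow API.

```python
import sqlite3

# Hypothetical, simplified trace table. The real UC OTel tables have a
# different schema; this only illustrates "durable + queryable".
SCHEMA = """
CREATE TABLE trace_spans (
    trace_id   TEXT PRIMARY KEY,
    start_time TEXT,   -- ISO-8601 timestamp
    status     TEXT,   -- 'OK' or 'ERROR'
    request    TEXT,   -- user prompt
    response   TEXT    -- agent answer
)
"""

def materialize_eval_records(conn, since, limit=100):
    """Select traces newer than `since` and shape them as eval records."""
    rows = conn.execute(
        "SELECT trace_id, request, response FROM trace_spans "
        "WHERE start_time >= ? ORDER BY start_time LIMIT ?",
        (since, limit),
    ).fetchall()
    return [
        {"trace_id": t, "inputs": {"prompt": req}, "outputs": {"response": resp}}
        for t, req, resp in rows
    ]

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO trace_spans VALUES (?, ?, ?, ?, ?)",
    [
        ("t1", "2026-05-01T10:00:00", "OK", "Reset my password", "Here is how..."),
        ("t2", "2026-05-20T09:30:00", "ERROR", "Cancel order 17", ""),
        ("t3", "2026-05-21T14:05:00", "OK", "Where is my refund?", "Refunds take..."),
    ],
)

# Harvest only traces from the chosen time window.
records = materialize_eval_records(conn, since="2026-05-15T00:00:00")
print([r["trace_id"] for r in records])
```

Because the filter is plain SQL over persisted rows, the dataset can be rebuilt for any time window without re-running the scenarios, which is the point of the first two properties.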

The closed loop

production traces ──► UC OTel trace tables (durable, queryable)
                              │
                              ▼
                       SQL warehouse: search + materialize
                              │
                              ▼
                       evaluation dataset records
                              │
                  ┌───────────┼───────────┐
                  │           │           │
                  ▼           ▼           ▼
           dev evaluation  prod monitor  drift alerts
              (pre-release) (continuous) (regression)
                  │           │           │
                  └───────────┼───────────┘
                              ▼
                       MLflow LLM judges
                       (built-in + custom guidelines)

Two evaluation modes from the same substrate

| Mode | When | Input | Property |
|---|---|---|---|
| Development eval | Pre-release | Bootstrapped historical-prod-traces dataset | Validates behaviour before deployment; baseline for regression detection |
| Production monitoring | Continuous, post-release | Live trace stream | Detects regressions / drift / emerging failure patterns "as the application evolves" |

The structural payoff: the same judge code, the same trace schema, the same UC tables back both modes. There is no "dev eval pipeline" and "prod eval pipeline" to keep in sync — they are the same pipeline running over different time windows.
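
A minimal sketch of "one pipeline, two time windows", assuming a toy rule-based judge in place of an LLM judge. The names `concise_answer_judge`, `pass_rate`, and `regressed`, the record shape, and the 10% tolerance are all illustrative assumptions, not part of MLflow.

```python
from typing import Callable, Iterable

Record = dict  # assumed shape: {"inputs": {...}, "outputs": {...}}

def concise_answer_judge(record: Record) -> bool:
    """Toy stand-in for an LLM judge: non-empty response of at most 200 chars."""
    response = record["outputs"]["response"]
    return 0 < len(response) <= 200

def pass_rate(records: Iterable[Record], judge: Callable[[Record], bool]) -> float:
    """Fraction of records the judge accepts (0.0 for an empty window)."""
    verdicts = [judge(r) for r in records]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

def regressed(baseline: float, live: float, tolerance: float = 0.10) -> bool:
    """Flag a regression when the live pass rate drops by more than `tolerance`."""
    return (baseline - live) > tolerance

# Same record shape, different time windows.
bootstrapped = [
    {"inputs": {"prompt": "a"}, "outputs": {"response": "short answer"}},
    {"inputs": {"prompt": "b"}, "outputs": {"response": "another answer"}},
]
live_window = [
    {"inputs": {"prompt": "c"}, "outputs": {"response": ""}},
    {"inputs": {"prompt": "d"}, "outputs": {"response": "fine"}},
]

baseline_rate = pass_rate(bootstrapped, concise_answer_judge)  # development eval
live_rate = pass_rate(live_window, concise_answer_judge)       # production monitoring
print(baseline_rate, live_rate, regressed(baseline_rate, live_rate))
```

The judge function is passed unchanged to both calls; only the input window differs. A drop in the live pass rate is therefore attributable to the application or the traffic, not to a divergent scoring path.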

Why this is a non-trivial design choice

Many evaluation stacks treat dev and prod as separate concerns:

  • Dev eval runs against a static curated test set, scored offline.
  • Prod monitoring uses operational metrics (latency, error rate, token usage) without LLM-judge scoring.

The unified substrate collapses this gap. The cost: LLM-judge scoring now runs continuously rather than once per release. The benefit: regressions surface in production in the same vocabulary as dev evaluation, so the question "is this regression real, or did the eval change?" becomes much easier to answer, because the judges and the trace schema are held constant across both modes.

Caveats

  • Privacy. Production traces contain user-supplied prompts which often include PII. Using them as evaluation inputs requires the same column-masking / row-filtering posture that governs the trace tables generally — see systems/unity-catalog.
  • Sampling vs full retention. "Bootstrap" implies sampling from prod traces — the post does not specify how to choose the sample, and biased sampling (e.g. only sampling errored traces) produces a biased eval dataset.
  • Distribution drift between bootstrap and use. A dataset bootstrapped six months ago may no longer reflect current usage. Continuous monitoring with the same judges addresses this on the prod side, but the dev eval side needs explicit refresh discipline.
  • Judge quality is the limiting factor. Judges may be built-in or derived from custom guidelines, but if the judge is wrong, the eval is wrong. The 2026-05-13 Claroty CSAF post (separate ingest) explicitly mandates pass/fail/unknown ternary judges to mitigate this; this 2026-05-22 post does not specify any such discipline.
  • Prod-traces-as-substrate doesn't replace adversarial testing. Some categories of failure (jailbreaks, prompt-injection, social-engineering) are rarely surfaced in normal user traffic — explicit adversarial test sets remain necessary as a complement.
  • Cost. Continuous LLM-judge scoring on every prod trace can be expensive; teams typically sample. The post does not address sampling strategy.
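
The sampling caveats above can be sketched as follows. This is one possible approach under stated assumptions, not something the post prescribes: `stratified_sample` draws proportionally from each stratum so the bootstrapped dataset keeps roughly the production mix, and `should_score` uses a hash of the trace ID to pick a stable subset of live traces for judge scoring. Both helper names and the `status` field are hypothetical.

```python
import hashlib
import random
from collections import defaultdict

def stratified_sample(traces, key, fraction, seed=0):
    """Sample about `fraction` of each stratum, keeping at least one per stratum."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    groups = defaultdict(list)
    for trace in traces:
        groups[key(trace)].append(trace)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

def should_score(trace_id, rate):
    """Deterministically select roughly `rate` of traces for judge scoring."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate

# Synthetic stand-in traffic: 10% errored traces, 90% successful ones.
traces = [
    {"trace_id": f"t{i}", "status": "ERROR" if i % 10 == 0 else "OK"}
    for i in range(100)
]

dataset = stratified_sample(traces, key=lambda t: t["status"], fraction=0.1)
scored = [t for t in traces if should_score(t["trace_id"], rate=0.2)]
```

Proportional allocation avoids the "only errored traces" bias while still guaranteeing every stratum appears. Hash-based selection means the same trace is always in or out of the scored subset, so re-runs are comparable.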

Relationship to other concepts

  • Substrate for patterns/bootstrap-eval-dataset-from-production-traces — the canonical pattern that operationalises this concept.
  • Composes with concepts/llm-as-judge — judges are the scoring primitive that consumes the substrate.
  • Enabled by concepts/lakehouse-native-observability — durable, queryable trace storage is the prerequisite.
  • Sibling to patterns/snapshot-replay-agent-evaluation at Databricks: that pattern uses synthetic-or-curated snapshots; this concept uses prod traces directly.

Seen in

  • sources/2026-05-22-databricks-observability-any-agent-anywhere-otel-unity-catalog
