PATTERN Cited by 1 source

Bootstrap eval dataset from production traces

Bootstrap eval dataset from production traces is the pattern of materializing evaluation dataset records by SQL-querying durable production trace tables — using real user-traffic prompts (with real tool-call context, real agent execution paths) as the eval corpus instead of hand-curated synthetic test cases.

Mechanics

production traces ──► durable lakehouse trace tables (e.g. UC OTel Trace Tables)
                              │  SQL query: SELECT request, response, ...
                              │              FROM <prefix>_trace_unified
                              │              WHERE <filters>
                              ▼
                       SQL warehouse materialises rows
                              │
                              ▼
                       evaluation dataset
                              │
                              ▼
                       LLM judges score each record (built-in + custom)
                              │
                              ▼
                       eval results in MLflow experiment UI

The pattern requires:

  1. Durable trace storage — eval is meaningless if the trace evaporated. Lakehouse-resident tables (systems/uc-otel-trace-tables) are the canonical substrate.
  2. Queryable schema — the dataset materialisation step is a SQL query, so the trace schema must support filtering / projection.
  3. A judge implementation — built-in MLflow judges or custom-guideline judges; whichever scores the eval records.
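
The materialisation step can be sketched as below. This is a minimal illustration, not MLflow's implementation: an in-memory SQLite database stands in for the SQL warehouse, and the table name `demo_trace_unified`, the `trace_id` and `start_time_ms` columns, and the helper `bootstrap_eval_records` are all hypothetical. Only `request` and `response` come from the query shown in the diagram.

```python
import sqlite3


def bootstrap_eval_records(conn, table, since_ms, limit=100):
    """Materialise eval records from a trace table with a plain SQL query.

    Column names other than request/response are assumptions for this sketch.
    """
    # Table names cannot be bound as parameters, so validate before interpolating.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unsafe table name: {table!r}")
    query = (
        f"SELECT trace_id, request, response FROM {table} "
        "WHERE start_time_ms >= ? AND request IS NOT NULL "
        "ORDER BY start_time_ms DESC LIMIT ?"
    )
    rows = conn.execute(query, (since_ms, limit)).fetchall()
    # Reshape each row into an inputs/outputs record that a judge can score.
    return [
        {"trace_id": tid, "inputs": {"request": req}, "outputs": {"response": resp}}
        for tid, req, resp in rows
    ]


# Stand-in for a durable trace table (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_trace_unified "
    "(trace_id TEXT, request TEXT, response TEXT, start_time_ms INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_trace_unified VALUES (?, ?, ?, ?)",
    [
        ("t1", "reset my password", "Sent a reset link.", 1000),
        ("t2", "cancel order 17", "Order cancelled.", 2000),
        ("t3", None, None, 3000),  # incomplete trace, filtered out by the query
    ],
)
records = bootstrap_eval_records(conn, "demo_trace_unified", since_ms=1500)
```

Because the filter and projection live in the SQL text, changing what the dataset contains (a different time window, a specific tool path) means editing the WHERE clause rather than rewriting the pipeline.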

Canonical instance: MLflow trace-bootstrap eval (Databricks, 2026-05-22)

"MLflow allows us to run evaluations against an evaluation dataset, applying built-in or custom judges to score response quality. One effective approach is to bootstrap this dataset from real traces. Because these prompts originate from actual user interactions, they better represent the scenarios your agent must handle compared to purely synthetic test cases. Below, we create an evaluation dataset from recently captured traces. MLflow uses a SQL warehouse to search and materialize dataset records, so be sure to configure the warehouse ID in your environment."

"With the dataset in place, we can define the judges that will score our application. MLflow provides a set of built-in judges, and also allows us to define custom guidelines tailored to our agent's expected behavior."

— Source: sources/2026-05-22-databricks-observability-any-agent-anywhere-otel-unity-catalog

The architectural payoff: the same schema that powers production-monitoring dashboards also powers dev evaluation. Two consumers, one substrate.

Two evaluation modes from the same pattern

Mode                       | Trigger      | Input                                         | Cadence
Dev-time eval              | Pre-release  | Bootstrapped dataset (historical prod traces) | One-shot or release-gated
Continuous prod monitoring | Live traffic | Streaming traces matched by judge             | Continuous

"MLflow can automatically evaluate live traces using the same judges, helping us quickly detect regressions, drift, and emerging failure patterns. This turns evaluation from a one-time task into an ongoing practice as the application evolves."

The unification (same judges, same trace schema, same UC tables) is the structural payoff. Without unification, dev and prod eval drift apart over time.
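
A minimal sketch of what "same judges, two modes" can look like in code. The judges here are trivial rule-based stand-ins for LLM judges, and every name (`JUDGES`, `score`, `run_dev_eval`, `monitor`) is hypothetical rather than part of MLflow; the point is only that both modes call one shared scoring function.

```python
def non_empty_response(record):
    """Stand-in judge: the agent produced some visible output."""
    return bool(record["response"].strip())


def no_apology_loop(record):
    """Stand-in judge: the response does not apologise three or more times."""
    return record["response"].lower().count("sorry") < 3


# One registry of judges, shared by both evaluation modes.
JUDGES = {
    "non_empty_response": non_empty_response,
    "no_apology_loop": no_apology_loop,
}


def score(record, judges=JUDGES):
    """Apply every judge to one record and return {judge_name: passed}."""
    return {name: judge(record) for name, judge in judges.items()}


def run_dev_eval(dataset):
    """Dev-time mode: score every record of a bootstrapped dataset once."""
    return [score(record) for record in dataset]


def monitor(stream):
    """Continuous mode: lazily yield (trace_id, scores) for failing traces."""
    for record in stream:
        scores = score(record)
        if not all(scores.values()):
            yield record["trace_id"], scores


dataset = [
    {"trace_id": "t1", "response": "Done."},
    {"trace_id": "t2", "response": "   "},
]
dev_results = run_dev_eval(dataset)
alerts = list(monitor(iter(dataset)))
```

If a judge is changed, both `run_dev_eval` and `monitor` pick up the change at once, which is the property that keeps dev and prod evaluation from drifting apart.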

Why prod-trace bootstrap beats synthetic

  • Distribution match: real user traffic captures the actual usage shape; synthetic test cases capture the author's mental model.
  • Long-tail coverage: edge cases that appear once per thousand requests in production can surface in the bootstrap; a curated test set contains them only if its author anticipated them.
  • Tool-call shape preserved: the real sequence of tool calls (which retrieval, in what order, with what parameters) is captured verbatim; synthetic re-implementations approximate.
  • Continuous freshness: the bootstrap reflects current usage; a curated test set lags reality unless explicitly maintained.

When the pattern is wrong

  • Cold-start agents: no production traces yet, so nothing to bootstrap from. Synthetic test sets remain the starting point until traffic accumulates.
  • Adversarial categories: jailbreaks / prompt-injection / social-engineering rarely appear in normal user traffic. Adversarial test sets must complement the bootstrap.
  • Privacy-sensitive environments: prod prompts may contain user PII. The bootstrap pipeline needs to apply column masking / row filtering before judges see data.
  • Sparse error events: if only a small fraction of traces exhibit the failure mode of interest, a naive bootstrap dilutes the signal. Bias the SQL query toward error-tagged traces so that failures are over-represented relative to their share of traffic.
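
For the sparse-error case, one way to bias the bootstrap is to reserve a share of the dataset for error-tagged traces. The sketch below assumes each trace carries a `status` field whose value is `"ERROR"` for failures; that field name, the `error_share` parameter, and the helper itself are hypothetical.

```python
import random


def error_biased_sample(traces, n, error_share=0.5, seed=0):
    """Pick up to n traces, reserving error_share of the slots for errors.

    Assumes a hypothetical 'status' field; 'ERROR' marks a failed trace.
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    errors = [t for t in traces if t["status"] == "ERROR"]
    ok = [t for t in traces if t["status"] != "ERROR"]
    # Take as many errors as the quota allows, capped by what exists.
    n_err = min(len(errors), round(n * error_share))
    # Fill the remaining slots with successful traces.
    n_ok = min(len(ok), n - n_err)
    return rng.sample(errors, n_err) + rng.sample(ok, n_ok)


# 2% error rate in traffic: a uniform sample of 10 would often contain no errors.
traces = [{"trace_id": f"ok-{i}", "status": "OK"} for i in range(98)]
traces += [{"trace_id": f"err-{i}", "status": "ERROR"} for i in range(2)]
sample = error_biased_sample(traces, n=10)
```

Note that scores computed on such a sample no longer estimate the production failure rate; they are useful for probing the failure mode, not for reporting overall quality.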

Composition with other patterns

Caveats

  • Sampling strategy is not specified. "Recently captured traces" is the post's framing, but recent ≠ representative. Operators must implement bias-aware sampling themselves (stratify by tool-path, error-status, latency tier, etc.).
  • Judge accuracy is the limiting factor. The post names "built-in or custom judges" but doesn't benchmark judge agreement with humans. The 2026-05-13 Claroty CSAF post (separate ingest) explicitly mandates a conservative pass/fail/unknown judge ternary; no such discipline is described here.
  • Distribution drift between bootstrap and use. A six-month-old bootstrap may no longer represent current usage. Continuous monitoring with the same judges catches this on the prod side; dev-eval refresh must be explicit.
  • Cost of LLM-judge scoring. Continuous evaluation across all prod traffic is expensive; teams typically sample. The post does not address cost-aware sampling.
  • Privacy. Prod prompts often contain PII. UC column masking should be applied before judge inputs leave the trace tables — but the pattern does not enforce this; operators must.
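
To make the privacy caveat concrete, here is a minimal sketch of redacting records before they reach a judge. It is an in-process stand-in, not UC column masking: the regular expressions catch only email addresses and long digit runs, and the names `mask_text` and `mask_record` and the `request`/`response` field choice are assumptions for illustration.

```python
import re

# Deliberately simple patterns; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_NUMBER = re.compile(r"\b\d{6,}\b")  # card, account, or phone-like digit runs


def mask_text(text):
    """Replace emails and long digit runs with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    return LONG_NUMBER.sub("<NUMBER>", text)


def mask_record(record, fields=("request", "response")):
    """Return a copy of the record with the given text fields masked."""
    masked = dict(record)  # shallow copy so the stored trace is not mutated
    for field in fields:
        if isinstance(masked.get(field), str):
            masked[field] = mask_text(masked[field])
    return masked


raw = {
    "trace_id": "t9",
    "request": "mail jane.doe@example.com about card 4111111111111111",
    "response": "Done.",
}
safe = mask_record(raw)
```

Masking at the table layer is still preferable where available, since it applies to every consumer of the trace tables rather than only to callers that remember to invoke a helper.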

Seen in
