SYSTEM Cited by 1 source

MLflow OTel Tracing

MLflow OTel Tracing is the agent-instrumentation surface within MLflow (3.x) that emits OpenTelemetry-format traces and routes them — via Zerobus Ingest — into UC OTel Trace Tables in the customer's Unity Catalog. It is the framework-side companion to the lakehouse-resident storage layer; together they implement the "observability for any agent, anywhere" promise of the 2026-05-22 launch.

Two instrumentation modes

From the source:

"You can do automatic and/or manual tracing… In our example, we rely on mlflow.langchain.autolog() to capture the detailed LangGraph execution (model calls and tool calls). We also wrap the entrypoint with @MLflow.trace to establish a request-level root span, allowing each invocation to be observed as a single end-to-end execution." — Source: sources/2026-05-22-databricks-observability-any-agent-anywhere-otel-unity-catalog

| Mode | Mechanism | What it captures |
| --- | --- | --- |
| Automatic | Library-specific autolog (mlflow.langchain.autolog(), etc.) | Detailed per-call execution: model calls, tool calls, retrieval, intermediate steps |
| Manual | @mlflow.trace decorator on the entrypoint function | Request-level root span; binds the whole invocation as a single end-to-end execution |

The composition (autolog + manual root) is the recommended shape: autolog gives you the inner spans for free; the manual root gives every trace a stable, queryable boundary.
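
To make the role of the manual root concrete, here is a toy, standard-library-only model of span nesting. It is not MLflow's implementation: the `traced` decorator, the `SPANS` list, and the agent functions are all hypothetical names invented for illustration. The inner `llm_call` and `tool_call` spans stand in for what autolog would emit, and `handle_request` plays the part of the manual root span.

```python
import contextvars
import functools
import uuid

# Hypothetical in-memory span store; a real setup would export spans instead.
SPANS = []
_current_span = contextvars.ContextVar("current_span", default=None)


def traced(name):
    """Record a span for each call, nesting under the active span if there is one."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            parent = _current_span.get()
            span = {
                "name": name,
                "span_id": uuid.uuid4().hex[:16],
                # Inherit the parent's trace ID; with no parent, a new trace starts.
                "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
                "parent_id": parent["span_id"] if parent else None,
            }
            token = _current_span.set(span)
            try:
                return fn(*args, **kwargs)
            finally:
                _current_span.reset(token)
                SPANS.append(span)
        return wrapper
    return decorator


@traced("tool_call")  # stands in for an automatically captured tool span
def call_tool(query):
    return f"rows for {query}"


@traced("llm_call")  # stands in for an automatically captured model span
def call_model(prompt):
    return f"answer to {prompt}"


def agent_without_root(question):
    call_tool(question)
    return call_model(question)


@traced("handle_request")  # plays the role of the manual root span
def agent_with_root(question):
    call_tool(question)
    return call_model(question)


def trace_ids():
    """Distinct trace IDs among the spans recorded so far."""
    return {span["trace_id"] for span in SPANS}
```

In this model, calling `agent_without_root` leaves each inner span in its own trace, while `agent_with_root` places both under one trace ID with a single parentless span, which is the "stable, queryable boundary" described above.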

Trace-table provisioning

MLflow is also the table-creation surface for the schema in systems/uc-otel-trace-tables:

"In this example, we use MLflow to create the underlying OpenTelemetry tables in Unity Catalog and link them to an MLflow experiment so traces can be searched, analyzed, and annotated from the UI. Start by identifying (or creating) a SQL warehouse and an MLflow experiment, then use the MLflow Python library to provision the Unity Catalog tables and associate the schema with the experiment."

Setup chain:

  1. Identify or create a SQL warehouse.
  2. Identify or create an MLflow experiment.
  3. Use the MLflow Python library to provision the six UC tables/views.
  4. Associate the schema with the experiment.
  5. Point any OTLP client at the resulting endpoint via Zerobus REST or gRPC.

After this one-time setup, "agent instrumentation remains the same. Any OTel-compatible instrumentation library can export traces to the configured endpoint."
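
As a rough illustration of what "any OTLP client" puts on the wire, the sketch below builds a single-span body in the OTLP/HTTP JSON encoding and prepares (without sending) the POST request. The field names follow the OTLP specification; everything specific to the deployment is an assumption: the endpoint URL is a placeholder, the bearer-token header is a guess at the auth scheme, and the source does not say whether the Zerobus endpoint accepts JSON rather than protobuf. In practice an OTel SDK exporter would do this for you.

```python
import json
import os
import time
import urllib.request


def build_otlp_payload(service_name, span_name, start_ns, end_ns, attributes):
    """Build an OTLP/HTTP JSON body containing one span."""
    def attr(key, value):
        return {"key": key, "value": {"stringValue": str(value)}}

    span = {
        "traceId": os.urandom(16).hex(),  # 16 bytes, hex-encoded (32 chars)
        "spanId": os.urandom(8).hex(),    # 8 bytes, hex-encoded (16 chars)
        "name": span_name,
        "kind": 1,                        # SPAN_KIND_INTERNAL
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
        "attributes": [attr(k, v) for k, v in attributes.items()],
    }
    return {
        "resourceSpans": [{
            "resource": {"attributes": [attr("service.name", service_name)]},
            "scopeSpans": [{"scope": {"name": "manual-example"}, "spans": [span]}],
        }]
    }


def build_request(endpoint, token, payload):
    """Prepare, but do not send, the HTTP POST that carries the payload."""
    return urllib.request.Request(
        url=endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


now = time.time_ns()
payload = build_otlp_payload(
    "support-assistant", "handle_request", now, now + 5_000_000,
    {"user.question": "example"},
)
# Placeholder endpoint and token; substitute the values from your own setup.
request = build_request("https://example.invalid/v1/traces", "PLACEHOLDER_TOKEN", payload)
```

Sending would be a matter of passing `request` to `urllib.request.urlopen`; it is left out here because the real endpoint and credentials are workspace-specific.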

Where it sits in the stack

agent code (LangGraph / OpenAI SDK / Anthropic SDK / framework-agnostic)
        │  @mlflow.trace decorator (manual root span)
        │  mlflow.<library>.autolog() (automatic per-call spans)
        ▼
   OTel SDK (per-language)
        │  OTLP/gRPC or REST
        ▼
   Zerobus Ingest (managed)
        │
        ▼
   UC OTel Trace Tables (Delta-backed)
        ├── MLflow Experiment UI ── search / drill / annotate / judge-score
        └── SQL / Genie / dashboards / ETL ── the broader lakehouse consumer set

Decoupling property: agents can run anywhere

The structural payoff:

"the agent can be running anywhere. In fact the support assistant agent example that was used for this blog is deployed locally." (FAQ)

The instrumentation library + OTel + Zerobus's REST endpoint together constitute a portable observability boundary — agents in customer VPCs, on developer laptops, in third-party clouds, or inside Databricks Apps all emit to the same UC tables. The agent runtime is not coupled to Databricks. This is the canonical instance of concepts/instrumentation-storage-decoupling applied to MLflow.

Closing the loop: production traces → evaluation

MLflow OTel Tracing is also the substrate for MLflow's evaluation flow:

"MLflow allows us to run evaluations against an evaluation dataset, applying built-in or custom judges to score response quality. One effective approach is to bootstrap this dataset from real traces. Because these prompts originate from actual user interactions, they better represent the scenarios your agent must handle compared to purely synthetic test cases."

"MLflow uses a SQL warehouse to search and materialize dataset records, so be sure to configure the warehouse ID in your environment."

And in production:

"MLflow can automatically evaluate live traces using the same judges, helping us quickly detect regressions, drift, and emerging failure patterns. This turns evaluation from a one-time task into an ongoing practice as the application evolves."

This flow is the canonical instance of patterns/bootstrap-eval-dataset-from-production-traces and concepts/production-traces-as-evaluation-substrate.
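
The bootstrapping idea can be sketched without any MLflow calls. The function below takes trace records that have already been fetched and turns their request text into deduplicated evaluation records. The record shape (`trace_id`, `request`) and the output shape (`inputs`, `source_trace_id`) are assumptions made for this sketch, not the schema of the UC trace tables or of MLflow's dataset API.

```python
def bootstrap_eval_dataset(traces, max_records=100):
    """Turn request text from production traces into deduplicated eval records."""
    seen = set()
    records = []
    for trace in traces:
        prompt = (trace.get("request") or "").strip()
        # Normalise case and whitespace so near-identical prompts collapse.
        key = " ".join(prompt.lower().split())
        if not prompt or key in seen:
            continue
        seen.add(key)
        records.append({
            "inputs": {"question": prompt},
            "source_trace_id": trace["trace_id"],  # keeps provenance for later drill-down
        })
        if len(records) >= max_records:
            break
    return records


# Hypothetical trace records, standing in for rows read from the trace tables.
sample_traces = [
    {"trace_id": "t1", "request": "Which engineer closed the most tickets?"},
    {"trace_id": "t2", "request": "  which engineer closed the most tickets? "},
    {"trace_id": "t3", "request": ""},
    {"trace_id": "t4", "request": "What is the average resolution time?"},
]
dataset = bootstrap_eval_dataset(sample_traces)
```

Keeping `source_trace_id` on each record is a design choice: when a judge flags a regression, the failing example can be traced back to the original production invocation.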

Reference instrumentation (article's example)

The 2026-05-22 post's reference agent — "Support Manager Assistant":

  • Framework: LangGraph (deployed locally, outside Databricks).
  • Model: Databricks-hosted Claude Sonnet 4.6 (via Foundation Model APIs).
  • Tool: Genie Space over the MCP tool API for SQL-driven Q&A.
  • Instrumentation: mlflow.langchain.autolog() + @mlflow.trace on the entrypoint.
  • Sample query: "Which support engineer should I put up for promotion?" — agent makes 3 Genie tool calls + final summarisation; trace surfaces 3 tool spans + 1 root span + LLM-call spans.

Native dashboards (MLflow Experiment UI)

"The MLflow Experiment UI now ships with native observability dashboards for traces in Unity Catalog, including views for trace volume, errors, latency, token usage, and cost. For most teams, that's enough to monitor day-to-day agent health."

Five default dashboard views:

| View | Native granularity |
| --- | --- |
| Trace volume | Per experiment / time window |
| Errors | Per error type / time |
| Latency | Trace-level P50 / P99 (extend to span-level via patterns/component-level-latency-from-otel-spans) |
| Token usage | Per model / time |
| Cost | List-price; extend with custom SQL for contract pricing |

When the native views aren't enough, "the trace tables are still just Delta tables" — custom AI/BI dashboards on top of the same UC tables are the escape hatch.
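
A minimal sketch of the two extensions mentioned in the table, span-level latency percentiles and contract-price cost, using an in-memory SQLite database as a stand-in for the Delta tables. The table names, column names, token counts, and prices are all invented for illustration; the real UC trace-table schema will differ, and the SQL would run on a SQL warehouse rather than SQLite.

```python
import math
import sqlite3


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE spans (trace_id TEXT, name TEXT, model TEXT, duration_ms REAL,
                        input_tokens INTEGER, output_tokens INTEGER);
    CREATE TABLE contract_prices (model TEXT, usd_per_1k_input REAL,
                                  usd_per_1k_output REAL);
""")
# Hypothetical span rows: two traces, each with one model call and one tool call.
conn.executemany("INSERT INTO spans VALUES (?, ?, ?, ?, ?, ?)", [
    ("t1", "llm_call", "model-a", 800.0, 1000, 200),
    ("t1", "tool_call", None, 300.0, 0, 0),
    ("t2", "llm_call", "model-a", 1200.0, 2000, 400),
    ("t2", "tool_call", None, 500.0, 0, 0),
])
# Hypothetical negotiated prices, replacing list prices.
conn.execute("INSERT INTO contract_prices VALUES ('model-a', 0.002, 0.01)")

# Cost at contract pricing: join token usage against the price table.
cost_by_model = conn.execute("""
    SELECT s.model,
           SUM(s.input_tokens  / 1000.0 * p.usd_per_1k_input
             + s.output_tokens / 1000.0 * p.usd_per_1k_output) AS cost_usd
    FROM spans s
    JOIN contract_prices p ON p.model = s.model
    GROUP BY s.model
""").fetchall()

# Span-level latency: group durations by span name, then take P50 / P99.
durations = {}
for name, duration_ms in conn.execute("SELECT name, duration_ms FROM spans"):
    durations.setdefault(name, []).append(duration_ms)
latency_p50_p99 = {
    name: (percentile(vals, 50), percentile(vals, 99))
    for name, vals in durations.items()
}
```

The percentile step is done in Python only because SQLite has no built-in percentile function; on a warehouse the same grouping would normally be a single SQL aggregate.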

Caveats

  • Tied to the MLflow experiment model. Customers who don't want experiment-scoped trace organisation must work around it; the post focuses on experiment-attached flows.
  • Autolog quality is library-specific. mlflow.langchain.autolog() is mature; coverage for less-common frameworks is unstated.
  • Manual root-span hygiene is on the developer. Without @mlflow.trace on the entrypoint, traces fragment: autolog-only setups surface inner spans without a clean trace boundary.
  • Judge integration assumes high-quality LLM judges. The post names "built-in or custom judges" but does not benchmark judge accuracy. Compare with the 2026-05-13 Claroty CSAF ingest's deliberately conservative pass/fail/unknown ternary.
  • No throughput / latency SLO for the instrumentation path itself. Agent-side overhead of @mlflow.trace + autolog is not characterised.
  • Setup requires SQL warehouse provisioning in addition to an experiment — non-trivial for teams new to Databricks.
