Databricks — Observability for any agent, anywhere: Production-ready tracing with OpenTelemetry & Unity Catalog
Summary
Databricks ships OTel-format trace ingestion direct to Unity Catalog Delta tables, decoupling agent instrumentation from storage so production traces become a first-class lakehouse dataset rather than a SaaS-observability silo. The mechanism is a managed serverless ingestion engine — Zerobus Ingest — that natively speaks OTLP/gRPC for open-source collectors and a REST API for application frameworks like MLflow, with a "single-sink" shape that "streams data directly to the lakehouse" and explicitly "bypass[es] intermediate message buses like Kafka". Spans, logs, and metrics land as governed Delta tables (six MLflow-managed views: <prefix>_otel_spans / _otel_logs / _otel_metrics / _otel_annotations / _trace_unified / _trace_metadata) under the same UC controls as enterprise data — column masking, row-level filtering, catalog/schema RBAC. Throughput begins at 200 QPS, storage is unlimited, and the previous "trace cap per experiment" MLflow constraint is eliminated. The post closes the loop on the data substrate: production traces bootstrap evaluation datasets for MLflow LLM-judge scoring during dev and live monitoring against the same judges in production. Customer disclosures: Experian (Eva virtual assistant + Latte email — "hundreds of thousands of traces"), Superhuman/Grammarly ("hundreds of thousands of traces per day", replaces a custom point solution), SmartSheet (two production agents in a "three-day co-build", "tens of thousands of evaluations"), The Standard (insurance underwriting + claims agents). Tier-3 product post that ingests on architecture grounds: single-sink managed OTel ingestion to a lakehouse is a structurally distinct shape from APM-vendor-backed agent observability and worth canonicalising as a system + pattern.
Key takeaways
- Single-sink architecture bypasses intermediate brokers. "With a 'single-sink' architecture, Zerobus Ingest simplifies observability by streaming data directly to the lakehouse. Existing OTLP-compatible collectors can point directly to this endpoint via gRPC, entirely bypassing intermediate message buses like Kafka." The structural argument is one fewer hop, one fewer system to operate, and one fewer schema-translation boundary. Canonicalised as concepts/single-sink-telemetry-architecture and patterns/managed-otel-ingestion-direct-to-lakehouse.
- OTel as the protocol-portable boundary between instrumentation and storage. "using the OTel standard to separate instrumentation from storage" — any OTLP-compatible client (LangGraph, OpenAI SDK, Anthropic SDK, framework-agnostic Python/JS/Go SDKs) exports to the same endpoint. "the agent can be running anywhere. In fact the support assistant agent example that was used for this blog is deployed locally." Canonicalised as concepts/instrumentation-storage-decoupling.
- Lakehouse-resident trace tables inherit governance for free. "Once traces are in tables, you can treat them like any other dataset: query them with SQL, build dashboards, run ETL pipelines, use tools like Genie, and apply governance controls such as PII masking." The data-classification → tag → column-mask / row-filter pipeline UC already runs over business data automatically applies to prompt/response payloads in trace tables. "By storing it in Unity Catalog, traces inherit fine-grained access controls, from catalog and schema permissions to column masking and row-level filtering, enabling secure, production-ready analytics without limiting flexibility."
- Six derived views are the load-bearing schema surface. "the MLflow service automatically creates Databricks SQL views alongside them that transform the OpenTelemetry data into an MLflow-friendly format for easier querying and analysis":
  - `<prefix>_otel_spans` (per-request span execution)
  - `<prefix>_otel_logs` (structured log events)
  - `<prefix>_otel_metrics` (numerical telemetry)
  - `<prefix>_otel_annotations` (MLflow-specific tags / assessments / feedback / expectations / run links)
  - `<prefix>_trace_unified` (one record per trace, raw spans + metadata)
  - `<prefix>_trace_metadata` (MLflow tags / metadata / assessments grouped by trace ID; "more performant than the unified view when you only need MLflow trace metadata")

  Canonicalised as systems/uc-otel-trace-tables.
- Production traces are evaluation-dataset bootstrap. "One effective approach is to bootstrap this dataset from real traces. Because these prompts originate from actual user interactions, they better represent the scenarios your agent must handle compared to purely synthetic test cases." MLflow uses a SQL warehouse to "search and materialize dataset records" from the trace tables; built-in or custom-guideline LLM judges then score against the bootstrapped dataset. Canonicalised as concepts/production-traces-as-evaluation-substrate and patterns/bootstrap-eval-dataset-from-production-traces.
- The same judges run in production, not just dev. "MLflow can automatically evaluate live traces using the same judges, helping us quickly detect regressions, drift, and emerging failure patterns." Evaluation becomes "an ongoing practice as the application evolves" rather than a release-gate one-shot — the lakehouse substrate makes this tractable because traces are durable analytical data, not ephemeral APM events.
- Lakehouse cost economics is the structural argument vs SaaS observability. Three named asymmetries, verbatim: "Retention economics: Agents generate massive text payloads. Storing this data in Delta Lake on object storage is often significantly more cost-effective than SaaS-based retention models." / "The PII deadlock: Sending raw prompts to third-party platforms can create InfoSec friction. Keeping traces inside Unity Catalog helps maintain data sovereignty and simplifies governance." / "Analytics, not just telemetry: While SaaS tools are strong for operational metrics like latency, the Lakehouse provides an analytics engine. You can join traces with business data, such as revenue and conversions, to understand real impact."
- Component-level latency requires per-tool span breakdown. "Native latency views show P50/P99 at the trace level. To go a layer deeper and see which tool is slow, we built a Tool Performance widget that breaks down latency (P50, P99) and error rates per individual tool in the agent (for example, retrieve_docs vs. generate_response). That tells us whether the LLM, a Genie tool call, or another step is the bottleneck." The OTel-spans table is the substrate; custom SQL is the implementation. Canonicalised as patterns/component-level-latency-from-otel-spans.
- Custom contract pricing is a SaaS-vs-lakehouse asymmetry. "Native cost metrics rely on standard list prices, which can be off for teams that have negotiated rates or run fine-tuned models with different pricing. Because we control the SQL, we embedded our pricing logic directly into the query. The dashboard tracks token usage by model type (for example, GPT 5.5 vs. Claude 4.6 Sonnet) and applies our contract rates to produce an Estimated Cost per Trace that reflects what we actually pay." — concrete operational example: "a single complex query that costs $0.50 because of a retrieval loop".
- Change Data Feed turns observability tables into ETL inputs. "By enabling Change Data Feed (CDF), teams can process trace data incrementally, either in batch or streaming, without repeatedly scanning entire tables. This makes it possible to operationalize observability. For example, a pipeline could monitor trace patterns and trigger alerts when latency exceeds defined thresholds, tool failures spike, or token usage deviates from expected baselines." Distinguishes from request-time guardrails: "this complements real-time protections such as AI Guardrails. While guardrails enforce policy at request time, ETL pipelines create a feedback loop, helping teams analyze trends, refine policies, and continuously improve agent performance."
- Auto liquid-clustering + materialized views are the disclosed perf primitives. From the FAQ: "With the latest product update, the tables are automatically liquid clustered to keep the data optimally organized. For larger trace volumes, however, you should create a materialized view on top of the derived views and incrementally refresh it to maintain query performance."
- No special PII handling out of the box — UC is the ask. "This feature does not apply any special handling to PII. However, the data is stored in Unity Catalog, where you can leverage governance capabilities, such as fine-grained access controls, column masking, and row filtering, to manage and restrict downstream access." Pushes the responsibility to the existing UC data-classification + tag + ABAC pipeline rather than re-implementing PII handling at the trace ingest layer.
- Throughput floor disclosed: 200 QPS. From the FAQ: "Ingestion throughput limit starts at 200 QPS. There is no limit on storage. Previous limits on traces per experiment are no longer applicable. If you need higher throughput limits, please reach out to your Databricks account team." The MLflow per-experiment trace cap is named as a constraint that this architecture eliminates.
- Customer signals — observability is replacing custom-built point solutions. Superhuman framing: "We're standardizing on MLflow tracing as the observability layer for all of our AI agents at Superhuman. We prefer the broader platform integration over building and maintaining a custom or point solution — that maintenance burden was a real pain point for our teams." SmartSheet: "During a three-day co-build with Databricks, we stood up two production agents using MLflow tracing, evaluations, custom judges, and labeling — and with traces stored in Unity Catalog, we can run tens of thousands of evaluations and iterate on quality with confidence as we scale." The Standard: "By governing traces in Unity Catalog alongside the rest of our data on the Databricks Data Intelligence Platform, we can query, monitor and iterate securely — without adding unnecessary complexity."
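The Change Data Feed takeaway above (process trace data incrementally and alert on thresholds) can be sketched in plain Python. This is an in-memory stand-in, not a Databricks pipeline: the `_change_type` and `_commit_version` keys mirror Delta's CDF metadata columns, while `Checkpoint`, `process_increment`, the `latency_ms` / `tool_errors` fields and the sample rows are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Highest commit version already processed (hypothetical helper)."""
    last_version: int = -1


def process_increment(change_rows, checkpoint, latency_threshold_ms):
    """Look only at rows committed after the checkpoint and return alerts."""
    fresh = [r for r in change_rows
             if r["_commit_version"] > checkpoint.last_version]
    alerts = []
    for row in fresh:
        if row["_change_type"] != "insert":
            continue  # this sketch only alerts on newly inserted traces
        if row["latency_ms"] > latency_threshold_ms:
            alerts.append(("latency", row["trace_id"]))
        if row["tool_errors"] > 0:
            alerts.append(("tool_failure", row["trace_id"]))
    if fresh:
        # Advance the checkpoint so the next run skips these rows.
        checkpoint.last_version = max(r["_commit_version"] for r in fresh)
    return alerts


# Made-up change rows standing in for a CDF read of a trace table.
CHANGES = [
    {"_commit_version": 1, "_change_type": "insert", "trace_id": "t1",
     "latency_ms": 900, "tool_errors": 0},
    {"_commit_version": 2, "_change_type": "insert", "trace_id": "t2",
     "latency_ms": 6200, "tool_errors": 1},
]
CHECKPOINT = Checkpoint()
FIRST_RUN = process_increment(CHANGES, CHECKPOINT, latency_threshold_ms=5000)
SECOND_RUN = process_increment(CHANGES, CHECKPOINT, latency_threshold_ms=5000)
```

The second run returns no alerts because the checkpoint has already advanced past both commits, which is the "without repeatedly scanning entire tables" property the post attributes to CDF.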
Architecture extracted
Ingestion topology

    agent ──OTLP/gRPC──────┐
     (LangGraph,           │
      OpenAI SDK,          ├──► Zerobus Ingest (managed serverless)
      Anthropic SDK,       │    OTLP receiver + REST API endpoints
      framework-agnostic)  │              │
                           │              ▼
    MLflow client ──REST───┘    UC-managed Delta tables
                                (otel_spans, _logs, _metrics,
                                 _annotations,
                                 _trace_unified, _trace_metadata)
                                          │
                                          ▼
                                downstream consumers:
                                  ── MLflow experiment UI (search/drill/judge)
                                  ── SQL / dashboards / Genie spaces
                                  ── ETL pipelines (CDF-driven)
                                  ── evaluation datasets (bootstrap from traces)

The "single-sink" property is the explicit absence of an intermediate broker (Kafka / Pulsar / etc.) between the OTel-emitting agent and the Delta tables. "Zerobus Ingest acts as your high-throughput telemetry pipeline, handling ingestion and durability with zero infrastructure overhead."
Sample agent (article's reference implementation)
- LangGraph agent (deployed outside Databricks — "highlighting that trace ingestion is decoupled from where the agent runs").
- Powered by Databricks-hosted Claude Sonnet 4.6.
- Calls a Genie Space as a tool via the MCP tool API for SQL-driven Q&A.
- Instrumentation: `mlflow.langchain.autolog()` for automatic tracing of model + tool calls; the `@mlflow.trace` decorator wraps the entrypoint as the request-level root span.
- Reference query in the post: "Which support engineer should I put up for promotion?" The agent calls Genie three times, summarises performance metrics, and returns a recommendation. The trace surfaces all three Genie calls as separate spans.
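To make the span shape of the reference query concrete (one root span for the entrypoint, one child span per Genie call), here is a toy tracer written with the standard library only. It is not MLflow's implementation and does not call MLflow; `traced`, `ask_genie`, `answer` and the span dictionary layout are hypothetical stand-ins for what a tracing decorator does conceptually.

```python
import functools
import time

SPANS = []   # finished spans, in completion order
_STACK = []  # spans that are currently open


def traced(name):
    """Toy decorator: record a span and link it to the enclosing open span."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            parent = _STACK[-1]["name"] if _STACK else None
            span = {"name": name, "parent": parent}
            _STACK.append(span)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                span["duration_s"] = time.perf_counter() - start
                _STACK.pop()
                SPANS.append(span)
        return wrapper
    return decorator


@traced("genie_tool")
def ask_genie(question):
    # Stub for the tool call; a real agent would query a Genie Space here.
    return f"stub answer for: {question}"


@traced("support_assistant")
def answer(question):
    # The entrypoint becomes the root span; each tool call is a child span.
    sub_questions = [f"{question} (metric {i})" for i in range(1, 4)]
    return [ask_genie(q) for q in sub_questions]


RESULT = answer("Which support engineer should I put up for promotion?")
```

After the call, `SPANS` holds three `genie_tool` spans whose parent is `support_assistant`, followed by the root span with no parent.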
Schema surface (six tables, derived from raw OTel)
| Table | Purpose |
|---|---|
| `<prefix>_otel_spans` | Detailed span-level execution data per request |
| `<prefix>_otel_logs` | Structured log/event data captured during execution |
| `<prefix>_otel_metrics` | Numerical telemetry captured during execution |
| `<prefix>_otel_annotations` | MLflow-specific (tags, assessments/feedback, expectations, run links); not standard OTel |
| `<prefix>_trace_unified` | One record per trace; raw spans + metadata; load-bearing for ad-hoc SQL |
| `<prefix>_trace_metadata` | MLflow tags + metadata + assessments grouped by trace ID; "more performant than the unified view when you only need MLflow trace metadata" |
The tables are automatically liquid clustered as of the latest product update; for larger trace volumes, the recommendation is a materialized view on top of the derived views with incremental refresh.
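The per-tool latency and error-rate breakdown that the post builds on the spans table can be illustrated without a warehouse. The sketch below is plain Python over made-up rows; the field names `name`, `duration_ms` and `status` are assumptions (the post does not list the real span columns), and the nearest-rank percentile is one reasonable definition, not necessarily the one the dashboard uses.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def tool_performance(spans):
    """Group span rows by tool name and summarise latency and errors."""
    by_tool = defaultdict(list)
    for span in spans:
        by_tool[span["name"]].append(span)
    report = {}
    for tool, rows in by_tool.items():
        durations = [r["duration_ms"] for r in rows]
        errors = sum(1 for r in rows if r["status"] == "ERROR")
        report[tool] = {
            "calls": len(rows),
            "p50_ms": percentile(durations, 50),
            "p99_ms": percentile(durations, 99),
            "error_rate": errors / len(rows),
        }
    return report


# Hypothetical span rows; tool names follow the post's example.
SPAN_ROWS = [
    {"name": "retrieve_docs", "duration_ms": 100.0, "status": "OK"},
    {"name": "retrieve_docs", "duration_ms": 120.0, "status": "OK"},
    {"name": "retrieve_docs", "duration_ms": 140.0, "status": "OK"},
    {"name": "retrieve_docs", "duration_ms": 900.0, "status": "ERROR"},
    {"name": "generate_response", "duration_ms": 800.0, "status": "OK"},
    {"name": "generate_response", "duration_ms": 1000.0, "status": "OK"},
]
REPORT = tool_performance(SPAN_ROWS)
```

In this sample, `retrieve_docs` has a modest median but a slow, failing tail, which is the kind of signal a trace-level P50/P99 would hide.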
Three SaaS-vs-lakehouse asymmetries (named, verbatim)
| Asymmetry | Verbatim quote |
|---|---|
| Retention economics | "Agents generate massive text payloads. Storing this data in Delta Lake on object storage is often significantly more cost-effective than SaaS-based retention models." |
| The PII deadlock | "Sending raw prompts to third-party platforms can create InfoSec friction. Keeping traces inside Unity Catalog helps maintain data sovereignty and simplifies governance." |
| Analytics, not just telemetry | "While SaaS tools are strong for operational metrics like latency, the Lakehouse provides an analytics engine. You can join traces with business data, such as revenue and conversions, to understand real impact." |
Evaluation feedback loop

    prod traces (otel_spans + trace_metadata)
            │
            ▼
    SQL warehouse search/materialize
            │
            ▼
    evaluation dataset
            │
            ▼
    MLflow LLM judges (built-in + custom guidelines)
            │
            ├──► dev:  scored against eval dataset before release
            └──► prod: same judges run on live traces continuously
                       (regression / drift / emerging failure detection)

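The loop above can be sketched end to end in plain Python: deduplicate production prompts into dataset records, then apply one judge function to both a dev batch and a live batch. Everything here is a hypothetical stand-in. The record layout (`inputs`, `source_trace_id`), the helper names, the sample prompts and responses are invented for illustration, and `non_empty_judge` is a trivial rule, not an LLM judge.

```python
def normalise(prompt):
    """Collapse case and whitespace so near-identical prompts dedupe."""
    return " ".join(prompt.lower().split())


def bootstrap_dataset(traces, max_records):
    """Turn unique production prompts into evaluation dataset records."""
    seen, records = set(), []
    for trace in traces:
        if len(records) >= max_records:
            break
        key = normalise(trace["request"])
        if not key or key in seen:
            continue  # skip empty prompts and duplicates
        seen.add(key)
        records.append({
            "inputs": {"question": trace["request"]},
            "source_trace_id": trace["trace_id"],
        })
    return records


def non_empty_judge(response):
    """Trivial stand-in for a judge: pass if the response has content."""
    return bool(response.strip())


def pass_rate(responses, judge):
    """Share of responses the judge accepts."""
    return sum(1 for r in responses if judge(r)) / len(responses)


# Made-up trace rows standing in for a query over the trace tables.
TRACES = [
    {"trace_id": "t1", "request": "Which support engineer should I put up for promotion?"},
    {"trace_id": "t2", "request": "which support engineer  should I put up for promotion?"},
    {"trace_id": "t3", "request": "   "},
    {"trace_id": "t4", "request": "How many tickets were closed last week?"},
]
DATASET = bootstrap_dataset(TRACES, max_records=10)

# The same judge scores a dev batch and a live batch.
DEV_SCORE = pass_rate(["Recommend engineer A.", ""], non_empty_judge)
LIVE_SCORE = pass_rate(["42 tickets."], non_empty_judge)
```

Keeping `source_trace_id` on each record is a design choice of this sketch: it lets a failing evaluation be traced back to the production interaction it came from.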
Customer signals (verbatim)
- Experian — "hundreds of thousands of traces through governed Delta tables and evaluates agent quality at scale - all without leaving Databricks" (Eva virtual assistant + Latte email automation; James Lin, Head of AI/ML Innovation).
- Superhuman (Grammarly) — "scale to hundreds of thousands of traces per day, and our researchers can self-serve and explore agent behavior directly in the MLflow UI with no engineering support" (Martin Jewell, Lead MLE AI Infrastructure); explicitly prefers platform integration over a "custom or point solution", whose maintenance burden was "a real pain point for our teams".
- SmartSheet — "During a three-day co-build with Databricks, we stood up two production agents", "tens of thousands of evaluations" (Kapil Ashar, VP of Engineering).
- The Standard — insurance customer; agents extract underwriting + claims data; "By governing traces in Unity Catalog alongside the rest of our data… we can query, monitor and iterate securely - without adding unnecessary complexity" (Porter Orr, AVP of AI and Automation).
Operational numbers
- Throughput floor: 200 QPS (Zerobus Ingest). Higher limits are available via the account team.
- Storage limit: none.
- Trace-per-experiment cap: previously a constraint, now eliminated.
- Auto liquid-clustering: on by default after the latest product update.
- Customer scale points: Experian "hundreds of thousands of traces"; Superhuman "hundreds of thousands of traces per day"; SmartSheet "tens of thousands of evaluations".
- Sample-agent execution shape: 3 Genie tool calls per top-level user question, surfaced as 3 spans + 1 root span.
- Disclosed cost outlier example: "a single complex query that costs $0.50 because of a retrieval loop" (custom contract-pricing dashboard catches this; native list-price dashboards don't).
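The contract-pricing idea (apply negotiated rates to token usage to get an estimated cost per trace, then flag outliers) reduces to a small aggregation. The sketch below uses plain Python; the model keys, the per-1,000-token rates, the usage rows and the 0.25 USD threshold are all made up for illustration and do not come from the post.

```python
from collections import defaultdict

# Hypothetical negotiated rates in USD per 1,000 tokens (made-up numbers).
CONTRACT_RATES = {
    "model-a": {"input": 0.002, "output": 0.008},
    "model-b": {"input": 0.001, "output": 0.004},
}


def estimated_cost_per_trace(usage_rows, rates):
    """Sum token cost per trace using contract rates instead of list prices."""
    costs = defaultdict(float)
    for row in usage_rows:
        rate = rates[row["model"]]
        costs[row["trace_id"]] += (
            row["input_tokens"] / 1000 * rate["input"]
            + row["output_tokens"] / 1000 * rate["output"]
        )
    return dict(costs)


def cost_outliers(costs, threshold_usd):
    """Trace IDs whose estimated cost meets or exceeds the threshold."""
    return sorted(t for t, c in costs.items() if c >= threshold_usd)


# One cheap trace, plus one trace that loops over retrieval five times.
USAGE = [
    {"trace_id": "t1", "model": "model-b", "input_tokens": 1000, "output_tokens": 500},
] + [
    {"trace_id": "t2", "model": "model-a", "input_tokens": 40000, "output_tokens": 2500}
    for _ in range(5)
]
COSTS = estimated_cost_per_trace(USAGE, CONTRACT_RATES)
OUTLIERS = cost_outliers(COSTS, threshold_usd=0.25)
```

With these sample numbers the looping trace costs about 0.50 USD and is the only one flagged, while the single-call trace stays well under a cent.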
Caveats
- Tier-3 product-launch post, not an internals deep-dive. Most architectural claims are framed as feature descriptions, not stress-tested production retrospectives. The customer quotes are commercial testimonials and should be read as marketing-validated rather than independently audited.
- Zerobus Ingest internals are opaque. The post names the engine and its protocol surface (OTLP/gRPC + REST), but not partitioning, durability semantics (synchronous-write vs async-buffer), or back-pressure behaviour. "high-throughput telemetry pipeline, handling ingestion and durability with zero infrastructure overhead" is the only durability framing.
- Single-sink claim is architectural marketing, not a benchmark. The case for "bypassing intermediate message buses like Kafka" is plausible — fewer hops, less to operate — but the post does not benchmark Zerobus ingest latency vs a Kafka-fronted equivalent, nor does it discuss what fan-out looks like if the same traces need to land in multiple destinations.
- No PII handling at the ingest layer. Verbatim from the FAQ: "This feature does not apply any special handling to PII." The customer is responsible for applying UC column masks / row filters after the fact. For organisations without an existing UC data-classification pipeline, the "governance is automatic" framing is aspirational.
- 200 QPS starting throughput may be below what a high-traffic enterprise agent fleet generates. The post promises higher limits via the account team but does not name a public ceiling, so capacity planning requires a conversation with Databricks rather than a published number.
- MLflow trace cap removal is implementation-specific. Existing customers using older MLflow trace retention may need to re-provision tables to migrate; the post does not detail the migration path.
- The CDF-driven ETL feedback loop is described, not benchmarked. The post presents pipelines that "trigger alerts when latency exceeds defined thresholds, tool failures spike, or token usage deviates from expected baselines" as possible, but it names no customer who has built one and no alert-trigger latency target. CDF-driven ETL latency is typically minutes to tens of minutes, not the seconds-class real-time alerting an APM stack provides.
- The materialized-view recommendation is vague on the threshold. "For larger trace volumes, however, you should create a materialized view on top of the derived views and incrementally refresh it to maintain query performance." The post gives no scale number for "larger volumes", so practitioners must instrument and benchmark.
- The MLflow Experiment UI is a Databricks-platform consumer. Customers who want trace data visible in non-Databricks tooling get the lakehouse property (Delta tables queryable by external engines via Delta Kernel) but lose the MLflow UI's drill-into-spans + trace-tree-visualisation. The article frames this as a feature; for some teams it's a tradeoff.
- The evaluation feedback loop assumes high-quality LLM judges. "MLflow provides a set of built-in judges, and also allows us to define custom guidelines tailored to our agent's expected behavior." — but judge accuracy / agreement with humans is not disclosed. The 2026-05-13 Claroty CSAF post (separate ingest) explicitly disclosed a deliberately-conservative pass/fail/unknown judge ternary; no such discipline is described here.
- No throughput numbers from sample agent. The Support Manager Assistant reference is a single-user demo. Customer disclosures are at the "hundreds of thousands of traces" level but without QPS, payload-size, or judge-evaluation throughput numbers — operationally the "how big can this go" question is unanswered.
- Inference Tables vs OTel trace tables relationship is implicit. The 2026-05-20 Inference Tables post described full-payload prompt/response capture as a UC-Delta substrate; this 2026-05-22 post describes OTel-format spans/logs/metrics in UC-Delta. Both feed the "governed lakehouse audit trail" but the schema and granularity differ. The post does not explicitly relate the two; readers must infer that Inference Tables is a Unity-AI-Gateway-emitted full-payload audit substrate, while OTel trace tables are agent-side spans capturing the path of execution rather than the verbatim model-call payload.
New systems / concepts / patterns introduced
Systems:
- systems/zerobus-ingest — managed serverless OTLP/gRPC + REST receiver writing direct to UC Delta tables; named explicitly in this post for the first time on the wiki.
- systems/uc-otel-trace-tables — the six MLflow-derived UC table/view surface for OTel spans/logs/metrics + MLflow annotations.
- systems/mlflow-otel-tracing — MLflow's tracing API surface (autolog adapters for LangChain/LangGraph/OpenAI/etc., `@mlflow.trace` decorator, dataset-bootstrap from traces, judge integration); the agent-instrumentation companion to the lakehouse-resident tables.
Concepts:
- concepts/single-sink-telemetry-architecture — the structural shape: managed receiver → durable lakehouse store, no intermediate broker hop.
- concepts/instrumentation-storage-decoupling — OTel as the protocol-portable boundary, so agents can run anywhere and storage backends can swap without re-instrumenting.
- concepts/production-traces-as-evaluation-substrate — durable production traces become the source of truth for evaluation datasets, more representative than synthetic test cases.
Patterns:
- patterns/managed-otel-ingestion-direct-to-lakehouse — Zerobus shape: managed OTLP/gRPC + REST receiver, single-sink to columnar lakehouse tables, no broker.
- patterns/bootstrap-eval-dataset-from-production-traces — SQL-warehouse-driven materialisation of trace prompts as evaluation dataset records; pairs with LLM-judge scoring.
- patterns/component-level-latency-from-otel-spans — per-span (per-tool / per-model-call) latency P50/P99 dashboards built on OTel-spans table; finer granularity than trace-level latency.
Source
- Original: https://www.databricks.com/blog/observability-any-agent-anywhere-production-ready-tracing-opentelemetry-unity-catalog
- Raw markdown: raw/databricks/2026-05-22-observability-for-any-agent-anywhere-production-ready-tracin-23e97b04.md
Related
- companies/databricks
- systems/zerobus-ingest
- systems/uc-otel-trace-tables
- systems/mlflow-otel-tracing
- systems/mlflow
- systems/opentelemetry
- systems/unity-catalog
- systems/inference-tables — sibling full-payload audit substrate (2026-05-20).
- systems/databricks-genie — invoked as a tool by the article's reference agent.
- systems/langgraph — instrumentation library used in the reference agent.
- concepts/single-sink-telemetry-architecture
- concepts/instrumentation-storage-decoupling
- concepts/production-traces-as-evaluation-substrate
- concepts/lakehouse-native-observability
- concepts/observability
- concepts/delta-change-data-feed
- concepts/llm-as-judge
- concepts/audit-trail
- patterns/managed-otel-ingestion-direct-to-lakehouse
- patterns/bootstrap-eval-dataset-from-production-traces
- patterns/component-level-latency-from-otel-spans
- patterns/telemetry-to-lakehouse — generalisation; this source is the third in-scope citation.
- patterns/inference-payload-table-for-audit — sibling lakehouse audit shape.