PATTERN Cited by 1 source

Component-level latency from OTel spans

Component-level latency from OTel spans is the pattern of computing per-span / per-tool / per-component latency percentiles (P50 / P99) directly over OTel-spans tables to attribute end-user latency to the specific component in the agent execution path that's slow — instead of stopping at the trace-level (whole-request) latency that native dashboards default to.

Mechanics

trace-level latency (native dashboards)  ── tells you "P99 is high"
        │  drill in
span-level latency (this pattern)        ── tells you "the retrieve_docs tool is the bottleneck"
        │  custom SQL on <prefix>_otel_spans
SELECT span_name,
       PERCENTILE(duration, 0.5) AS p50,
       PERCENTILE(duration, 0.99) AS p99,
       AVG(error) AS error_rate
FROM <prefix>_otel_spans
WHERE trace_time > <window>
GROUP BY span_name
ORDER BY p99 DESC

The structural insight: trace-level latency folds every component's duration into a single number, hiding which component is responsible. Per-span aggregation separates them again.
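The same aggregation can be sketched outside the warehouse, which is handy for checking a dashboard query against a small sample. This is a minimal sketch, not the Databricks implementation: the row shape (`span_name`, `duration_ms`, `error`) is an assumed simplification of the spans table, and the nearest-rank percentile used here can differ slightly from a SQL engine's interpolating `PERCENTILE`.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """Nearest-rank percentile: the smallest value with at least q of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def span_latency_report(spans):
    """Group span rows by name and return one summary row per name, slowest P99 first."""
    by_name = defaultdict(list)
    for span in spans:
        by_name[span["span_name"]].append(span)

    report = []
    for name, rows in by_name.items():
        durations = [row["duration_ms"] for row in rows]
        report.append({
            "span_name": name,
            "p50": percentile(durations, 0.5),
            "p99": percentile(durations, 0.99),
            # error is assumed to be 0 or 1 per span, so the mean is the error rate
            "error_rate": sum(row["error"] for row in rows) / len(rows),
        })
    return sorted(report, key=lambda row: row["p99"], reverse=True)


# Hypothetical sample rows, for illustration only.
spans = [
    {"span_name": "retrieve_docs", "duration_ms": 900, "error": 0},
    {"span_name": "retrieve_docs", "duration_ms": 1100, "error": 0},
    {"span_name": "retrieve_docs", "duration_ms": 8000, "error": 1},
    {"span_name": "generate_response", "duration_ms": 1200, "error": 0},
    {"span_name": "generate_response", "duration_ms": 1300, "error": 0},
]
report = span_latency_report(spans)
for row in report:
    print(row)
```

With this sample, retrieve_docs sorts first because its P99 is the largest, mirroring the ORDER BY p99 DESC in the query above.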

Canonical instance: Databricks AI Operations Center (2026-05-22)

"Native latency views show P50/P99 at the trace level. To go a layer deeper and see which tool is slow, we built a Tool Performance widget that breaks down latency (P50, P99) and error rates per individual tool in the agent (for example, retrieve_docs vs. generate_response). That tells us whether the LLM, a Genie tool call, or another step is the bottleneck, so we can pinpoint exactly where the user experience is degrading."

— Source: sources/2026-05-22-databricks-observability-any-agent-anywhere-otel-unity-catalog

The Databricks AI Operations Center custom-dashboard example shows two patterns built on the same UC OTel substrate, of which this is the second:

  • Custom Cost Analysis with Contract Pricing — token-usage by model + contract pricing → Estimated Cost per Trace (catches outliers like "a single complex query that costs $0.50 because of a retrieval loop").
  • Component-Level Performance — this pattern.

Both exploit the same property: "the trace tables are still just Delta tables in Unity Catalog. You can build a custom AI/BI Dashboard against them and write standard SQL (with help from AI) to model whatever your team cares about."

Why span-level matters

A multi-tool agent's trace decomposes into:

root span (request)
├── LLM call 1 (planner)              ── 500 ms
├── tool: retrieve_docs                ── 8000 ms  ◄── bottleneck
├── tool: generate_response (LLM 2)    ── 1200 ms
└── tool: format_output                ── 50 ms

Trace-level latency reports "P99 = ~9.8 s" (the four spans above sum to 9.75 s). That is correct but uninformative: it does not tell you that retrieval is the cause. Span-level latency reports retrieve_docs P99 = 8 s, which is immediately actionable.
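The arithmetic behind the example trace can be worked through directly. This sketch assumes the four child spans run sequentially and that the root span adds no time of its own, so the trace duration is simply their sum; the tuple layout is illustrative, not a real span schema.

```python
# Child spans of the example trace: (span name, duration in milliseconds).
trace = [
    ("LLM call 1 (planner)", 500),
    ("retrieve_docs", 8000),
    ("generate_response (LLM 2)", 1200),
    ("format_output", 50),
]

# Sequential spans: the trace duration is the sum of its children.
total_ms = sum(duration for _, duration in trace)

# The bottleneck is the span that contributes the most time.
slowest_name, slowest_ms = max(trace, key=lambda span: span[1])
share = slowest_ms / total_ms

for name, duration in trace:
    print(f"{name:<28}{duration:>6} ms  {duration / total_ms:6.1%}")
print(f"total {total_ms} ms; slowest is {slowest_name} at {share:.0%} of the trace")
```

Here retrieve_docs accounts for roughly 82% of the 9,750 ms total, which is the signal the trace-level number alone cannot provide.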

When this pattern matters most

  • Multi-tool agents with many components per request — the more components, the more trace-level latency obscures which one matters.
  • Long-tail latency investigations: "why is P99 spiking?" is unanswerable at the trace level for any agent with ≥3 components.
  • Cost-attribution at the component level — the Databricks custom-pricing dashboard joins span identity (which tool / which model) with token usage to attribute cost per component.
  • Tool-replacement decisions: "if we swap retrieval implementations, what does P99 become?" requires a baseline attributed at the component level.
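For the tool-replacement case, a per-component baseline makes a rough what-if estimate possible. The sketch below is one hypothetical way to do it, not something described in the source: it assumes sequential spans (so a trace's duration is the sum of its components), represents each trace as a dict of component name to duration, and pessimistically pins the replacement component at a single candidate duration such as its benchmarked P99.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile over a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def trace_p99(traces):
    """P99 of whole-trace latency, assuming components run sequentially."""
    return percentile([sum(spans.values()) for spans in traces], 0.99)


def what_if_trace_p99(traces, component, candidate_ms):
    """Recompute trace P99 with one component's duration replaced by candidate_ms."""
    swapped = [{**spans, component: candidate_ms} for spans in traces]
    return trace_p99(swapped)


# Hypothetical baseline: per-component durations (ms) for three traces.
baseline = [
    {"planner": 500, "retrieve_docs": 8000, "generate_response": 1200},
    {"planner": 450, "retrieve_docs": 900, "generate_response": 1100},
    {"planner": 520, "retrieve_docs": 1000, "generate_response": 1250},
]

before = trace_p99(baseline)
after = what_if_trace_p99(baseline, "retrieve_docs", candidate_ms=1500)
print(f"trace P99 before: {before} ms, estimated after swap: {after} ms")
```

Without per-component durations in the baseline, there is nothing to substitute into, which is why trace-level numbers alone cannot answer the question.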

Composition with other patterns

Caveats

  • Span-name discipline matters. If retrieve_docs is sometimes named retrieve and sometimes named docs_retrieval, aggregation fragments. Explicitly decorating entrypoints with @mlflow.trace(name="retrieve_docs") is a good practice.
  • Wall-clock vs CPU vs IO. Span duration is wall-clock; the bottleneck class (CPU-bound, IO-bound, network-bound) requires additional context.
  • Concurrency obscures causality. Parallel spans complicate the "which step is slow" question; the pattern works best when spans are mostly sequential.
  • Cardinality cost. For agents with many distinct tool names, the GROUP BY can be expensive on huge windows. Pre-aggregation (materialized views, hourly rollups) is the typical scale-out path.
  • Native dashboards may be enough. The Databricks post explicitly says "For most teams, that's enough to monitor day-to-day agent health." This pattern is for teams whose use cases exceed the native views.
  • Span-name leakage. Span names sometimes embed user-supplied identifiers (URLs, IDs); if so, cardinality balloons and aggregation breaks. Span-name normalisation discipline is a prerequisite.
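The two span-name caveats above (inconsistent names and leaked identifiers) can both be handled by normalising names before grouping. The following is a minimal sketch under stated assumptions: the alias table and the placeholder tokens (`<url>`, `<id>`, `<n>`) are hypothetical conventions, and the regular expressions cover only URLs, UUIDs, and standalone integers.

```python
import re

# Hypothetical alias table mapping known variants to one canonical span name.
ALIASES = {
    "retrieve": "retrieve_docs",
    "docs_retrieval": "retrieve_docs",
}

_URL = re.compile(r"https?://\S+")
_UUID = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)
_NUMBER = re.compile(r"\b\d+\b")


def normalize_span_name(raw):
    """Strip embedded identifiers, then map known variants to a canonical name."""
    name = _URL.sub("<url>", raw.strip())
    name = _UUID.sub("<id>", name)      # UUIDs first, before their digits are touched
    name = _NUMBER.sub("<n>", name)     # standalone integers such as numeric IDs
    return ALIASES.get(name, name)


for raw in [
    "docs_retrieval",
    "fetch https://example.com/a?b=1",
    "get_order 123e4567-e89b-12d3-a456-426614174000",
    "load_user 12345",
]:
    print(f"{raw!r} -> {normalize_span_name(raw)!r}")
```

One trade-off to be aware of: replacing every standalone integer also merges names that differ only by a meaningful number (for example "LLM call 1" and "LLM call 2"), so the number rule may need to be narrowed for such spans.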

Seen in
