
PATTERN Cited by 3 sources

Telemetry to Lakehouse

Telemetry to Lakehouse is the pattern of landing operational / tool / agent telemetry directly into governed open-table-format tables (typically Delta Lake or Iceberg) instead of an APM / observability-vendor sidecar — so the telemetry becomes a first-class Lakehouse dataset joinable with business data.

Mechanics

  • Clients emit OpenTelemetry metrics + traces (standard protocol, avoids vendor lock).
  • A managed ingestion pipeline writes OTel data into Lakehouse-resident tables (Unity-Catalog-managed Delta tables in the Databricks instance).
  • Tables are governed under the same catalog / IAM / audit posture as the rest of the enterprise's data — one policy surface for telemetry + business data.
  • Analysts query telemetry with the same SQL + dashboard tooling as HR, finance, CRM data.
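The ingestion step can be sketched as "one row per span" flattening. This is a minimal sketch that assumes OTLP spans have already been decoded into Python dicts shaped after the OTLP/JSON encoding (resourceSpans → scopeSpans → spans). `sqlite3` stands in for a Delta table, and the table name `otel_spans`, its columns, and the sample payload are hypothetical illustrations, not the Unity Catalog schema.

```python
import json
import sqlite3

def _flatten_attributes(kvs):
    """Turn OTLP/JSON [{"key": k, "value": {"stringValue": v}}, ...] into {k: v}."""
    out = {}
    for kv in kvs or []:
        any_value = kv.get("value", {})
        # An OTLP AnyValue carries one populated field (stringValue, intValue, ...).
        out[kv["key"]] = next(iter(any_value.values()), None)
    return out

def flatten_spans(payload):
    """Yield one row per span, copying resource attributes (e.g. service.name) onto each row."""
    for resource_spans in payload.get("resourceSpans", []):
        resource = _flatten_attributes(resource_spans.get("resource", {}).get("attributes"))
        for scope_spans in resource_spans.get("scopeSpans", []):
            for span in scope_spans.get("spans", []):
                start_ns = int(span["startTimeUnixNano"])
                end_ns = int(span["endTimeUnixNano"])
                yield (
                    span["traceId"],
                    span["spanId"],
                    span.get("parentSpanId", ""),
                    span["name"],
                    resource.get("service.name"),
                    (end_ns - start_ns) / 1e6,              # duration in milliseconds
                    span.get("status", {}).get("code", 0),  # 0 UNSET, 1 OK, 2 ERROR
                    json.dumps(_flatten_attributes(span.get("attributes"))),
                )

def ingest(conn, payload):
    """Append every span in one decoded OTLP payload to a span table; return rows written."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS otel_spans (
               trace_id TEXT, span_id TEXT, parent_span_id TEXT, name TEXT,
               service_name TEXT, duration_ms REAL, status_code INTEGER,
               attributes_json TEXT)"""
    )
    rows = list(flatten_spans(payload))
    conn.executemany("INSERT INTO otel_spans VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)

# A tiny hand-written payload in the OTLP/JSON shape (values are illustrative).
SAMPLE_PAYLOAD = {
    "resourceSpans": [{
        "resource": {"attributes": [
            {"key": "service.name", "value": {"stringValue": "coding-agent"}}]},
        "scopeSpans": [{"spans": [{
            "traceId": "t1", "spanId": "s1", "name": "llm.call",
            "startTimeUnixNano": "1000000", "endTimeUnixNano": "251000000",
            "attributes": [{"key": "user.email", "value": {"stringValue": "a@example.com"}}],
        }]}],
    }]
}
```

Copying resource attributes onto every span row trades some storage for join-friendliness: analysts can filter or join on `service_name` without reconstructing the OTLP nesting.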

Canonical instance: Unity AI Gateway (2026-04-17)

Databricks' coding-agent post states the design explicitly: "With our OpenTelemetry ingestion, coding tool metrics and traces are automatically centralized to Unity Catalog-managed Delta tables." (Source: sources/2026-04-17-databricks-governing-coding-agent-sprawl-with-unity-ai-gateway.)

Three use cases named in the post:

  • Track adoption per org: "Join AI Gateway metrics with Workday to map GenAI adoption by department, region, or seniority, helping identify where to target enablement."
  • Quantify developer velocity: "A 20% increase in token usage per developer drove a 15% reduction in pull request cycle time, directly linking AI tool usage to increased developer velocity."
  • Proactive capacity planning: "Monitor users hitting rate limits to data-justify securing additional capacity or dedicated throughput before productivity is throttled."

Each of these requires joining gateway telemetry with a non-telemetry dataset (HR org chart, CI/CD PR metrics, capacity-cost tables). That join is the point of the pattern.
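The adoption and capacity use cases both reduce to a join between gateway telemetry and an HR extract. The sketch below assumes two hypothetical tables, `gateway_requests` (one row per request, with token count and HTTP status) and `hr_org` (a Workday-style user → department export), and uses `sqlite3` in place of the lakehouse SQL engine. Treating HTTP 429 as "hit a rate limit" is also an assumption for illustration.

```python
import sqlite3

def load_sample(conn):
    """Create two hypothetical tables: gateway requests and an HR org-chart extract."""
    conn.executescript(
        """
        CREATE TABLE gateway_requests (user_email TEXT, tokens INTEGER, http_status INTEGER);
        CREATE TABLE hr_org (user_email TEXT, department TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO gateway_requests VALUES (?, ?, ?)",
        [("a@example.com", 1000, 200), ("a@example.com", 0, 429),
         ("b@example.com", 2000, 200), ("c@example.com", 300, 200)],
    )
    conn.executemany(
        "INSERT INTO hr_org VALUES (?, ?)",
        [("a@example.com", "Eng"), ("b@example.com", "Eng"),
         ("c@example.com", "Finance"), ("d@example.com", "Sales")],
    )

def adoption_by_department(conn):
    """Per department: distinct active users, total tokens, and rate-limited (HTTP 429) requests."""
    return conn.execute(
        """
        SELECT h.department,
               COUNT(DISTINCT g.user_email)                          AS active_users,
               COALESCE(SUM(g.tokens), 0)                            AS total_tokens,
               SUM(CASE WHEN g.http_status = 429 THEN 1 ELSE 0 END)  AS rate_limited
        FROM hr_org AS h
        LEFT JOIN gateway_requests AS g ON g.user_email = h.user_email
        GROUP BY h.department
        ORDER BY h.department
        """
    ).fetchall()
```

Driving the join from the HR table with a `LEFT JOIN` keeps departments with zero usage (Sales in the sample) in the result, which is exactly the "where to target enablement" signal; an inner join from the telemetry side would silently drop them.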

Why Lakehouse instead of APM

  • APM vendors silo telemetry. Datadog / New Relic / Grafana Cloud store telemetry in their own backend; joining with Workday or Jira is ETL-out-of-APM, which breaks the freshness + governance story.
  • Business-value questions require business-data joins. "Are AI tools changing PR cycle time?" is a question about PR cycle time, not about LLM latency. APM can't answer it.
  • Governance is unified. Same RBAC, same retention policies, same audit log covers telemetry + revenue data + HR data. No "observability data is special" escape hatch.

Costs / caveats

  • Latency vs APM. Lakehouse query latency is seconds-to-minutes, not APM's sub-second. This pattern is for analytical telemetry questions, not real-time alerting. Real-time alerting typically still lives in an APM / Prometheus / Carnaval sidecar.
  • Schema discipline required. Telemetry evolves fast; making it useful to analysts means stable columns / enums.
  • Vendor-lock-on-lakehouse tradeoff. Avoids APM-vendor lock-in; trades it for lakehouse-vendor lock-in. Open table formats (Delta / Iceberg) mitigate.
  • OTel as the wire-protocol boundary keeps the pattern Lakehouse-vendor-portable: Databricks today, could be Snowflake or Trino + Iceberg tomorrow.
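One way to read the schema-discipline caveat is "normalize churny raw attributes into a small, stable row shape with closed enums before analysts see it." This is a minimal sketch under that reading; the attribute keys in `TOOL_KEYS`, the `Outcome` values, and the 429 → rate-limited mapping are all hypothetical choices, not part of any source pipeline.

```python
from dataclasses import dataclass
from enum import Enum

class Outcome(str, Enum):
    """Closed set of outcome values analysts can filter on."""
    OK = "ok"
    ERROR = "error"
    RATE_LIMITED = "rate_limited"

@dataclass(frozen=True)
class ToolCallRow:
    """Stable analyst-facing columns; raw attribute churn stays out of this shape."""
    tool: str
    outcome: Outcome
    latency_ms: float

# Hypothetical attribute spellings from different emitters, in precedence order.
TOOL_KEYS = ("tool.name", "gen_ai.tool.name", "tool")

def normalize(raw: dict) -> ToolCallRow:
    """Map one raw attribute dict onto the stable row; unknown keys are dropped."""
    tool = next((raw[k] for k in TOOL_KEYS if k in raw), "unknown")
    status = int(raw.get("http.status_code", 200))
    if status == 429:
        outcome = Outcome.RATE_LIMITED
    elif status >= 400 or raw.get("error"):
        outcome = Outcome.ERROR
    else:
        outcome = Outcome.OK
    return ToolCallRow(tool=str(tool), outcome=outcome, latency_ms=float(raw.get("latency_ms", 0.0)))
```

When an emitter renames an attribute, only `TOOL_KEYS` changes; the analyst-facing columns and enum values stay put, so saved queries and dashboards keep working.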

Relation to other patterns

Seen in

  • sources/2026-05-22-databricks-observability-any-agent-anywhere-otel-unity-catalog (third citation): agent-trace specialisation with managed single-sink ingest. The 2026-05-22 OTel-tracing launch ships OTLP/gRPC + REST direct-to-Delta ingest via a managed serverless engine (Zerobus Ingest) writing to UC OTel Trace Tables, with the explicit "single-sink" design point that "streams data directly to the lakehouse" and "entirely bypass[es] intermediate message buses like Kafka". Distinct from the 2026-04-17 metrics+traces face and the 2026-05-20 full-payload face, this is agent-side OTel spans / logs / metrics (one row per span in a trace), paired with MLflow extensions for assessments / feedback / expectations / run-links.
Six derived UC views: <prefix>_otel_spans, _otel_logs, _otel_metrics, _otel_annotations, _trace_unified, _trace_metadata. Operational disclosures: 200 QPS starting throughput, unbounded storage, MLflow per-experiment trace cap removed, auto liquid-clustering. Customers named at scale: Experian ("hundreds of thousands of traces"), Superhuman/Grammarly ("hundreds of thousands of traces per day", explicitly replacing a custom point solution), Smartsheet (two production agents in a "three-day co-build"), The Standard.
The post argues three SaaS-vs-lakehouse asymmetries: retention economics (object storage is cheaper than SaaS retention), the PII deadlock (no third-party data egress), and analytics-not-just-telemetry ("You can join traces with business data, such as revenue and conversions, to understand real impact and go beyond system health"). The substrate doubles as evaluation input: prod traces bootstrap MLflow eval datasets, and the same judges run continuously on live traces; see concepts/production-traces-as-evaluation-substrate and patterns/bootstrap-eval-dataset-from-production-traces. Per-span dashboards (per-tool latency P50/P99, per-tool error rate) ride on the same UC tables; see patterns/component-level-latency-from-otel-spans.
This citation specialises the pattern from "AI Gateway emits metrics+traces" and "AI Gateway captures full payloads" to "agents emit OTel spans direct to the lakehouse via a managed receiver" — the agent-side, single-sink shape. Concretised as patterns/managed-otel-ingestion-direct-to-lakehouse (the canonical sub-pattern), concepts/single-sink-telemetry-architecture (the structural shape), and concepts/instrumentation-storage-decoupling (OTel as the protocol-portable boundary).

  • sources/2026-05-20-databricks-governing-ai-agents-at-scale-with-unity-catalog: specialisation of the pattern from metrics+traces to full-payload audit. The 2026-05-20 four-pillars post extends Telemetry to Lakehouse with Inference Tables: "the exact prompt sent, the exact response returned, token counts and latency", written to UC-managed Delta tables. The architectural payoff is the same (joinable with business data, governed under one catalog, retainable on customer terms), but the data shape is verbatim request/response payloads rather than aggregated metrics or sampled traces. Canonicalised separately as patterns/inference-payload-table-for-audit (the full-payload variant) so the two scopes don't conflate. Same lakehouse argument: "Most logging architectures force a trade-off between completeness and cost, requiring you to sample, filter, and set short retention windows. Because Unity AI Gateway captures observability data in your lakehouse, you don't have to." The substrate also feeds Lakewatch (agentic SIEM).

  • sources/2026-04-17-databricks-governing-coding-agent-sprawl-with-unity-ai-gateway — Unity AI Gateway → Unity-Catalog-managed Delta tables via OpenTelemetry; first-class "telemetry-joined-with-Workday" framing.
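The 2026-05-22 citation mentions per-tool latency P50/P99 and per-tool error-rate dashboards riding on the span tables. The post does not say how those percentiles are computed; this sketch assumes hypothetical `(tool_name, duration_ms, status_code)` rows of the kind a span view would expose, uses the nearest-rank percentile definition, and counts OTLP status code 2 (ERROR) as a failure.

```python
import math
from collections import defaultdict

def nearest_rank(sorted_vals, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    # p * n / 100 (rather than p / 100 * n) avoids float drift at exact ranks like 99 of 100.
    k = max(1, math.ceil(p * len(sorted_vals) / 100))
    return sorted_vals[k - 1]

def per_tool_stats(spans):
    """spans: iterable of (tool_name, duration_ms, status_code); status_code 2 means ERROR."""
    durations_by_tool = defaultdict(list)
    errors_by_tool = defaultdict(int)
    for tool, duration_ms, status_code in spans:
        durations_by_tool[tool].append(duration_ms)
        if status_code == 2:
            errors_by_tool[tool] += 1
    stats = {}
    for tool, durations in durations_by_tool.items():
        durations.sort()
        stats[tool] = {
            "p50_ms": nearest_rank(durations, 50),
            "p99_ms": nearest_rank(durations, 99),
            "error_rate": errors_by_tool[tool] / len(durations),
        }
    return stats
```

In a lakehouse the same aggregation would normally be a `GROUP BY` over the span table; the Python version just makes the per-tool grouping and the percentile definition explicit.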