CONCEPT Cited by 2 sources

Lakehouse-native observability

Lakehouse-native observability is the architectural posture of storing observability data (metrics, logs, traces) as governed tables in a lakehouse built on columnar files in object storage (Delta Lake / Iceberg), rather than in purpose-built TSDBs / log-stores / trace-stores.

The defining payoff: observability data becomes a first-class analytical asset joinable with enterprise datasets (deploy logs, org metadata, feature-usage data, billing) under the same governance controls — not a siloed operational-only dataset.

What it replaces

A traditional observability stack has three purpose-built datastores:

  • A TSDB (Prometheus / Thanos / Cortex / Mimir) for metrics.
  • A log-store (Elasticsearch / Loki / Splunk) for logs.
  • A trace-store (Jaeger / Tempo / Honeycomb) for traces.

Each uses its own indexing scheme, query language, retention model, and cost structure. Cross-store joins are hard: "show me all log lines from tenant-X's pods during the metric spike at 14:32" requires coordinating three separate systems.

Lakehouse-native observability puts all three in Delta Lake tables, queryable via SQL and joinable with any other enterprise dataset.
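As a rough illustration of why a shared SQL surface matters, the "log lines during the metric spike" question above collapses into a single join once both signals are tables. The sketch below is not Databricks code: an in-memory SQLite database stands in for the lakehouse SQL engine, and the table names, column names, and the logs_during_spikes helper are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for a lakehouse SQL engine.
# Table and column names are hypothetical, chosen for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metrics (ts INTEGER, tenant TEXT, pod TEXT, name TEXT, value REAL);
CREATE TABLE logs    (ts INTEGER, pod TEXT, line TEXT);
INSERT INTO metrics VALUES
  (100, 'tenant-x', 'pod-a', 'cpu_usage', 0.40),
  (160, 'tenant-x', 'pod-a', 'cpu_usage', 0.97),
  (160, 'tenant-y', 'pod-b', 'cpu_usage', 0.98);
INSERT INTO logs VALUES
  (95,  'pod-a', 'request ok'),
  (158, 'pod-a', 'gc pause 900ms'),
  (159, 'pod-b', 'request ok'),
  (300, 'pod-a', 'request ok');
""")

def logs_during_spikes(conn, tenant, threshold, window_s):
    """Return log lines from the tenant's pods within window_s seconds
    of any cpu_usage sample above the threshold."""
    query = """
        SELECT l.ts, l.pod, l.line
        FROM metrics AS m
        JOIN logs AS l
          ON l.pod = m.pod
         AND l.ts BETWEEN m.ts - ? AND m.ts + ?
        WHERE m.tenant = ? AND m.name = 'cpu_usage' AND m.value > ?
        ORDER BY l.ts
    """
    return conn.execute(query, (window_s, window_s, tenant, threshold)).fetchall()

spike_logs = logs_during_spikes(conn, "tenant-x", 0.9, 30)
```

The same join could be extended to deploy logs or billing tables without leaving SQL, which is the point of the posture; a real deployment would also need partitioning or clustering on time to keep such joins cheap.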

What it enables

  • 50× cheaper storage than a TSDB for raw unaggregated data at the 20B-active-timeseries scale (Databricks Hydra canonical datum).
  • Arbitrary-cardinality queries — the columnar store scales with object-storage economics, not with in-memory index cost.
  • Joins against enterprise datasets — under the same Unity Catalog / Iceberg governance as business data.
  • Existing analytics tooling works — SQL notebooks, dashboards, ad-hoc queries, anomaly detection, export.
  • Preserved user interfaces via translation layers — Grafana / PromQL continues to work via PromQL-to-SQL, so user workflows are unchanged.
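To make the translation-layer idea concrete, here is a deliberately tiny sketch of one step such a layer would need: turning a PromQL instant-vector selector with equality matchers into a parameterized SQL query. It is not the Hydra translator. The schema (a metrics table with name, ts, value, and one column per label) and the selector_to_sql function are assumptions for illustration, and the parser ignores regex matchers, range vectors, functions, and label values containing commas.

```python
import re

# Matches `metric_name` optionally followed by `{...}`.
_SELECTOR = re.compile(r'^\s*([a-zA-Z_:][a-zA-Z0-9_:]*)\s*(?:\{(.*)\})?\s*$')
# Matches a single equality matcher: label="value".
_MATCHER = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*"([^"]*)"')

def selector_to_sql(promql, table="metrics"):
    """Translate `name{label="value",...}` into (sql, params).

    Assumes a hypothetical schema where each label is a column.
    Only equality matchers are supported.
    """
    m = _SELECTOR.match(promql)
    if m is None:
        raise ValueError(f"unsupported selector: {promql!r}")
    name, body = m.group(1), m.group(2) or ""
    clauses, params = ["name = ?"], [name]
    for part in filter(None, (p.strip() for p in body.split(","))):
        mm = _MATCHER.fullmatch(part)
        if mm is None:
            raise ValueError(f"unsupported matcher: {part!r}")
        label, value = mm.groups()
        # The label was validated as an identifier by the regex above,
        # so it is interpolated as a column name; the value stays a parameter.
        clauses.append(f"{label} = ?")
        params.append(value)
    sql = f"SELECT ts, value FROM {table} WHERE " + " AND ".join(clauses)
    return sql, params

sql, params = selector_to_sql('http_requests_total{job="api", tenant="x"}')
```

Selectors are the easy part; the semantics of functions such as rate() and histogram_quantile() are where a real translator spends most of its effort.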

What it costs

  • Freshness: streaming ingestion into Delta Lake typically lands in the seconds-to-minutes range, not sub-second. Canonical Hydra datum: ~5 minutes end-to-end, materially worse than Pantheon's real-time freshness.
  • Query latency: columnar scans over Delta tables are not as fast as in-memory TSDB queries for small, point-lookup workloads. Lakehouse shines on wide / analytical scans.
  • Translation-layer fidelity: PromQL over SQL has edge cases (range-vector semantics, rate() on partial data, histogram quantile arithmetic) that aren't trivially expressible.
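One of the fidelity edge cases above can be shown in a few lines: a counter that resets inside the window. A plain SQL-style "(last - first) / elapsed" gives a negative rate, whereas PromQL's rate() treats any drop as a restart from zero. The sketch below is a simplification that assumes (timestamp, value) samples sorted by time and omits the window-boundary extrapolation that Prometheus also applies; the function names are illustrative.

```python
def naive_rate(samples):
    """(last - first) / elapsed, the obvious SQL-style computation."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

def reset_aware_rate(samples):
    """Per-second increase that treats any drop as a counter reset.

    Simplified: no extrapolation to the window boundaries.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as increase since the reset.
        increase += (cur - prev) if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])

# (timestamp_seconds, counter_value); the counter resets between t=15 and t=30.
samples = [(0, 100.0), (15, 130.0), (30, 10.0), (45, 40.0), (60, 70.0)]

naive = naive_rate(samples)              # negative, which is wrong for a counter
corrected = reset_aware_rate(samples)    # 100 units of increase over 60 seconds
```

Expressing the reset-aware version in SQL typically needs a window function over consecutive samples per series, which is part of why the translation is not trivial.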

The dual-tier posture

The practical architecture is not lakehouse-only; it's lakehouse for raw unaggregated troubleshooting data + TSDB for aggregated real-time queries, unified at the user-facing metric-semantics layer so engineers don't distinguish between them. See patterns/dual-tier-observability-tsdb-plus-lakehouse.
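A minimal sketch of how a unifying layer might route queries between the two tiers follows. The routing rule, the MetricQuery and Route types, and the route function are hypothetical and not taken from the source; only the ~5 minute lakehouse freshness figure comes from the Hydra datum cited on this page.

```python
from dataclasses import dataclass

# ~5 minutes end-to-end lakehouse freshness (Hydra datum cited on this page).
LAKEHOUSE_FRESHNESS_S = 5 * 60

@dataclass(frozen=True)
class MetricQuery:
    expr: str          # user-facing expression, e.g. PromQL
    needs_raw: bool    # True when the query needs unaggregated series
    end_age_s: float   # how far in the past the query window ends (0 = now)

@dataclass(frozen=True)
class Route:
    tier: str                  # "tsdb" or "lakehouse"
    possibly_incomplete: bool  # window overlaps data that may not have landed yet

def route(q: MetricQuery) -> Route:
    """Hypothetical rule: aggregated queries go to the TSDB, raw ones to the lakehouse."""
    if not q.needs_raw:
        return Route("tsdb", False)
    # Raw data exists only in the lakehouse; flag windows that reach into
    # the most recent few minutes, where ingestion may still be in flight.
    return Route("lakehouse", q.end_age_s < LAKEHOUSE_FRESHNESS_S)

dashboard = route(MetricQuery("sum(rate(http_requests_total[5m]))", False, 0))
debugging = route(MetricQuery('http_requests_total{pod="pod-a"}', True, 60))
```

The engineer writes the same expression either way; the tier choice stays an implementation detail behind the metric-semantics layer.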

Seen in

  • sources/2026-05-22-databricks-observability-any-agent-anywhere-otel-unity-catalog (agent-trace specialisation): OTel direct-to-Delta via managed receiver. Where the 2026-05-05 Hydra disclosure was about high-cardinality time-series metrics in the lakehouse, the 2026-05-22 disclosure extends the posture to agent-side OpenTelemetry spans / logs / metrics stored in UC OTel Trace Tables via Zerobus Ingest. Three SaaS-vs-lakehouse asymmetries are argued verbatim: "retention economics" (object storage cheaper than SaaS), "the PII deadlock" (no third-party data egress), and "analytics, not just telemetry" (joinable with business data). The governance-inheritance argument: "By storing it in Unity Catalog, traces inherit fine-grained access controls, from catalog and schema permissions to column masking and row-level filtering, enabling secure, production-ready analytics without limiting flexibility." Lakehouse-resident trace data also doubles as evaluation-dataset substrate (see concepts/production-traces-as-evaluation-substrate): "One effective approach is to bootstrap this dataset from real traces. Because these prompts originate from actual user interactions, they better represent the scenarios your agent must handle compared to purely synthetic test cases." Operational disclosures: 200 QPS starting throughput, unbounded storage, MLflow per-experiment trace cap removed, auto liquid-clustering. Customers operating at scale named: Experian, Superhuman/Grammarly, SmartSheet, The Standard. Composes with concepts/single-sink-telemetry-architecture (the ingest-side shape), concepts/instrumentation-storage-decoupling (OTel as protocol-portable boundary), patterns/managed-otel-ingestion-direct-to-lakehouse (the sub-pattern), patterns/telemetry-to-lakehouse (the generalisation), patterns/component-level-latency-from-otel-spans (a query-side dashboard pattern over _otel_spans).

  • sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — canonical instance. "Our key insight: the Databricks lakehouse is a perfect fit! It decouples storage (cheap object storage + Delta Lake) from compute (streaming + query clusters) and is massively scalable on both dimensions." Hydra: 20B active unaggregated timeseries, ~5 min freshness, 50× cheaper storage than Thanos, queryable via PromQL-to-SQL from Grafana or directly via Databricks SQL. "This turns observability data into a first-class analytical asset rather than an isolated monitoring silo."

Last updated · 547 distilled / 1,650 read