
Lakehouse-native observability

Lakehouse-native observability is the architectural posture of storing observability data (metrics, logs, traces) as governed tables in a columnar-object-storage lakehouse (Delta Lake / Iceberg) rather than in purpose-built TSDBs / log-stores / trace-stores.

The defining payoff: observability data becomes a first-class analytical asset joinable with enterprise datasets (deploy logs, org metadata, feature-usage data, billing) under the same governance controls — not a siloed operational-only dataset.

What it replaces

A traditional observability stack has three purpose-built datastores:

  • A TSDB (Prometheus / Thanos / Cortex / Mimir) for metrics.
  • A log-store (Elasticsearch / Loki / Splunk) for logs.
  • A trace-store (Jaeger / Tempo / Honeycomb) for traces.

Each uses its own indexing scheme, query language, retention model, and cost structure. Cross-store joins are hard: "show me all log lines from tenant-X's pods during the metric spike at 14:32" requires coordinating three separate systems.

Lakehouse-native observability puts all three in Delta Lake tables, queryable via SQL and joinable with any other enterprise dataset.
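The cross-store join from above ("all log lines from tenant-X's pods during the metric spike at 14:32") collapses into one SQL statement once metrics and logs share a table substrate. A minimal, runnable sketch using sqlite3 as a stand-in for the lakehouse SQL engine; the schemas and values are hypothetical, not Hydra's:

```python
import sqlite3

# Hypothetical schema: one governed table per signal, keyed by timestamp
# and shared dimensions (tenant, pod). Names are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE metrics (ts TEXT, tenant TEXT, pod TEXT, name TEXT, value REAL);
CREATE TABLE logs    (ts TEXT, tenant TEXT, pod TEXT, line TEXT);
INSERT INTO metrics VALUES
  ('14:31', 'tenant-X', 'pod-1', 'cpu', 0.4),
  ('14:32', 'tenant-X', 'pod-1', 'cpu', 9.7);
INSERT INTO logs VALUES
  ('14:32', 'tenant-X', 'pod-1', 'OOMKilled worker'),
  ('14:32', 'tenant-Y', 'pod-9', 'routine heartbeat');
""")

# The question that needed three coordinated systems becomes one join.
rows = con.execute("""
    SELECT l.line
    FROM logs l
    JOIN metrics m
      ON m.ts = l.ts AND m.tenant = l.tenant AND m.pod = l.pod
    WHERE m.name = 'cpu' AND m.value > 5.0
""").fetchall()
print(rows)  # -> [('OOMKilled worker',)]
```

The same join shape extends to enterprise tables (deploy logs, org metadata) because they live under the same catalog.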

What it enables

  • 50× cheaper storage than a TSDB for raw unaggregated data at the 20B-active-timeseries scale (Databricks Hydra canonical datum).
  • Arbitrary-cardinality queries — the columnar store scales with object-storage economics, not with in-memory index cost.
  • Joins against enterprise datasets — under the same Unity Catalog / Iceberg governance as business data.
  • Existing analytics tooling works — SQL notebooks, dashboards, ad-hoc queries, anomaly detection, export.
  • Preserved user interfaces via translation layers — Grafana / PromQL continues to work via PromQL-to-SQL, so user workflows are unchanged.
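To make the translation-layer idea concrete, here is a toy sketch of turning a bare PromQL instant-vector selector into SQL over a hypothetical wide metrics table. Real translators (like the layer described above) handle vastly more: range vectors, functions, operators, subqueries. The table and column names are assumptions:

```python
import re

def promql_selector_to_sql(query: str, table: str = "metrics") -> str:
    """Translate a PromQL selector like 'up{job="api"}' into a SQL
    filter over a hypothetical table (metric_name, <labels>, ts, value).
    Toy sketch: no range vectors, functions, or operators."""
    m = re.fullmatch(r'(\w+)\{(.*)\}', query.strip())
    name, raw_labels = m.group(1), m.group(2)
    preds = [f"metric_name = '{name}'"]
    for label in filter(None, raw_labels.split(',')):
        key, val = label.split('=')
        # Swap PromQL's double quotes for SQL single quotes.
        preds.append(f"{key.strip()} = {val.strip().replace(chr(34), chr(39))}")
    return f"SELECT ts, value FROM {table} WHERE " + " AND ".join(preds)

print(promql_selector_to_sql('up{job="api"}'))
# SELECT ts, value FROM metrics WHERE metric_name = 'up' AND job = 'api'
```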

What it costs

  • Freshness: streaming ingestion into Delta Lake typically lands data in seconds to minutes, not sub-second. Canonical Hydra datum: ~5 minutes end-to-end, materially worse than Pantheon's real-time freshness.
  • Query latency: columnar scans over Delta tables are not as fast as in-memory TSDB queries for small, point-lookup workloads. Lakehouse shines on wide / analytical scans.
  • Translation-layer fidelity: PromQL over SQL has edge cases (range-vector semantics, rate() on partial data, histogram quantile arithmetic) that aren't trivially expressible.
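The rate()-on-partial-data edge case is easy to demonstrate numerically. Prometheus computes the rate over the span the samples actually cover (and extrapolates toward the window edges), while a naive SQL translation divides the delta by the full window length. With samples covering only half the window, the two disagree by 2×. The numbers below are an illustrative toy, not a Hydra measurement:

```python
# Counter samples inside a 60 s query window, but only covering its
# last 30 s (partial data). Tuples are (seconds, counter_value).
samples = [(30, 100.0), (45, 130.0), (60, 160.0)]
window = 60.0

# Naive SQL-style rate: delta divided by the full window length.
naive = (samples[-1][1] - samples[0][1]) / window            # 1.0/s

# Prometheus-style rate(): delta over the covered span (simplified;
# real rate() also extrapolates toward the window boundaries).
covered = samples[-1][0] - samples[0][0]
promql_like = (samples[-1][1] - samples[0][1]) / covered     # 2.0/s

print(naive, promql_like)  # 1.0 2.0 -- same data, 2x disagreement
```

A faithful translation layer has to reproduce the extrapolation semantics in SQL, which is exactly the kind of non-trivial edge case the bullet above refers to.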

The dual-tier posture

The practical architecture is not lakehouse-only; it's lakehouse for raw unaggregated troubleshooting data + TSDB for aggregated real-time queries, unified at the user-facing metric-semantics layer so engineers don't distinguish between them. See patterns/dual-tier-observability-tsdb-plus-lakehouse.
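The unification at the metric-semantics layer can be sketched as a query router: recent, aggregated queries go to the TSDB; raw-series or historical queries go to the lakehouse. The threshold and naming here are assumptions for illustration, not Databricks' actual policy:

```python
from datetime import timedelta

# Assumed TSDB retention horizon; real systems tune this per workload.
TSDB_RETENTION = timedelta(hours=24)

def route(query_age: timedelta, needs_raw_series: bool) -> str:
    """Decide which tier serves a query, invisibly to the engineer."""
    if needs_raw_series or query_age > TSDB_RETENTION:
        return "lakehouse"  # cheap scans, full cardinality, minutes freshness
    return "tsdb"           # sub-second, pre-aggregated, limited retention

print(route(timedelta(minutes=5), needs_raw_series=False))  # tsdb
print(route(timedelta(days=30), needs_raw_series=False))    # lakehouse
```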

Seen in

  • sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — canonical instance. "Our key insight: the Databricks lakehouse is a perfect fit! It decouples storage (cheap object storage + Delta Lake) from compute (streaming + query clusters) and is massively scalable on both dimensions." Hydra: 20B active unaggregated timeseries, ~5 min freshness, 50× cheaper storage than Thanos, queryable via PromQL-to-SQL from Grafana or directly via Databricks SQL. "This turns observability data into a first-class analytical asset rather than an isolated monitoring silo."