CONCEPT Cited by 1 source

Lakehouse-native observability¶

Lakehouse-native observability is the architectural posture of storing observability data (metrics, logs, traces) as governed tables in a columnar-object-storage lakehouse (Delta Lake / Iceberg) rather than in purpose-built TSDBs / log-stores / trace-stores.

The defining payoff: observability data becomes a first-class analytical asset joinable with enterprise datasets (deploy logs, org metadata, feature-usage data, billing) under the same governance controls — not a siloed operational-only dataset.

What it replaces¶

A traditional observability stack has three purpose-built datastores:

A TSDB (Prometheus / Thanos / Cortex / Mimir) for metrics.
A log-store (Elasticsearch / Loki / Splunk) for logs.
A trace-store (Jaeger / Tempo / Honeycomb) for traces.

Each uses its own indexing scheme, query language, retention model, and cost structure. Cross-store joins are hard: "show me all log lines from tenant-X's pods during the metric spike at 14:32" requires coordinating three separate systems.

Lakehouse-native observability puts all three in Delta Lake tables, queryable via SQL and joinable with any other enterprise dataset.

What it enables¶

50× cheaper storage than a TSDB for raw unaggregated data at the 20B-active-timeseries scale (Databricks Hydra canonical datum).
Arbitrary-cardinality queries — the columnar store scales with object-storage economics, not with in-memory index cost.
Joins against enterprise datasets — under the same Unity Catalog / Iceberg governance as business data.
Existing analytics tooling works — SQL notebooks, dashboards, ad-hoc queries, anomaly detection, export.
Preserved user interfaces via translation layers — Grafana / PromQL continues to work via PromQL-to-SQL, so user workflows are unchanged.

What it costs¶

Freshness: streaming ingestion into Delta Lake typically lands in the minutes-to-seconds range, not sub-second. Canonical Hydra datum: ~5 minutes end-to-end, materially worse than Pantheon's real-time.
Query latency: columnar scans over Delta tables are not as fast as in-memory TSDB queries for small, point-lookup workloads. Lakehouse shines on wide / analytical scans.
Translation-layer fidelity: PromQL over SQL has edge cases (range-vector semantics, rate() on partial data, histogram quantile arithmetic) that aren't trivially expressible.

The dual-tier posture¶

The practical architecture is not lakehouse-only; it's lakehouse for raw unaggregated troubleshooting data + TSDB for aggregated real-time queries, unified at the user-facing metric-semantics layer so engineers don't distinguish between them. See patterns/dual-tier-observability-tsdb-plus-lakehouse.

Seen in¶

sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — canonical instance. "Our key insight: the Databricks lakehouse is a perfect fit! It decouples storage (cheap object storage + Delta Lake) from compute (streaming + query clusters) and is massively scalable on both dimensions." Hydra: 20B active unaggregated timeseries, ~5 min freshness, 50× cheaper storage than Thanos, queryable via PromQL-to-SQL from Grafana or directly via Databricks SQL. "This turns observability data into a first-class analytical asset rather than an isolated monitoring silo."

systems/hydra — canonical lakehouse-native observability platform
systems/delta-lake
systems/apache-spark
systems/databricks-auto-loader
systems/databricks
systems/pantheon — TSDB sibling
systems/grafana — query UI
concepts/metric-cardinality
concepts/unified-metric-semantics
concepts/promql-to-sql-translation
concepts/observability
patterns/dual-tier-observability-tsdb-plus-lakehouse
patterns/promql-to-sql-over-delta-tables