Lakehouse-native observability¶
Lakehouse-native observability is the architectural posture of storing observability data (metrics, logs, traces) as governed tables in a lakehouse (columnar open table formats such as Delta Lake / Iceberg over object storage) rather than in purpose-built TSDBs / log-stores / trace-stores.
The defining payoff: observability data becomes a first-class analytical asset joinable with enterprise datasets (deploy logs, org metadata, feature-usage data, billing) under the same governance controls — not a siloed operational-only dataset.
What it replaces¶
A traditional observability stack has three purpose-built datastores:
- A TSDB (Prometheus / Thanos / Cortex / Mimir) for metrics.
- A log-store (Elasticsearch / Loki / Splunk) for logs.
- A trace-store (Jaeger / Tempo / Honeycomb) for traces.
Each uses its own indexing scheme, query language, retention model, and cost structure. Cross-store joins are hard: "show me all log lines from tenant-X's pods during the metric spike at 14:32" requires coordinating three separate systems.
Lakehouse-native observability puts all three in Delta Lake tables, queryable via SQL and joinable with any other enterprise dataset.
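The tri-store coordination problem above collapses into one query once metrics and logs are tables. A minimal sketch of the "log lines from tenant-X's pods during the metric spike" question as a single SQL join; the table and column names (`obs.metrics`, `obs.logs`, `tenant_id`, `pod`, `ts`) are hypothetical, not Hydra's actual schema:

```python
# Sketch: the cross-signal troubleshooting question as one SQL query over
# lakehouse tables. All table/column names are illustrative assumptions.
# In practice the query would be parameterized, not built via f-strings.

def spike_logs_query(tenant: str, spike_start: str, spike_end: str) -> str:
    """Build a SQL join: logs from pods whose CPU metric spiked in the window."""
    return f"""
    SELECT l.ts, l.pod, l.message
    FROM obs.logs AS l
    JOIN (
        SELECT DISTINCT pod
        FROM obs.metrics
        WHERE tenant_id = '{tenant}'
          AND metric_name = 'cpu_usage'
          AND ts BETWEEN '{spike_start}' AND '{spike_end}'
    ) AS spiking ON l.pod = spiking.pod
    WHERE l.tenant_id = '{tenant}'
      AND l.ts BETWEEN '{spike_start}' AND '{spike_end}'
    ORDER BY l.ts
    """

sql = spike_logs_query("tenant-X", "2026-05-05 14:30", "2026-05-05 14:37")
```

In the traditional stack this requires a PromQL query, a log-store query, and manual correlation of the results; here it is one statement under one governance layer.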
What it enables¶
- 50× cheaper storage than a TSDB for raw unaggregated data at the 20B-active-timeseries scale (Databricks Hydra canonical datum).
- Arbitrary-cardinality queries — the columnar store scales with object-storage economics, not with in-memory index cost.
- Joins against enterprise datasets — under the same Unity Catalog / Iceberg governance as business data.
- Existing analytics tooling works — SQL notebooks, dashboards, ad-hoc queries, anomaly detection, export.
- Preserved user interfaces via translation layers — Grafana / PromQL continues to work via PromQL-to-SQL, so user workflows are unchanged.
What it costs¶
- Freshness: streaming ingestion into Delta Lake typically lands in seconds to minutes, not sub-second. Canonical Hydra datum: ~5 minutes end-to-end, materially worse than Pantheon's real-time path.
- Query latency: columnar scans over Delta tables are not as fast as in-memory TSDB queries for small, point-lookup workloads. Lakehouse shines on wide / analytical scans.
- Translation-layer fidelity: PromQL over SQL has edge cases (range-vector semantics, rate() on partial data, histogram quantile arithmetic) that aren't trivially expressible in SQL.
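One concrete fidelity gap: PromQL's `rate()` handles counter resets (a monotonic counter dropping back toward zero on process restart) and extrapolates to the window edges, while a naive SQL translation of "last minus first over the window" does neither. A Python sketch of the delta logic (standing in for the generated SQL, with invented sample data):

```python
# Naive rate(): (last - first) / window_seconds over raw counter samples.
# This is what a straightforward SQL window-function translation computes.
# PromQL additionally detects counter resets; the naive version goes
# negative (nonsense for a counter) when a reset falls mid-window.

def naive_rate(samples, window_seconds):
    """samples: list of (ts, value) for one series, sorted by ts."""
    return (samples[-1][1] - samples[0][1]) / window_seconds

def reset_aware_rate(samples, window_seconds):
    """Mimic PromQL's reset handling: a value drop means the counter restarted."""
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # On a reset, the post-restart value itself is the increase since 0.
        increase += value - prev if value >= prev else value
        prev = value
    return increase / window_seconds

samples = [(0, 100.0), (30, 160.0), (60, 20.0)]  # reset between t=30 and t=60
```

Here `naive_rate(samples, 60)` is negative while `reset_aware_rate(samples, 60)` correctly counts the pre-reset and post-reset increases; a faithful PromQL-to-SQL layer has to encode the reset-aware form (and the boundary extrapolation, which this sketch omits).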
The dual-tier posture¶
The practical architecture is not lakehouse-only; it's lakehouse for raw unaggregated troubleshooting data + TSDB for aggregated real-time queries, unified at the user-facing metric-semantics layer so engineers don't distinguish between them. See patterns/dual-tier-observability-tsdb-plus-lakehouse.
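Unifying the two tiers at the metric-semantics layer implies a router that sends each query to the tier that can serve it. A minimal sketch; the freshness floor echoes the ~5-minute Hydra datum, but the routing policy itself is invented for illustration, not Hydra/Pantheon's actual logic:

```python
# Route a metric query to the TSDB tier (aggregated, real-time) or the
# lakehouse tier (raw, unaggregated, ~minutes fresh). The policy below is
# an illustrative assumption, not the documented Hydra/Pantheon design.

FRESHNESS_FLOOR_S = 300  # lakehouse data lands roughly 5 minutes behind

def route(query_end_age_s: float, needs_raw_series: bool) -> str:
    """Pick a backend for one query.

    query_end_age_s: how far in the past the query's newest timestamp is.
    needs_raw_series: True if the query needs unaggregated series
                      (e.g. a high-cardinality troubleshooting drill-down).
    """
    if needs_raw_series:
        return "lakehouse"   # raw data only exists in the Delta tables
    if query_end_age_s < FRESHNESS_FLOOR_S:
        return "tsdb"        # only the TSDB tier has data this fresh
    return "lakehouse"       # historical aggregates: cheap columnar scans

# A live alerting query routes to the TSDB; a week-old cardinality
# drill-down routes to the lakehouse. The engineer sees neither decision.
```

The point of the shared semantics layer is exactly that this branch is invisible: the same metric name and query shape resolve correctly on either tier.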
Seen in¶
- sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — canonical instance. "Our key insight: the Databricks lakehouse is a perfect fit! It decouples storage (cheap object storage + Delta Lake) from compute (streaming + query clusters) and is massively scalable on both dimensions." Hydra: 20B active unaggregated timeseries, ~5 min freshness, 50× cheaper storage than Thanos, queryable via PromQL-to-SQL from Grafana or directly via Databricks SQL. "This turns observability data into a first-class analytical asset rather than an isolated monitoring silo."
Related¶
- systems/hydra — canonical lakehouse-native observability platform
- systems/delta-lake
- systems/apache-spark
- systems/databricks-auto-loader
- systems/databricks
- systems/pantheon — TSDB sibling
- systems/grafana — query UI
- concepts/metric-cardinality
- concepts/unified-metric-semantics
- concepts/promql-to-sql-translation
- concepts/observability
- patterns/dual-tier-observability-tsdb-plus-lakehouse
- patterns/promql-to-sql-over-delta-tables