Hydra (Databricks lakehouse observability)¶
Hydra is Databricks' lakehouse-native observability platform for raw, unaggregated, high-cardinality troubleshooting data, complementing Pantheon's aggregated-TSDB role. It ingests ~20 billion unaggregated active timeseries from "millions of nodes worldwide", achieves ~5-minute end-to-end data freshness, and stores data at ~50× lower cost than Thanos.
Hydra exists because aggregation (the cardinality shield in front of Pantheon — see patterns/aggregation-shield-for-tsdb-cardinality) drops exactly the dimensions engineers need during incidents ("which tenant is causing swap pressure, which node crashed, which shard is isolated, which workload is noisy"). Hydra provides the "needle in a haystack" raw-data surface that aggregation erases, without pushing that cardinality back onto the TSDB.
Architecture¶
Three enabling primitives:
- Apache Spark Structured Streaming continuous ingestion jobs that incrementally process metric data as it arrives and write to Delta Lake. Structured Streaming lets engineers write "streaming computations the same way you write batch jobs", with continuous, incremental processing and exactly-once semantics for reliable ingestion (a minimal ingestion sketch follows this list).
- Databricks Auto Loader as the Structured Streaming source — a high-throughput source that "tracks and incrementally processes new files without requiring manual listing or state management. Auto Loader automatically persists metadata about discovered files and scales to handle near-real-time arrival patterns." Critical at millions-of-files-per-region ingestion volume.
- Per-region partitioned ingestion — "independent streaming jobs across geographies". This enables each pipeline to autoscale independently, minimises cross-region latency, and reduces blast radius in case of failures.
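A minimal sketch of what one per-region ingestion job could look like, combining these three primitives. The file layout, schema, paths, and table names below are assumptions for illustration; the source post does not disclose them.

```python
# Sketch of one per-region Hydra-style ingestion job (illustrative only).
# Landing path, schema, and table name are assumptions, not disclosed details.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType, MapType)

spark = SparkSession.builder.getOrCreate()

# Assumed shape of raw metric samples landing as files in cloud storage.
metric_schema = StructType([
    StructField("metric_name", StringType()),
    StructField("labels", MapType(StringType(), StringType())),
    StructField("value", DoubleType()),
    StructField("timestamp", TimestampType()),
])

# Auto Loader ("cloudFiles") incrementally discovers and processes new files
# without manual listing or hand-rolled state management.
raw_metrics = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(metric_schema)
    .load("s3://metrics-landing/us-west-2/")   # hypothetical per-region landing path
)

# Continuous, incremental write to a governed Delta table; the checkpoint is
# what gives the pipeline exactly-once ingestion semantics across restarts.
(
    raw_metrics.writeStream
    .option("checkpointLocation", "s3://metrics-checkpoints/us-west-2/raw_metrics/")
    .trigger(processingTime="1 minute")
    .toTable("observability.raw_metrics_us_west_2")   # hypothetical Delta table
)
```

Each region runs its own copy of such a job against its own landing path, which is what lets the pipelines autoscale and fail independently.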
Canonical instance of concepts/lakehouse-native-observability: observability data stored as governed Delta tables inside the same Unity Catalog as business data, joinable with enterprise datasets under the same access controls.
User-facing interfaces¶
Querying through Grafana: Hydra integrates directly with Grafana via a PromQL-to-SQL conversion layer (canonical instance of patterns/promql-to-sql-over-delta-tables + concepts/promql-to-sql-translation). Engineers continue writing PromQL, using existing dashboards, drilling into labels — but queries execute against large-scale Delta tables rather than an in-memory TSDB. Interface stability is the load-bearing design constraint: the substrate changes (TSDB → lakehouse) but the user's workflow doesn't.
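The post does not disclose Hydra's actual translation rules or table schema; the sketch below only illustrates the kind of rewrite such a conversion layer performs for a PromQL instant query, assuming the raw-samples table from the ingestion sketch above plus a hypothetical series_id column that uniquely identifies each timeseries (metric name + label set).

```python
# Illustrative PromQL-to-SQL rewrite (not Hydra's disclosed implementation) for:
#   sum by (node) (container_memory_working_set_bytes{cluster="prod-1"})
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

result = spark.sql("""
    WITH lookback AS (
        SELECT series_id,
               labels,
               value,
               ROW_NUMBER() OVER (PARTITION BY series_id
                                  ORDER BY timestamp DESC) AS rn
        FROM   observability.raw_metrics_us_west_2
        WHERE  metric_name = 'container_memory_working_set_bytes'
          AND  labels['cluster'] = 'prod-1'
          AND  timestamp >= current_timestamp() - INTERVAL 5 MINUTES
    )
    -- PromQL instant-vector semantics: take the latest sample per series
    -- within the lookback window, then aggregate across series.
    SELECT labels['node'] AS node,
           SUM(value)     AS value
    FROM   lookback
    WHERE  rn = 1
    GROUP  BY labels['node']
""")
result.show()
```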
Direct SQL access in Databricks: for investigations that require deeper analysis — join metrics with deployment metadata, correlate with logs, wide time-range scans, anomaly detection, export for analytics — Hydra exposes the underlying Delta tables directly. Engineers query via Databricks SQL or notebooks. Because the data resides in the lakehouse, "it becomes joinable with other enterprise datasets and governed under the same security and access controls. This turns observability data into a first-class analytical asset rather than an isolated monitoring silo."
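A sketch of the kind of lakehouse-side investigation this enables: correlating raw per-node samples with a deployment-metadata table. Both table names, the metadata columns, and the chosen metric are hypothetical; the point is that this is an ordinary governed Delta join in a notebook, not a TSDB export.

```python
# Hypothetical notebook query: which recently rolled-out nodes show the worst
# swap pressure in the hour after their rollout started?
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

suspect_nodes = spark.sql("""
    SELECT m.labels['node']   AS node,
           d.release_version,
           MIN(m.value)       AS min_swap_free_bytes
    FROM   observability.raw_metrics_us_west_2 AS m
    JOIN   ops.node_deployments                AS d   -- hypothetical metadata table
           ON d.node = m.labels['node']
    WHERE  m.metric_name = 'node_memory_SwapFree_bytes'
      AND  m.timestamp BETWEEN d.rollout_started_at
                           AND d.rollout_started_at + INTERVAL 1 HOUR
    GROUP  BY m.labels['node'], d.release_version
    ORDER  BY min_swap_free_bytes ASC
    LIMIT  20
""")
suspect_nodes.show()
```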
Unified metric semantics: canonical instance of concepts/unified-metric-semantics. "A key design principle of Hydra is that engineers should not need to understand our ingestion architecture. Whether a metric is accessed through the TSDB-backed aggregated path, or the Lakehouse-backed raw metric path, the interface remains consistent. Metric names, label semantics, and metadata dimensions are unified across environments." Service teams emit metrics once; the platform handles aggregation / raw preservation / ingestion / storage / query routing.
Design-discipline: Critical User Journeys (CUJs)¶
"Building Hydra was not just an infrastructure challenge; it was an interface design challenge. From the beginning, we designed Hydra around Critical User Journeys (CUJs) for our engineers rather than around storage layers or ingestion pipelines." See concepts/critical-user-journey.
Operating numbers (2026-05)¶
- 20 billion unaggregated active timeseries.
- "Millions of nodes" emitting metrics.
- ~5-minute end-to-end data freshness (acknowledged gap vs. Pantheon's real-time; "improving the performance of Hydra so it achieves similar data freshness to Pantheon" is named future work).
- ~50× cheaper data storage than Thanos.
- Per-region independent streaming jobs — each autoscales + fails independently.
Caveats disclosed¶
- Freshness (~5 min) is materially worse than Pantheon's real-time; the roadmap is to close this gap so the two experiences converge.
- PromQL-to-SQL translation fidelity (range-vector semantics, rate() on partial data, histogram_quantile() arithmetic) is not disclosed in the post.
Seen in¶
- sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — canonical source; ingestion architecture, Grafana integration, unified metric semantics, scaling numbers all disclosed.
Related¶
- systems/pantheon — aggregated-TSDB sibling
- systems/thanos — upstream of Pantheon
- systems/delta-lake — storage substrate
- systems/apache-spark — ingestion compute
- systems/databricks-auto-loader — file-discovery substrate
- systems/grafana — query UI
- systems/databricks
- companies/databricks
- concepts/lakehouse-native-observability
- concepts/metric-cardinality
- concepts/promql-to-sql-translation
- concepts/unified-metric-semantics
- concepts/critical-user-journey
- patterns/dual-tier-observability-tsdb-plus-lakehouse
- patterns/promql-to-sql-over-delta-tables