
PATTERN

Dual-tier observability (TSDB + lakehouse)

At hyperscale, neither a TSDB alone nor a lakehouse alone is adequate for an observability platform:

  • TSDB alone: scales on object storage for cold data (see concepts/tiered-storage-hot-warm-cold), but its in-memory and on-disk tiers still scale with active cardinality, forcing either (a) a heavy aggregation tier that drops debugging dimensions (see patterns/aggregation-shield-for-tsdb-cardinality) or (b) an unbounded cost curve.
  • Lakehouse alone: can store arbitrary-cardinality raw data cheaply in object storage, but query latency is in the minutes range, not real-time. Alerting rules, live dashboards, and interactive PromQL-style queries need the TSDB's in-memory index to meet SLOs.

The solution is to run both tiers simultaneously, each tuned to its strengths, and unify them at the user-facing metric-semantics layer.

Shape

                    Aggregation shield (drops high-card labels)
  Applications  ─ emit ──────┤
   once                      │
                             ├──▶  TSDB (aggregated, real-time)
                             │     ~seconds freshness
                             │     alerting + live dashboards
                             └──▶  Lakehouse (raw, cheap, high-card)
                                   ~minutes freshness
                                   deep troubleshooting + analytics

Both tiers receive data from the same single emission interface (see concepts/unified-metric-semantics); users don't know which tier serves a given query, because the query router chooses based on query shape (cardinality, time range, aggregation level).
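A minimal sketch of what such a router could look like. All names here (`route_query`, `QueryShape`, the thresholds) are illustrative assumptions, not the platform's real API; a production router would estimate cardinality from the index and use capacity-planned limits.

```python
# Hypothetical query-router sketch: inspect the query's shape and pick
# the tier that can actually answer it. Names and thresholds are
# assumptions for illustration, not the real platform API.
from dataclasses import dataclass

@dataclass
class QueryShape:
    estimated_cardinality: int   # distinct series the query touches
    time_range_hours: float      # how far back the query reaches
    needs_raw_labels: bool       # asks for labels the aggregation shield drops

# Illustrative thresholds; real values come from capacity planning.
CARDINALITY_CEILING = 100_000
TSDB_RETENTION_HOURS = 24 * 14

def route_query(shape: QueryShape) -> str:
    """Pick a tier based on query shape alone; the user never chooses."""
    if shape.needs_raw_labels:
        return "lakehouse"       # the TSDB never stored those labels
    if shape.estimated_cardinality > CARDINALITY_CEILING:
        return "lakehouse"       # too many series for the in-memory index
    if shape.time_range_hours > TSDB_RETENTION_HOURS:
        return "lakehouse"       # beyond the TSDB retention window
    return "tsdb"                # real-time path: alerting, live dashboards

# A live dashboard panel over the last hour stays on the TSDB:
print(route_query(QueryShape(500, 1.0, False)))   # tsdb
```

The point of the sketch is that routing is a pure function of query shape; the same metric name can land on either tier depending on which labels and time range the query asks for.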

Division of responsibilities

| Property | TSDB tier (Pantheon) | Lakehouse tier (Hydra) |
| --- | --- | --- |
| Freshness | Real-time (sub-second) | ~5 minutes end-to-end |
| Cardinality ceiling | Bounded by aggregation rules | Unbounded (columnar scan) |
| Query latency | PromQL, milliseconds | SQL, seconds to minutes |
| Storage cost | Higher (in-memory heavy) | ~50× cheaper (columnar + object storage) |
| Primary workloads | Alerting + live dashboards | Incident triage + analytics |
| Joins with other data | Not natively | Native (Unity Catalog) |

Why unified metric semantics is load-bearing

The architectural discipline that makes the dual-tier split work is that users never think about which tier serves their query. They write metric_name{labels...} in PromQL or SQL, and the platform routes. Without this discipline, the split becomes a user-facing tax (which tool do I use? which tier has my data? which query language?) that would block adoption. See concepts/unified-metric-semantics.

Translation layers

The user-facing query surface for the lakehouse tier is not SQL-only: a PromQL-to-SQL translation layer lets Grafana dashboards and alert rules run unmodified against the lakehouse, keeping the dual-tier split invisible at the dashboard layer. See patterns/promql-to-sql-over-delta-tables.
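To make the idea concrete, here is a deliberately minimal sketch of translating one PromQL instant-vector selector with equality matchers into SQL. The table name, column names (`metrics`, `metric`, `value`, `ts`), and the `labels['key']` map-access syntax are assumptions for illustration; a real translator (per the referenced pattern) must also handle range vectors, regex matchers, and PromQL functions.

```python
# Hypothetical PromQL-to-SQL sketch for a single instant-vector selector
# with equality matchers only. Schema names are assumptions, not Hydra's.
import re

def promql_selector_to_sql(selector: str, table: str = "metrics") -> str:
    """Translate e.g. http_requests{job="api"} into a SQL query string."""
    m = re.fullmatch(r'(\w+)\{([^}]*)\}', selector.strip())
    if not m:
        raise ValueError(f"unsupported selector: {selector}")
    metric, raw_labels = m.groups()
    conditions = [f"metric = '{metric}'"]
    for pair in filter(None, raw_labels.split(",")):
        key, val = pair.split("=", 1)
        # Re-quote PromQL's double-quoted value as a SQL string literal.
        conditions.append(f"labels['{key.strip()}'] = '{val.strip().strip(chr(34))}'")
    return f"SELECT ts, value FROM {table} WHERE " + " AND ".join(conditions)

print(promql_selector_to_sql('http_requests{job="api"}'))
# SELECT ts, value FROM metrics WHERE metric = 'http_requests' AND labels['job'] = 'api'
```

Because the translation happens below the dashboard layer, the same Grafana panel definition can be pointed at either tier without edits.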

CUJ-first design

The dual-tier split is motivated by Critical User Journeys: the user-facing CUJs (live dashboard, alert, incident triage, analytics join) are each served optimally by a different tier, so the architecture splits to match them rather than compromising on a single tier.
