

Hedged observability stack

The pattern

Split the observability stack across multiple providers so that any single provider's outage produces degraded but functional observability, not a blackout. Typical split: self-host the data collection + storage tier (metrics, logs, traces) while outsourcing the dashboarding + alerting tier to a third-party vendor.

Canonical instance (Source: sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage):

"Despite self-hosting our observability data and stack, we still use a third-party provider for dashboarding and alerting needs. This provider was partially affected. We could still monitor metrics, but we were not getting alert notifications."

"Had we kept our entire observability stack on that service, we would have lost all our fleet-wide log searching capabilities, forcing us to fail over to another vendor with exponentially bigger cost ramifications given our scale."

Why the split

Observability at scale has three layers:

  1. Collection + storage — collectors, agents, buffers, time-series DB / log DB / trace DB. Expensive per TB at scale, so the cost motivation to self-host is strong.
  2. Query engine — the querying interface to the stored data. Typically tightly coupled to storage.
  3. Dashboarding + alerting — UI, dashboards, alert rule engine, notification delivery. Engineer-user-facing; UX matters; commodity SaaS offerings exist at reasonable cost.

The hedging insight: the failure modes of layer 3 are orthogonal to those of layers 1–2. A cloud outage might take down your self-hosted collectors (if they share infra with the affected app); a third-party UI/alerting vendor outage takes down only the dashboard and alerting path, not the data.

What "degraded" looks like

With the split, during third-party outage:

  • Metric collection continues — agents + storage still running.
  • Log collection continues — same.
  • Data is queryable — via direct access to self-hosted substrate (internal tools, API calls, CLI query).
  • Prebuilt dashboards unavailable — vendor UI down.
  • Alert notifications unavailable — vendor alerter down.
  • Team-shared dashboards unavailable — vendor UI down.

The critical property: operators can still triage. They just have to work harder (direct queries instead of dashboard glances, pull-monitoring instead of alert-driven).
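Triage-by-direct-query can be sketched as a small helper against the Prometheus HTTP API (the endpoint hostname and metric labels are illustrative assumptions, not from the source):

```python
import json
from urllib.parse import urlencode

# Hypothetical self-hosted Prometheus endpoint; adjust for your environment.
PROM_BASE = "http://prometheus.internal:9090"

def instant_query_url(promql: str, base: str = PROM_BASE) -> str:
    """Build the Prometheus HTTP API URL for an instant query --
    the direct-query path that survives a dashboard-vendor outage."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

def parse_instant_result(body: str) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from a Prometheus instant-query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return [(r["metric"], float(r["value"][1]))
            for r in payload["data"]["result"]]

# A canned response in the shape the Prometheus API returns.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "redpanda", "instance": "a:9644"},
             "value": [1749700000, "1"]},
        ],
    },
})
print(instant_query_url('up{job="redpanda"}'))
print(parse_instant_result(sample))
```

During a vendor UI outage, the same two functions work against the raw self-hosted API; nothing in the path touches the third-party provider.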

Redpanda's 2024 migration

Per the post, Redpanda moved to this hedged architecture "last year" (i.e., 2024):

"We moved to a self-managed observability stack last year, primarily due to increased scale and cost, and were only using a third-party service for dashboarding and alerting needs."

The motivation was cost at scale — but the reliability benefit showed up during the 2025-06-12 GCP / Cloudflare cascading outage. Post-hoc reliability payoff on a change made for economic reasons.

The three design variants

  1. Self-host data, third-party UI/alerts (Redpanda's shape) — cost-motivated primary; reliability bonus secondary.
  2. Self-host everything — maximum resilience, highest engineering burden. Suitable for largest operators with dedicated observability teams.
  3. Multi-vendor everything — dual-push metrics/logs to two vendors, dual-alert via two notification systems. High cost, high reliability. Suitable for tier-1 SaaS operators where alerting availability is a customer-facing SLO.

Most organisations land at variant 1; variant 3 is rare.
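Variant 3's dual-push reduces to a fan-out that tolerates any single vendor failure. A minimal sketch, with hypothetical sink callables standing in for two vendors' ingest clients:

```python
from typing import Callable, Sequence

def dual_push(sample: dict, sinks: Sequence[Callable[[dict], None]]) -> int:
    """Push one metric sample to every configured vendor sink (variant 3,
    multi-vendor everything). A single sink failure is tolerated; the push
    only counts as failed if all sinks reject it. Returns delivery count."""
    delivered, errors = 0, []
    for sink in sinks:
        try:
            sink(sample)
            delivered += 1
        except Exception as exc:  # vendor-specific transport failures
            errors.append(exc)
    if delivered == 0:
        raise RuntimeError(f"all sinks failed: {errors}")
    return delivered

# Illustrative fakes standing in for two vendors' ingest endpoints.
received_a: list[dict] = []
def vendor_a(sample: dict) -> None:
    received_a.append(sample)

def vendor_b(sample: dict) -> None:
    raise ConnectionError("vendor B ingest endpoint is down")

n = dual_push({"name": "request_latency_seconds", "value": 0.012},
              [vendor_a, vendor_b])
print(n)  # one delivery succeeded; observability degraded, not dark
```

The cost of this variant is exactly what the sketch shows: every sample is sent, stored, and billed twice.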

Failure-mode decomposition

The hedged stack has three distinct partial-failure modes instead of one full-failure mode:

Failure mode             | Observability posture              | Mitigation
Collection layer outage  | Fully blind on affected data       | Incident mode, no metrics
Storage layer outage     | Collection continues, queries fail | Buffer in collectors, wait for recovery
UI/alerting layer outage | Degraded — queryable, no alerts    | Direct-query fallback

The pattern trades fewer-but-larger outages for more-but-smaller ones. The math works when the smallest one (UI/alerting) has a cheap, pre-built fallback path.
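A toy availability model makes the trade concrete, assuming (purely for illustration; these are not figures from the post) independent per-layer outage probabilities:

```python
# Toy model: independent per-layer outage probabilities (illustrative
# assumptions, not measured numbers).
p_collection = 0.001   # self-hosted collection + storage down
p_vendor     = 0.005   # third-party UI/alerting vendor down

# Monolithic stack on the vendor: any vendor outage is a total blackout.
p_blackout_monolithic = p_vendor

# Hedged split: total blindness only if the self-hosted data layer is down;
# a vendor outage alone yields a degraded-but-queryable posture.
p_blackout_hedged = p_collection
p_degraded_hedged = (1 - p_collection) * p_vendor

print(p_blackout_monolithic, p_blackout_hedged, round(p_degraded_hedged, 6))
```

The blackout probability drops to that of the most reliable layer you control, at the price of a new (but survivable) degraded mode.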

The load-bearing fallback: direct query access

The pattern works only if engineers have direct query access to the self-hosted substrate. If all query paths are through the third-party UI, the split provides no value. Required:

  • Documented direct-query URLs / CLI commands — operators know how to query Prometheus / Loki / Tempo without the UI.
  • Baseline runbooks expressed in direct queries — e.g., a PromQL one-liner to check cluster-wide write-path health, pasteable into the raw Prometheus UI or promtool.
  • Write-access as well as read-access — operators should be able to tweak alert rules / scraping configs via API if needed.
  • Pre-built auxiliary dashboards — some teams keep a minimal self-hosted Grafana instance as failover.
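A runbook expressed in direct queries can be as small as a named map of PromQL one-liners rendered as copy-pasteable commands. A sketch, with hypothetical check names and illustrative metric names (not Redpanda's actual ones):

```python
# Hypothetical runbook: baseline health checks expressed as raw PromQL,
# usable without any vendor UI. Check names and queries are illustrative.
RUNBOOK = {
    "write_path_health": 'sum(rate(kafka_rpc_requests_total[5m])) by (cluster)',
    "disk_pressure":     'max(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) by (instance)',
    "scrape_liveness":   'count(up == 0)',
}

def as_curl(check: str, base: str = "http://prometheus.internal:9090") -> str:
    """Render a runbook entry as a copy-pasteable curl command against the
    raw Prometheus HTTP API -- the fallback path when the dashboard UI is down."""
    return (f"curl -sG {base}/api/v1/query "
            f"--data-urlencode 'query={RUNBOOK[check]}'")

for name in RUNBOOK:
    print(name, "->", as_curl(name))
```

Keeping this map in version control (rather than in the vendor's dashboard JSON) is what makes the fallback real rather than nominal.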

Secondary alerting channel

Alerting is the part that degrades the worst in the hedged shape, so a common addition is a secondary alerting channel:

  • Self-hosted Alertmanager as failover for the third-party alerter.
  • Dual-webhook PagerDuty routing — same alerts hit two channels via two paths.
  • SMS fallback — operator phones receive critical alerts independently of the primary notification vendor.
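All three options reduce to a fan-out in which no single transport failure blocks delivery. A minimal sketch, with fake transports (the channel names are illustrative) standing in for real ones during a vendor outage:

```python
from typing import Callable

def notify(alert: str, channels: dict[str, Callable[[str], None]]) -> list[str]:
    """Fan one alert out to every configured channel (primary vendor webhook,
    self-hosted Alertmanager, SMS gateway, ...). Returns the channels that
    accepted the alert; one dead channel must not block the others."""
    delivered = []
    for name, send in channels.items():
        try:
            send(alert)
            delivered.append(name)
        except Exception:
            continue
    return delivered

# Fakes standing in for real transports while the primary vendor is down.
inbox: list[str] = []
def vendor_webhook(alert: str) -> None:
    raise ConnectionError("vendor notification API is down")

channels = {
    "vendor_webhook": vendor_webhook,
    "self_hosted_alertmanager": inbox.append,
    "sms_gateway": inbox.append,
}
ok = notify("SEV: alert delivery degraded", channels)
print(ok)  # the two secondary channels still fire
```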

The Redpanda post does not disclose a specific secondary alerting channel; the implication is that loss of alerts was acceptable because the team was already in preemptive-SEV mode.

Anti-patterns

  • Hedge on paper, not in practice. Engineers have forgotten how to use the self-hosted tools; when the UI vendor is down, they are effectively blind despite the nominal hedge.
  • Dashboards only in the third-party UI. If every graph is built in the vendor's dashboard language, a vendor outage means the graphs are all gone regardless of where the data lives.
  • Alert rules only in the third-party alerter. Same shape: rule definitions exist only in the vendor, not exportable, no self-hosted fallback.
  • Self-hosted collectors pointing at third-party storage. The hedging is only real if the data layer is under your control.
  • No secondary alerting channel + no pull-monitoring discipline — alert loss during vendor outage becomes a real incident.

Caveats

  • Cost-quality-reliability trade. Self-hosting has real operational cost (team, infrastructure, on-call for the observability tier itself); the hedged pattern pays the incremental cost of adding self-host while keeping the third-party UX.
  • Skill drift risk. If engineers rely on the vendor UI 99% of the time, their direct-query skills atrophy; the 1% incident use is then harder.
  • Third-party vendor lock-in. Even with self-hosted data, dashboards / alert rules written in the vendor's DSL create migration cost if the vendor relationship ends.
  • Not a substitute for observability quality. A partial-failure-tolerant observability stack is still useless if the underlying metrics / logs / traces don't measure what matters.
  • Correlated failure risk is not zero. A global outage (e.g., a CDN or DNS-level event) could affect both self-hosted and third-party layers. The pattern reduces correlation, doesn't eliminate it.
