
CONCEPT

Observability stack partial dependency

Definition

An observability stack partial dependency is the failure-mode property of an observability architecture that splits responsibilities across providers, typically self-hosting the data collection and storage layer (metrics, logs, traces) while relying on a third party for dashboarding and alerting. Under a partial outage of the third-party vendor, the observability posture degrades but does not go blind: operators can still query data directly against the self-managed substrate, but lose push-notification alerting and pre-built dashboards.

Canonical instance (Source: sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage): during the 2025-06-12 GCP outage, Redpanda's third-party dashboarding/alerting vendor was affected. Verbatim:

"Despite self-hosting our observability data and stack, we still use a third-party provider for dashboarding and alerting needs. This provider was partially affected. We could still monitor metrics, but we were not getting alert notifications. We deemed the loss of alert notifications not critical since we were still able to assess the impact through other means, such as querying our self-managed metrics and logging stack."

The key property: degraded vs blind

The critical design property is partial degradation, not a binary choice between full failure and full availability. An observability stack fully hosted on a single vendor that has an outage is blind; a split-vendor stack is degraded: slower to detect, harder to triage, but still functional.

| Configuration | Full-vendor outage | Partial-vendor outage |
| --- | --- | --- |
| Fully-hosted single vendor | Blind (no metrics, no logs, no alerts) | Blind |
| Self-hosted data + self-hosted UI | Full function | Full function |
| Self-hosted data + third-party UI + alerts | Degraded (data queryable, alerts missing) | Degraded or full, depending on which vendor features fail (the 2025-06-12 Redpanda case: degraded, alerts lost) |

The split-vendor architecture is the cheaper-than-fully-self-hosted option that preserves non-blind behaviour during a third-party outage. Per the Redpanda post: "Had we kept our entire observability stack on that service, we would have lost all our fleet-wide log searching capabilities, forcing us to fail over to another vendor with exponentially bigger cost ramifications given our scale."
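The degraded-vs-blind comparison above can be captured as a tiny decision function. This is an illustrative sketch, not anything from the Redpanda post; the configuration flags and posture labels are ours.

```python
# Hypothetical sketch: the degraded-vs-blind table as a decision function.
# Flag names and posture labels are illustrative, not from the source post.

def posture(self_hosted_data: bool, self_hosted_ui: bool, vendor_down: bool) -> str:
    """Observability posture of a stack given where data/UI live and vendor state."""
    if not vendor_down:
        return "full"
    if not self_hosted_data:
        # Everything lives on the failed vendor: no metrics, no logs, no alerts.
        return "blind"
    if self_hosted_ui:
        # Data and presentation are both in-house; the vendor outage is irrelevant.
        return "full"
    # Data survives and is queryable, but vendor dashboards and alerts are gone.
    return "degraded"

# The three configurations from the table, under a vendor outage:
assert posture(self_hosted_data=False, self_hosted_ui=False, vendor_down=True) == "blind"
assert posture(self_hosted_data=True,  self_hosted_ui=True,  vendor_down=True) == "full"
assert posture(self_hosted_data=True,  self_hosted_ui=False, vendor_down=True) == "degraded"
```

The third assertion is the split-vendor configuration: under vendor outage it never returns "blind", which is the whole point of the partial dependency.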

Design axes

Where to split the observability stack depends on four axes:

  1. Cost per TB. At scale, vendor-hosted data ingestion and retention carry heavy per-TB pricing, which is why the data layer is the natural candidate for self-hosting; UI + alerting is cheap to outsource either way.
  2. Engineering capacity. Self-hosting all layers demands dedicated observability engineers; UI + alerting is outsource-ready.
  3. Query sophistication. A self-hosted substrate can be queried via any engine; a vendor UI constrains the query surface to what the vendor supports.
  4. Availability correlation. Splitting data from presentation keeps the failure modes largely uncorrelated: if your cloud goes down, the self-hosted stack on that cloud goes dark; if the vendor goes down, only alerting and dashboards go dark. The two rarely coincide.
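The availability-correlation axis can be made concrete with a back-of-envelope calculation. The probabilities below are assumed for illustration only; nothing in the source quantifies them.

```python
# Back-of-envelope sketch of the availability-correlation axis.
# Both outage rates are assumed numbers, purely illustrative.

p_cloud_down = 0.001   # assumed: self-hosted stack's cloud is down 0.1% of the time
p_vendor_down = 0.005  # assumed: dashboarding/alerting vendor is down 0.5% of the time

# Under the independence assumption, losing BOTH telemetry data and alerting
# at once is the product of two already-small probabilities.
p_both_down = p_cloud_down * p_vendor_down

print(f"{p_both_down:.6f}")  # 0.000005, roughly 2.6 minutes per year
```

The caveat, echoed under concepts/cross-cloud-dependency below, is that independence fails if the vendor runs on the same cloud being monitored; then the product underestimates the joint outage.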

Failure-mode decomposition

The split stack has three discrete degradation modes:

  • Alerting-only degradation: no push notifications. Operators must pull-monitor dashboards or run direct queries. Mitigated by a secondary alerting channel, or by escalating to a human on-call who pull-checks more aggressively.
  • Dashboarding-only degradation: no pre-built views. Operators must build ad-hoc queries against raw data. Slower triage but not blind.
  • Combined alerting + dashboarding degradation: both above. The Redpanda 2025-06-12 case.
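The three modes can be summarized as a small capability map. This is an illustrative encoding with names of our choosing, not terminology from the post; the invariant it checks is the section's claim that the split stack is never blind.

```python
# Illustrative sketch: the three degradation modes of the split stack and
# which operator capabilities survive in each. Mode and capability names
# are ours, not from the source post.

REMAINING_CAPABILITIES = {
    "alerting_only":     {"dashboards", "direct_queries"},  # must pull-monitor
    "dashboarding_only": {"alerts", "direct_queries"},      # ad-hoc queries, slower triage
    "combined":          {"direct_queries"},                # the Redpanda 2025-06-12 case
}

# The defining invariant: in every degradation mode the self-hosted
# substrate stays queryable, so the stack degrades but never goes blind.
assert all("direct_queries" in caps for caps in REMAINING_CAPABILITIES.values())
```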

Composes with other reliability practices

  • concepts/static-stability — the observability stack itself should be statically stable: self-hosted data collection should keep working when the vendor UI is down.
  • concepts/runtime-dependency-on-saas-provider — the vendor UI is an explicitly-acknowledged SaaS runtime dependency; the split stack narrows its failure-mode scope to presentation, not data.
  • patterns/hedged-observability-stack — the canonical pattern name for the split: self-host data, third-party UI, fallback alerting channel.
  • concepts/cross-cloud-dependency — a related axis: the third-party vendor might be hosted on the same cloud you're monitoring, creating correlated-failure risk that the split pattern aims to mitigate.
  • concepts/grey-failure — during partial-vendor outage the observability stack is itself in grey-failure mode: some paths work, others don't, operators have to navigate around the dead paths.

Anti-patterns

  • Single-vendor full observability stack — saves the operator the engineering cost of managing collectors, storage, and UI, at the price of a correlated fleet-wide observability outage when the vendor goes down.
  • Self-hosted everything — solves the correlated-failure problem at the price of dedicated engineering + expensive storage scaling.
  • Primary-alerts-on-third-party, no-secondary-channel — if the third-party goes down during an incident, the on-call pager is silent precisely when it shouldn't be. The split pattern requires a backup alerting channel (self-hosted Alertmanager, PagerDuty webhook redundancy, SMS fallback).
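The backup-channel requirement amounts to a simple delivery policy: try the primary vendor channel, and if it fails, fall back rather than letting the pager go silent. A minimal sketch, with stand-in callables in place of a real vendor webhook and a real self-hosted Alertmanager or SMS gateway:

```python
# Hedged sketch of a secondary alerting channel. The two senders are
# stand-in callables; in practice they would wrap a vendor webhook
# (primary) and a self-hosted Alertmanager or SMS gateway (fallback).

from typing import Callable

def page_oncall(alert: str,
                primary: Callable[[str], None],
                secondary: Callable[[str], None]) -> str:
    """Deliver an alert, preferring the primary channel; returns which delivered."""
    try:
        primary(alert)
        return "primary"
    except Exception:
        # Primary vendor is down: the pager must not go silent
        # precisely when an incident is in progress.
        secondary(alert)
        return "secondary"

# Simulate a vendor outage: the primary raises, the fallback delivers.
delivered = []
def broken_vendor(msg: str) -> None:
    raise ConnectionError("vendor outage")
def sms_fallback(msg: str) -> None:
    delivered.append(msg)

assert page_oncall("disk full on node-3", broken_vendor, sms_fallback) == "secondary"
assert delivered == ["disk full on node-3"]
```

The design choice worth noting: the fallback fires only on failure of the primary, so in steady state the secondary channel costs nothing, but it must be exercised (e.g. in game days) or its own latent breakage goes unnoticed.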

Caveats

  • Redpanda does not name the vendor. The post refers only to "a third-party provider for dashboarding and alerting" without identifying it.
  • The cloud-marketplace vendor is separate — the post names a second third-party "vendor we use for managing cloud marketplaces" that was also affected by the cascading Cloudflare outage. Orthogonal to observability.
  • Partial-failure mode is itself latent. You do not know the quality of your degraded-mode observability until you run in it. The Redpanda post is retrospective confirmation that degraded mode was sufficient this time; future outages may expose gaps.
  • Fleet-wide log searching capability is the load-bearing property per the post — losing it (what would have happened under full-vendor-hosting) is what Redpanda explicitly avoided. The partial dependency's value is proportional to how useful that capability is.
  • The economics are explicit but not quantified. Redpanda says "exponentially bigger cost ramifications" of the alternative but doesn't disclose the multiplier.
