
PATTERN Cited by 1 source

Dedicated observability Kubernetes clusters

Pattern

Run the org's observability workloads (metrics ingestion, storage, query, dashboards, alerting) on Kubernetes clusters dedicated to observability — physically separated from product and infrastructure workloads — while the platform team continues to operate those clusters rather than handing Kubernetes operations to the observability team.

This is the practical realisation of dedicated-but-managed infrastructure at the Kubernetes-cluster altitude, for observability specifically.

Problem it solves

Observability stacks need to keep functioning when the production systems they monitor fail. If they share a cluster with those production systems, a cluster-level failure (control-plane outage, CNI bug, bad admission controller, node-kernel regression) takes down both — creating a circular dependency that defeats the monitoring stack's purpose.

But running a Kubernetes cluster is not free work. An observability team that adopts its own cluster inherits node operations, upgrade procedures, security patching, the CNI / CSI substrate, and the cluster control plane. Small teams can't absorb this burden.

The dedicated-observability-cluster pattern solves both:

  • Cluster-level fault isolation from the workloads being monitored.
  • The operational substrate continues to belong to the platform team, which operates Kubernetes elsewhere in the org anyway.

When it fits

  • Observability team is small (cannot absorb K8s operations).
  • Platform / Cloud team already runs Kubernetes at scale (marginal cost of N+1 clusters is low).
  • Observability workloads are large enough to justify their own cluster — running a dedicated cluster for a trivial scrape target is waste.
  • Cluster-level failures have been a real reliability concern for observability coverage (control-plane bugs, admission-webhook regressions, node-agent crashes, etc.).

When it doesn't fit

  • Tiny orgs with a handful of services and a single cluster. The split is overkill.
  • Platform team is saturated. If the cost of an extra cluster is real to them, they'll push back, and the pattern breaks without their continued ownership. See concepts/platform-team-bottleneck.
  • Observability team is large enough to operate its own Kubernetes. Some orgs prefer full ownership; the pattern is then replaced with self-operated observability clusters.

Substrate requirements

  • A platform team that actively operates Kubernetes at org scale, not as a side project.
  • Clear ownership split: cluster substrate = platform team; workloads on the cluster = observability team.
  • Coordination discipline around changes. The Airbnb post names this: "we coordinate changes with the Cloud team so that only one major change lands at a time, and so that changes are validated on lower-priority clusters before reaching operational clusters." The pattern assumes this discipline; without it, a cluster-wide Cloud-team change can still take down the observability cluster simultaneously with the product cluster.
  • Cluster count that matches the isolation requirement. The Airbnb post refers to "dedicated Kubernetes clusters" in the plural — more than one.
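The single-change discipline in the requirements above can be sketched as a rollout ordering rule: one major change at a time, validated on lower-priority clusters before it reaches operational ones. This is a minimal illustration; the cluster names and priorities are hypothetical, not Airbnb's real fleet.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    priority: int  # lower number = lower priority, safe to validate on first


def rollout_order(clusters: list[Cluster]) -> list[str]:
    """Serialize one major change across the fleet: lower-priority
    clusters first, operational (high-priority) clusters last."""
    return [c.name for c in sorted(clusters, key=lambda c: c.priority)]


# Hypothetical observability fleet.
fleet = [
    Cluster("obs-prod", priority=3),     # operational observability cluster
    Cluster("obs-canary", priority=2),
    Cluster("obs-staging", priority=1),  # validation target
]

print(rollout_order(fleet))  # ['obs-staging', 'obs-canary', 'obs-prod']
```

The point of the ordering is that a bad change is caught on a cluster whose loss doesn't cost observability coverage.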

Canonical instance (Airbnb)

From the 2026-05-05 post:

"Our 'just right' solution was to isolate our workloads onto dedicated Kubernetes clusters. These clusters aren't shared with product or infrastructure applications, but they're still administered and maintained by the Cloud team. This preserved Kubernetes as a managed foundation while reducing shared failure domains."

Explicit rejection of the two polar alternatives:

  • Shared production clusters: "minimized our operational overhead, but tightly coupled us to the very applications we needed to monitor."
  • Self-operated clusters: "provided full isolation but required deep operational expertise and ongoing maintenance — work the small but mighty Observability team wasn't eager to take on."

Relationship to adjacent patterns

Failure modes

  • Ownership drift. If the platform team deprioritises operational investment in the dedicated observability clusters ("those are the observability team's problem"), the pattern collapses. Explicit ownership is load-bearing.
  • Single-change discipline lost. Applying major changes to observability clusters simultaneously with product clusters negates the fault-isolation benefit.
  • Cluster substrate drift. If the observability cluster runs a different Kubernetes version / CNI / storage driver than the product cluster, the isolation becomes a divergence risk (security patches, feature support).
  • Cross-cluster dependencies. If the observability cluster depends on a shared control plane (IAM, DNS, secret manager) that also hosts product, cluster-level isolation is partial.
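The substrate-drift failure mode above lends itself to a mechanical check: diff the observability cluster's substrate versions against the product baseline and flag divergence. A minimal sketch, with hypothetical field names and versions:

```python
def substrate_drift(baseline: dict, cluster: dict) -> list[str]:
    """Return the substrate fields where `cluster` diverges from `baseline`
    (Kubernetes version, CNI, storage driver, etc.)."""
    return [key for key in baseline if cluster.get(key) != baseline[key]]


# Hypothetical substrate descriptors for a product and an observability cluster.
product = {"k8s": "1.29", "cni": "cilium-1.15", "csi": "ebs-1.28"}
obs = {"k8s": "1.27", "cni": "cilium-1.15", "csi": "ebs-1.28"}

print(substrate_drift(product, obs))  # ['k8s']
```

A non-empty result is the signal that isolation is turning into divergence, which the platform team closes on its normal upgrade cadence.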

Caveats

  • Isolation is at the cluster level. Cross-cluster dependencies — shared IAM, shared networking, shared DNS, shared cloud-provider region — still couple the clusters. The pattern handles cluster-bounded failures; broader cloud-provider failures need DMS-class escape hatches.
  • "Dedicated" doesn't mean "isolated from each other". Observability clusters still federate with each other (Promxy, etc.) and share some observability-internal control plane. Intra-observability-fleet fault isolation is a separate problem.
  • Airbnb doesn't disclose cluster count or cluster size.
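The federation caveat above (Promxy-style cross-cluster querying) can be illustrated with a generic fan-out-and-merge sketch. The backends, query string, and merge rule here are hypothetical stand-ins, not Promxy's actual API:

```python
import concurrent.futures


def fan_out(query: str, backends: dict) -> dict:
    """Send `query` to every per-cluster backend concurrently and merge
    results, keeping the maximum value seen for each series."""
    merged: dict = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for result in pool.map(lambda backend: backend(query), backends.values()):
            for series, value in result.items():
                merged[series] = max(merged.get(series, value), value)
    return merged


# Stub backends standing in for per-cluster query endpoints.
backends = {
    "obs-us-east": lambda q: {"up{job='api'}": 1.0},
    "obs-us-west": lambda q: {"up{job='api'}": 0.0},
}

print(fan_out("up", backends))
```

The fan-out layer is exactly the shared observability-internal control plane the caveat names: if it fails, every dedicated cluster is still healthy, but the global query view is not.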

Seen in

  • sources/2026-05-05-airbnb-monitoring-reliably-at-scale — canonical wiki instance. Airbnb's Observability team runs workloads on dedicated Kubernetes clusters, not shared with product / infrastructure apps; the Cloud team administers and maintains them. The team coordinates major changes with Cloud to keep changes serial and validated on lower-priority clusters first.