

Shared fault domain

Definition

A shared fault domain is any infrastructure element — a Kubernetes node, availability zone, rack, network switch, control plane, cloud region, CI pipeline, deploy system — whose failure simultaneously takes down every workload co-located on it.

If two replicas of the same service sit on the same node, that node is their shared fault domain. If two Prometheus–Alertmanager HA pairs sit in the same AZ, that AZ is their shared fault domain. If a meta-monitoring layer runs on the same cluster as the stack it watches, the cluster is the shared fault domain — and the entire monitoring scheme is compromised.

Why it matters

Most availability arguments assume independent failures: two 99.9% dependencies → 99.8% combined (concepts/availability-multiplication-of-dependencies). This arithmetic is only valid to the extent their failure modes are uncorrelated.

Shared fault domains correlate failures. Two "independent" replicas on the same node fail at the same time when that node fails, so they contribute no more availability than one replica. The on-paper math evaporates once the replicas share a substrate.
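A quick way to see the difference in numbers (a minimal sketch: the 99.9% figure comes from the example above; the node availability is an assumed, illustrative value):

```python
# Availability arithmetic: independent vs. correlated failures.
dep = 0.999  # availability of one 99.9% dependency

# Serial dependencies (both must be up): availabilities multiply.
serial = dep * dep                          # 0.998001 -> the "99.8% combined" above

# Redundant replicas (service is up if at least one replica is up),
# assuming their failures really are independent.
replica = 0.999
independent_pair = 1 - (1 - replica) ** 2   # ~0.999999

# Both replicas on one node: when the node fails they fail together, so to
# first order the pair is only as available as the shared node (assumed value).
node = 0.999
same_node_pair = node                       # redundancy adds essentially nothing

print(f"serial dependencies:  {serial:.6f}")
print(f"independent replicas: {independent_pair:.6f}")
print(f"same-node replicas:   {same_node_pair:.6f}")
```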

Canonical observability instance (Airbnb)

Airbnb's 2026-05-05 post names shared fault domains as an explicit design concern for meta-monitoring, stating the anti-affinity discipline directly:

"To avoid correlated failures, they run on Kubernetes nodes isolated from the observability stack and in different availability zones. Each Prometheus instance is part of a high-availability set, as are the corresponding Alertmanagers, and we ensure no Prometheus–Alertmanager pair can land on the same shared infrastructure, further reducing shared fault domains."

Three levels of anti-affinity are named:

  1. Meta-monitoring pods vs. observability-stack pods — no shared node.
  2. HA Prometheus pairs — no shared AZ.
  3. HA Prometheus–Alertmanager pair — no shared "shared infrastructure" (catch-all for node, AZ, rack, etc.).

See patterns/ha-set-anti-affinity-across-shared-infra for the reusable pattern.
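In Kubernetes terms, the node- and zone-level rules above are usually expressed as pod anti-affinity in the pod spec. A minimal sketch, written as a Python dict for brevity; the app=prometheus-meta label and the choice of required vs. preferred rules are illustrative assumptions, not Airbnb's actual manifests:

```python
import json

# Pod-spec "affinity" stanza expressed as a Python dict. The topology keys are
# standard Kubernetes node labels; the selector label is an assumed convention.
anti_affinity = {
    "podAntiAffinity": {
        # Level 1: never schedule two members of this set onto the same node.
        "requiredDuringSchedulingIgnoredDuringExecution": [
            {
                "labelSelector": {"matchLabels": {"app": "prometheus-meta"}},
                "topologyKey": "kubernetes.io/hostname",
            }
        ],
        # Level 2: spread members of the set across availability zones.
        "preferredDuringSchedulingIgnoredDuringExecution": [
            {
                "weight": 100,
                "podAffinityTerm": {
                    "labelSelector": {"matchLabels": {"app": "prometheus-meta"}},
                    "topologyKey": "topology.kubernetes.io/zone",
                },
            }
        ],
    }
}

print(json.dumps(anti_affinity, indent=2))
```

Whether the zone-level rule should be hard (required) or soft (preferred) is the trade-off noted under Caveats below.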

Layered fault domains (from narrowest to widest)

| Scope | Example | Typical isolation primitive |
| --- | --- | --- |
| Pod / process | OOM, crash | Multiple replicas |
| Node | Kernel panic, node OOM | Anti-affinity across nodes |
| Rack / switch | ToR switch failure | Rack-aware spread |
| Availability zone | AZ power / network | Multi-AZ deployment |
| Cluster | Control-plane failure | Multiple clusters |
| Region | Region-wide incident | Multi-region |
| Control plane | Shared SaaS control plane | Independent control planes |
| Cloud provider | Provider-wide outage | Multi-cloud |

Each layer is the fault domain for everything it contains. A shared-node decision makes the node the fault domain for the pods on it; a shared-AZ decision makes the AZ the fault domain for all nodes in it. Isolation means "don't share the next level up."
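Read operationally, two replicas fail together at the narrowest layer they share. A small sketch under that reading (layer names mirror the table; the placement labels are invented for illustration):

```python
# Fault-domain layers, narrowest to widest, mirroring the table above.
LAYERS = ["node", "rack", "zone", "cluster", "region", "provider"]

def shared_fault_domains(a: dict, b: dict) -> list[str]:
    """Layers where both replicas have the same placement: one failure there kills both."""
    return [layer for layer in LAYERS if a.get(layer) == b.get(layer)]

# Illustrative placements for two replicas of the same HA set.
replica_a = {"node": "node-14", "rack": "rack-3", "zone": "us-east-1a",
             "cluster": "obs-1", "region": "us-east-1", "provider": "aws"}
replica_b = {"node": "node-27", "rack": "rack-3", "zone": "us-east-1a",
             "cluster": "obs-1", "region": "us-east-1", "provider": "aws"}

shared = shared_fault_domains(replica_a, replica_b)
print("shared fault domains:", shared)        # ['rack', 'zone', 'cluster', ...]
print("narrowest shared domain:", shared[0])  # 'rack': a ToR failure takes out both
```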

When does shared fault domain become circular dependency?

If the shared fault domain is also the thing being monitored / recovered / protected, then the pattern is no longer just a correlated-failure risk — it's a circular dependency. The monitoring stack running on the cluster it's supposed to watch is both a shared-fault-domain instance (cluster failure → both) and a circular-dependency instance (cluster failure → no way to observe or fix itself).

The Airbnb post is explicit about this — the dedicated Kubernetes cluster breaks the shared fault domain at the cluster level, and the Dead Man's Switch on AWS breaks the shared control plane at the cloud-provider level.

Anti-patterns

  • "Multi-replica" without anti-affinity — replicas on the same node are one-replica-shaped under node failure.
  • "Multi-AZ" on a single control plane — regional control-plane outages take all AZs.
  • Meta-monitoring on the same cluster as monitored workloads — Airbnb's canonical example of what not to do.
  • HA pairs where both members depend on the same external dependency — the dependency is the real fault domain; the HA pair is decorative.

Caveats

  • "Independent" is a spectrum. Two AZs in the same region share a region-control-plane; two regions in the same cloud share the cloud's IAM / billing / global services. True zero-correlation only shows up at the multi-cloud / multi-organisation level.
  • Anti-affinity rules must be enforced. Kubernetes soft (preferred) anti-affinity is best-effort and can be ignored by the scheduler when nodes are scarce; hard (required) anti-affinity can leave pods unscheduled. The right choice depends on the failure cost.
  • Fault-domain mapping drifts. As the cluster grows, new nodes / AZs / racks appear and old anti-affinity rules can become stale. Regular audit is needed.
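The drift caveat lends itself to a periodic audit: list the members of each HA set and flag any set whose members currently share a node or zone. A minimal sketch using the official kubernetes Python client; the ha-set grouping label is an assumed convention, not a standard Kubernetes label:

```python
from collections import defaultdict
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()
v1 = client.CoreV1Api()

# Map each node to its availability zone via the standard topology label.
node_zone = {
    n.metadata.name: (n.metadata.labels or {}).get("topology.kubernetes.io/zone", "unknown")
    for n in v1.list_node().items
}

# Group pods by their HA set ("ha-set" is an assumed grouping label).
placements = defaultdict(list)  # ha-set name -> [(pod, node, zone), ...]
for pod in v1.list_pod_for_all_namespaces(label_selector="ha-set").items:
    node = pod.spec.node_name
    placements[pod.metadata.labels["ha-set"]].append(
        (pod.metadata.name, node, node_zone.get(node, "unknown"))
    )

# Flag sets whose members collapse onto a shared node or zone.
for ha_set, members in placements.items():
    if len({m[1] for m in members}) < len(members):
        print(f"{ha_set}: members share a node -- the node is the real fault domain")
    if len({m[2] for m in members}) < len(members):
        print(f"{ha_set}: members share a zone -- the zone is the real fault domain")
```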
