Shared fault domain¶
Definition¶
A shared fault domain is any infrastructure element — a Kubernetes node, availability zone, rack, network switch, control plane, cloud region, CI pipeline, or deploy system — whose failure simultaneously takes down every workload co-located on it.
If two replicas of the same service sit on the same node, that node is their shared fault domain. If two Prometheus–Alertmanager HA pairs sit in the same AZ, that AZ is their shared fault domain. If a meta-monitoring layer runs on the same cluster as the stack it watches, the cluster is the shared fault domain — and the entire monitoring scheme is compromised.
Why it matters¶
Most availability arguments assume independent failures: two 99.9% dependencies → 99.8% combined (concepts/availability-multiplication-of-dependencies). This arithmetic is only valid to the extent their failure modes are uncorrelated.
Shared fault domains correlate failures. Two "independent" replicas on the same node fail at the same time when that node fails, so they contribute the same availability as one replica. The paper-math expectation evaporates under the shared-substrate reality.
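The arithmetic above can be sketched in a few lines. This is an illustrative calculation, not taken from any source; the 99.9% figures are the example numbers from the availability-multiplication note:

```python
# Illustrative availability arithmetic: series dependencies vs parallel
# replicas, and what correlation does to the parallel case.

a = 0.999  # availability of one dependency / one replica

# Two dependencies in series: availabilities multiply (both must be up).
series = a * a                      # 0.998001 — the "99.8% combined" figure

# Two replicas in parallel, failures assumed independent:
# unavailabilities multiply (both must be down to lose service).
independent_pair = 1 - (1 - a) ** 2  # 0.999999

# Two replicas on the same node: node failure takes both, so the pair
# is exactly as available as the node — no better than one replica.
same_node_pair = a                   # 0.999

print(f"series deps:      {series:.6f}")
print(f"independent pair: {independent_pair:.6f}")
print(f"same-node pair:   {same_node_pair:.6f}")
```

The gap between `independent_pair` and `same_node_pair` is the "paper-math expectation" that evaporates: three nines of extra availability that the shared node silently deletes.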
Canonical observability instance (Airbnb)¶
Airbnb's 2026-05-05 post names shared fault domains as an explicit design concern for meta-monitoring, stating the anti-affinity discipline directly:
"To avoid correlated failures, they run on Kubernetes nodes isolated from the observability stack and in different availability zones. Each Prometheus instance is part of a high-availability set, as are the corresponding Alertmanagers, and we ensure no Prometheus–Alertmanager pair can land on the same shared infrastructure, further reducing shared fault domains."
Three levels of anti-affinity are named:
- Meta-monitoring pods vs. observability-stack pods — no shared node.
- HA Prometheus pairs — no shared AZ.
- HA Prometheus–Alertmanager pair — no shared "shared infrastructure" (catch-all for node, AZ, rack, etc.).
See patterns/ha-set-anti-affinity-across-shared-infra for the reusable pattern.
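The three levels above amount to a placement audit. A minimal sketch, assuming hypothetical pod names and placements (none of these identifiers come from the Airbnb post):

```python
# Placement audit for the three anti-affinity levels.
# Pods map to (node, availability zone); all labels are hypothetical.
placements = {
    "meta-prom-0": ("node-a", "us-east-1a"),  # meta-monitoring
    "obs-prom-0":  ("node-b", "us-east-1b"),  # observability stack
    "ha-prom-0":   ("node-c", "us-east-1a"),  # HA Prometheus pair
    "ha-prom-1":   ("node-d", "us-east-1b"),
    "ha-alert-0":  ("node-e", "us-east-1c"),  # corresponding Alertmanager
}

NODE, ZONE = 0, 1

def shares(pod_a: str, pod_b: str, level: int) -> bool:
    """True if the two pods sit in the same fault domain at this level."""
    return placements[pod_a][level] == placements[pod_b][level]

# Level 1: meta-monitoring vs observability-stack pods — no shared node.
assert not shares("meta-prom-0", "obs-prom-0", NODE)
# Level 2: HA Prometheus pair — no shared AZ.
assert not shares("ha-prom-0", "ha-prom-1", ZONE)
# Level 3: Prometheus–Alertmanager pair — no shared node or AZ.
assert not shares("ha-prom-0", "ha-alert-0", NODE)
assert not shares("ha-prom-0", "ha-alert-0", ZONE)
```

In Kubernetes the same constraints would be expressed declaratively with pod anti-affinity over `kubernetes.io/hostname` and `topology.kubernetes.io/zone` topology keys rather than checked after the fact; the audit form just makes the three rules explicit.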
Layered fault domains (from narrowest to widest)¶
| Scope | Example | Typical isolation primitive |
|---|---|---|
| Pod / process | OOM, crash | Multiple replicas |
| Node | Kernel panic, node OOM | Anti-affinity across nodes |
| Rack / switch | ToR switch failure | Rack-aware spread |
| Availability zone | AZ power / network | Multi-AZ deployment |
| Cluster | Control-plane failure | Multiple clusters |
| Region | Region-wide incident | Multi-region |
| Control plane | Shared SaaS control | Independent control planes |
| Cloud provider | Provider-wide outage | Multi-cloud |
Each row is the fault domain for everything beneath it. A shared-node decision makes the node the fault domain for the pods on it; a shared-AZ decision makes the AZ the fault domain for all nodes in it. Isolation means "don't share the next level up."
When does shared fault domain become circular dependency?¶
If the shared fault domain is also the thing being monitored / recovered / protected, then the pattern is no longer just a correlated-failure risk — it's a circular dependency. The monitoring stack running on the cluster it's supposed to watch is both a shared-fault-domain instance (cluster failure → both) and a circular-dependency instance (cluster failure → no way to observe or fix itself).
The Airbnb post is explicit about this — the dedicated Kubernetes cluster breaks the shared fault domain at the cluster level, and the Dead Man's Switch on AWS breaks the shared control plane at the cloud-provider level.
Related concepts on the wiki¶
- concepts/circular-dependency — the sharper version where the shared substrate is also the monitored / recovered substrate.
- concepts/availability-multiplication-of-dependencies — the arithmetic that assumes independence.
- concepts/blast-radius — what the fault-domain exposes; isolation narrows the blast radius.
- concepts/sharded-failure-domain-isolation — the partitioning primitive for bounding the blast radius.
- concepts/availability-zone-balance — AZ-specific anti-affinity discipline.
Anti-patterns¶
- "Multi-replica" without anti-affinity — replicas on the same node are one-replica-shaped under node failure.
- "Multi-AZ" on a single control plane — regional control-plane outages take all AZs.
- Meta-monitoring on the same cluster as monitored workloads — Airbnb's canonical example of what not to do.
- HA pairs where both members depend on the same external dependency — the dependency is the real fault domain; the HA pair is decorative.
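The last anti-pattern falls out of the same arithmetic: the HA pair's availability is capped by the shared dependency, no matter how well the pair itself is spread. An illustrative calculation (numbers are not from the source):

```python
# An HA pair whose members both need the same external dependency is
# capped at that dependency's availability — the pair is "decorative".
replica = 0.999  # each HA member on its own
dep     = 0.99   # shared external dependency

# The pair alone, failures independent: near-perfect.
pair_alone = 1 - (1 - replica) ** 2   # ≈ 0.999999

# The shared dependency sits in series with the whole pair.
with_dep = dep * pair_alone           # ≈ 0.989999

print(f"pair alone: {pair_alone:.6f}")
print(f"with dep:   {with_dep:.6f}")  # → worse than either replica alone
```

Six nines of pair availability collapse to roughly the dependency's two nines: the dependency, not the pair, is the real fault domain.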
Caveats¶
- "Independent" is a spectrum. Two AZs in the same region share a region-control-plane; two regions in the same cloud share the cloud's IAM / billing / global services. True zero-correlation only shows up at the multi-cloud / multi-organisation level.
- Anti-affinity rules must be enforced. Kubernetes soft-anti-affinity can be overridden by the scheduler if nodes are scarce; hard anti-affinity can leave pods unscheduled. The right choice depends on the failure cost.
- Fault-domain mapping drifts. As the cluster grows, new nodes / AZs / racks appear and old anti-affinity rules can become stale. Regular audit is needed.
Seen in¶
- sources/2026-05-05-airbnb-monitoring-reliably-at-scale — Airbnb names the shared-fault-domain discipline for meta-monitoring (node + AZ + HA-pair anti-affinity) and composes it with a Dead Man's Switch to break the control-plane fault domain.