PATTERN
HA set anti-affinity across shared infra¶
Pattern¶
When running a high-availability (HA) set (two or more replicas of a service, or a tightly-coupled HA pair of different services), enforce that no two members of the set share any fault domain — not the same node, not the same AZ, not the same rack, not the same cluster, not the same control plane, not any shared dependency.
"Shared infra" is the catch-all: whatever the most-granular shared failure unit is, the HA set's members must not share it.
Why "shared-rack" / "shared-node" isn't enough¶
The common Kubernetes pattern is `podAntiAffinity` with `topologyKey: kubernetes.io/hostname`, preventing two replicas from landing on the same node. That's necessary but not sufficient: if two replicas land on two different nodes in the same AZ, and that AZ fails, both replicas fail.
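A minimal sketch of that common-but-insufficient rule (the `app` label value is illustrative, not from the source):

```yaml
# Pod template fragment: the usual node-only anti-affinity rule.
# Necessary but not sufficient: both replicas can still share an AZ.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: prometheus          # illustrative label
      topologyKey: kubernetes.io/hostname
```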
The pattern extends to all levels of infrastructure the replicas might share:
- Same Kubernetes node
- Same rack / ToR switch
- Same availability zone
- Same cluster
- Same control plane / API server
- Same dependent service (same DB instance, same config store, etc.)
- Same cloud region
At each level, the question is: "If this thing fails, how many members of my HA set go down?" The answer should be at most one, except at the failure level you've chosen not to defend against (e.g., at the cloud-region level if you're single-region).
Airbnb's explicit articulation¶
From the 2026-05-05 post (Source: sources/2026-05-05-airbnb-monitoring-reliably-at-scale), three levels of anti-affinity are enforced:
"To avoid correlated failures, they run on Kubernetes nodes isolated from the observability stack and in different availability zones. Each Prometheus instance is part of a high-availability set, as are the corresponding Alertmanagers, and we ensure no Prometheus–Alertmanager pair can land on the same shared infrastructure, further reducing shared fault domains."
- Layer 1 — observability-stack isolation: meta-monitoring pods must not share a node with observability-stack pods.
- Layer 2 — AZ anti-affinity: HA Prometheus replicas live in different AZs.
- Layer 3 — HA-pair anti-affinity: a Prometheus and its paired Alertmanager must not share infrastructure, so the pair fails independently.
Note that Layer 3 applies to a pair of different services acting as a unit, not to replicas of the same service. This is a less-common but important anti-affinity discipline: if Prometheus-A and Alertmanager-A are intended to cover the same service, and both go down simultaneously, that coverage is gone. Even though each pair member has its own HA replicas, the pair-as-a-unit should fail independently of other pairs.
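A sketch of how the pair-level rule might look on Kubernetes; the `pair` label convention is a hypothetical illustration, not Airbnb's disclosed mechanism:

```yaml
# Pod template fragment for the Alertmanager in hypothetical pair "svc-a".
# The "pair" label is an assumed convention, not from the source.
metadata:
  labels:
    app: alertmanager
    pair: svc-a
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      # Keep this Alertmanager off any node that hosts its paired
      # Prometheus, so one node failure cannot take out both halves.
      - labelSelector:
          matchLabels:
            app: prometheus
            pair: svc-a
        topologyKey: kubernetes.io/hostname
```

The same selector with `topologyKey: topology.kubernetes.io/zone` would extend pair independence to the AZ level.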
When it fits¶
- Any HA set where simultaneous failure of all members is unacceptable. Core databases, alerting systems, auth systems, meta-monitoring, etc.
- Monitoring / alerting systems specifically, because they are the vehicle for detecting other failures. If the monitoring HA set shares a fault domain with the systems it monitors, the pattern is broken.
- Operationally, when the platform supports anti-affinity primitives: Kubernetes `topologyKey` rules, hypervisor anti-affinity groups, cloud-provider placement groups.
When it's harder than it looks¶
- Small clusters. A 3-node cluster can't maintain meaningful anti-affinity for a 3-replica HA set plus meta-monitoring plus other critical workloads.
- Scarce AZs. In regions with only 2 AZs, 3-replica HA sets must share at least one AZ. The trade-off is explicit: replica count > AZ count costs you AZ-level independence (see the spread-constraint sketch after this list).
- Cost. Enforcing strong anti-affinity can cost scheduling efficiency (empty slots waiting for the right affinity to open up) and require bigger clusters (to have enough unique fault-domain slots).
- Drift over time. New nodes / AZs / racks appear and disappear. Anti-affinity rules must be audited periodically or they become stale.
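When replicas outnumber AZs, a topology spread constraint can make that trade-off explicit instead of leaving replicas unscheduled under a hard rule. A sketch, assuming 3 replicas across 2 zones (label value illustrative):

```yaml
# Pod template fragment: 3 replicas, 2 AZs. maxSkew: 1 permits a 2/1
# split across zones, which hard zone anti-affinity would reject,
# while still refusing a 3/0 pile-up in one zone.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: prometheus            # illustrative label
```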
Implementation on Kubernetes¶
- `podAntiAffinity.requiredDuringSchedulingIgnoredDuringExecution` with `topologyKey: kubernetes.io/hostname` for node anti-affinity; `topologyKey: topology.kubernetes.io/zone` for AZ anti-affinity (both combined in the sketch below).
- Custom topology keys for rack / PSU / switch if your node labels expose them.
- `matchLabels` selecting the HA-set peers (e.g., `{app: prometheus, tier: meta-monitoring}` for replica-level; a broader selector for HA-pair-level across two different app labels).
- Soft (`preferred`) vs hard (`required`) is a trade-off between schedulability and strict isolation. Meta-monitoring typically wants hard.
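Putting those pieces together, a minimal sketch; the name, labels, replica count, and image are illustrative assumptions (the source discloses none of them):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus-meta            # hypothetical name
spec:
  serviceName: prometheus-meta
  replicas: 3                      # illustrative; the source gives no counts
  selector:
    matchLabels:
      app: prometheus
      tier: meta-monitoring
  template:
    metadata:
      labels:
        app: prometheus
        tier: meta-monitoring
    spec:
      affinity:
        podAntiAffinity:
          # Hard rules: meta-monitoring trades schedulability for isolation.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: prometheus
                tier: meta-monitoring
            topologyKey: kubernetes.io/hostname        # one replica per node
          - labelSelector:
              matchLabels:
                app: prometheus
                tier: meta-monitoring
            topologyKey: topology.kubernetes.io/zone   # one replica per AZ
      containers:
      - name: prometheus
        image: prom/prometheus:v2.53.0                 # illustrative image
```

Swapping `required...` for `preferredDuringSchedulingIgnoredDuringExecution` (with a `weight`) gives the soft variant; if you do, alert on co-located replicas, since the scheduler may violate the preference silently.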
Failure modes¶
- Soft anti-affinity silently overridden. `preferred` rules can be violated if the scheduler can't find a better placement. Without alerts on violation, you may think you have isolation when you don't.
- Stale topology labels. New nodes without proper zone labels can satisfy anti-affinity rules vacuously — they're "different" because the label is missing / `""`, which technically satisfies `topologyKey` constraints (a defensive sketch follows this list).
- HA set starved of placements. Too-strict rules with too-few fault domains leave replicas unscheduled, which is a different failure mode (no HA at all because half the set never started).
- Cross-HA-pair anti-affinity missed. Teams enforce replica anti-affinity but not pair anti-affinity, and a Prometheus-and-Alertmanager pair for a particular service ends up on the same host as the pair for another service — collapsing two independent pairs into a shared fault domain.
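For the stale-label failure mode above, one defensive guard is to refuse nodes that lack the zone label at all; a sketch (an assumed hardening step, not from the source):

```yaml
# Pod template fragment: reject nodes with no zone label, so an
# unlabeled node cannot satisfy zone-level rules vacuously.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: Exists
```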
Caveats¶
- Anti-affinity is a necessary condition, not a sufficient one. Replicas on different nodes that depend on the same external dependency (same DB, same LB, same control plane) still share a fault domain at that dependency. True independence requires auditing the whole dependency graph.
- Airbnb's post does not disclose replica counts or AZ counts for the meta-monitoring HA sets.
- "Shared infrastructure" is fuzzy; what counts as shared depends on the failure mode you're defending against. Defining it precisely per HA set is part of the work.
Seen in¶
- sources/2026-05-05-airbnb-monitoring-reliably-at-scale — canonical wiki instance. Airbnb's meta-monitoring HA sets enforce three levels of anti-affinity: node-level between meta-monitoring and the observability stack, AZ-level between Prometheus replicas, and pair-level between Prometheus-Alertmanager pairs as units.