Airbnb — Monitoring reliably at scale¶
Summary¶
Abdurrahman J. Allawala (Airbnb Engineering, 2026-05-05) describes how Airbnb's Observability team broke circular dependencies in its metrics platform so the monitoring stack stays up when the systems it monitors go down. The post articulates three discrete remediations across compute, networking, and meta-monitoring: (1) run observability workloads on dedicated-but-managed Kubernetes clusters (isolation without the operational overhead of self-run Kubernetes); (2) build a custom Envoy-based L7 ingress tier for telemetry traffic rather than carrying it through Airbnb's shared Istio service mesh; (3) add a dedicated meta-monitoring layer of HA Prometheus instances pinned away from shared fault domains, terminated by a Dead Man's Switch (always-firing alert → AWS SNS → CloudWatch alarm on message rate) so that monitoring the monitors does not collapse into an infinite regress. The underlying design bar is explicit: "treat monitoring as a production system whose availability must exceed that of what it observes."
Key takeaways¶
- Circular dependencies in observability are a named reliability risk. "What happens when your observability stack is dependent on the same systems that are failing? In that moment, the dashboards go dark, alerts stop firing, and the tools meant to guide recovery become part of the outage." (Source: sources/2026-05-05-airbnb-monitoring-reliably-at-scale). Airbnb traced the risk back to metrics pipelines built on the same shared platforms (Kubernetes, service mesh) they were meant to observe. Framed as circular dependency — same structural shape the deployment-context page canonicalised, now instantiated at the observability-substrate altitude.
- Dedicated-but-managed Kubernetes clusters: the "just right" middle option. The team rejected both poles — sharing production clusters with the workloads being monitored (couples observability to the apps) and running their own Kubernetes ("required deep operational expertise and ongoing maintenance — work the small but mighty Observability team wasn't eager to take on"). The middle option: "isolate our workloads onto dedicated Kubernetes clusters. These clusters aren't shared with product or infrastructure applications, but they're still administered and maintained by the Cloud team." Coordination discipline around rollouts: "we coordinate changes with the Cloud team so that only one major change lands at a time, and so that changes are validated on lower-priority clusters before reaching operational clusters." Canonical instance of concepts/dedicated-but-managed-infrastructure and patterns/dedicated-observability-kubernetes-clusters.
- Observability traffic has a different shape from business traffic, which forces a separate network tier. "At Airbnb's scale, we send orders of magnitude more observability traffic than business traffic." The shared Istio service mesh was designed around business workloads, not telemetry push. Two coupled problems: first, telemetry through the mesh would be a circular dependency ("metrics for the data plane would depend on that same data plane to be delivered"); second, "as usage grew, congestion could make metrics unavailable, eroding critical debuggability... Worse, telemetry spikes could also consume shared capacity and degrade or disrupt application traffic, directly impacting Airbnb.com availability." Canonical instance of concepts/observability-traffic-volume-asymmetry.
- Custom L7 Envoy ingress with header-based tenant routing. "We built a custom Layer 7 network ingress layer based on Envoy that load-balances traffic and routes read and write requests to the right backends. Running this proxy independent of the shared compute layer added fault tolerance and shielded our ingest path from service-mesh failures." Tenancy model: "Airbnb runs over 1,000 services, each mapped to its own tenant in a single, global user space. Our custom load-balancing tier makes this practical: we map each service name to a specific cluster backend, and every request must include a tenant header that informs routing." Extensibility win: "we can mirror metrics to alternate destinations for testing or enforce fine-grained access controls, which is critical when working with external vendors or specialized use cases." Canonical instance of patterns/custom-l7-proxy-for-telemetry-over-service-mesh and concepts/tenant-header-routing; see the tenant-routing sketch after this list.
- Compute vs networking: why own one but not the other. The post is explicit about the asymmetric build-vs-adopt decision: "For compute, Kubernetes was already a mature, managed foundation... adding dedicated clusters for observability was a relatively small increment to their existing footprint. The networking layer was different: our service mesh couldn't cleanly isolate and prioritize observability traffic from business traffic at our scale, and the features we needed — strict prioritization, isolation, and custom routing for telemetry — sat squarely within our team's domain." The operational-cost argument: "Owning this layer gave us the control we wanted and, compared to running Kubernetes ourselves, it was a much more straightforward surface to operate."
- Meta-monitoring: separate Prometheus instances watching the observability stack. "We run a separate set of Prometheus instances dedicated to monitoring our observability stack." Fault-domain isolation discipline: "To avoid correlated failures, they run on Kubernetes nodes isolated from the observability stack and in different availability zones. Each Prometheus instance is part of a high-availability set, as are the corresponding Alertmanagers, and we ensure no Prometheus–Alertmanager pair can land on the same shared infrastructure, further reducing shared fault domains." Canonical instance of concepts/meta-monitoring and patterns/ha-set-anti-affinity-across-shared-infra; see the placement-check sketch after this list.
- Dead Man's Switch: heartbeat absence as the alert trigger, externalised to AWS. To avoid infinite monitoring regress ("how do we know if the meta-monitoring layer is down? Spinning up yet another monitoring stack would just lead to an infinite regress"), Airbnb uses a dead-man's-switch: "We maintain an alerting rule that always fires as long as Prometheus is scraping correctly. Alertmanager continuously sends these alerts to an external AWS SNS topic, and a CloudWatch alarm monitors the rate of incoming messages. If they stop — because Prometheus is down, scraping has stalled, Alertmanager can't send, or something else has degraded — the CloudWatch alarm triggers and on-call is paged." Structural property: the watcher is on a different cloud-native control plane (AWS managed services) than the observability stack it watches. Canonical instance of patterns/heartbeat-absence-as-alert-trigger; see the heartbeat-watchdog sketch after this list.
- The design bar, stated explicitly. "Treat monitoring as a production system whose availability must exceed that of what it observes." Generalisation: "any organization can improve monitoring reliability by mapping critical dependencies, intentionally isolating failure domains, and ensuring there is always an independent path for the signals that drive paging and incident response."
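To make the tenant-header routing concrete, here is a minimal sketch of the rule the post describes: every request must carry a tenant header, and each tenant (service name) maps to a specific cluster backend. The header name, backend names, and mapping below are assumptions for illustration; the post does not disclose Airbnb's actual header or topology, and the real tier is an Envoy configuration rather than application code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Backend:
    name: str
    write_endpoint: str
    read_endpoint: str

# Hypothetical service-name -> metrics-cluster mapping; real tenants and
# backends are not disclosed in the post.
TENANT_TO_BACKEND = {
    "homes-search": Backend("metrics-cluster-a", "https://write.a.internal", "https://read.a.internal"),
    "payments": Backend("metrics-cluster-b", "https://write.b.internal", "https://read.b.internal"),
}

def route(headers: dict, is_write: bool) -> str:
    """Pick the upstream for one telemetry request.

    Every request must carry a tenant header; requests without one are
    rejected rather than routed to a default backend.
    """
    tenant = headers.get("x-tenant")  # header name is an assumption
    if tenant is None:
        raise ValueError("telemetry request has no tenant header")
    backend = TENANT_TO_BACKEND[tenant]  # unknown tenants raise KeyError
    return backend.write_endpoint if is_write else backend.read_endpoint

# Example: a write from the payments service lands on cluster B.
assert route({"x-tenant": "payments"}, is_write=True) == "https://write.b.internal"
```

The sketch keeps the topology server-side: the client only names its tenant, and the routing tier decides which cluster backend receives the read or write.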
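A minimal placement-check sketch of the anti-affinity discipline quoted for the meta-monitoring layer: meta-monitoring Prometheus must not share nodes with the observability stack, and no Prometheus–Alertmanager pair may share a node or availability zone. Node names and zones below are hypothetical; in practice these constraints would be expressed as Kubernetes scheduling rules, not checked in application code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Placement:
    node: str
    zone: str

def shared_fault_domains(prometheus: Placement,
                         alertmanager: Placement,
                         observability_nodes: set) -> list:
    """Return the anti-affinity violations for one Prometheus-Alertmanager HA pair."""
    problems = []
    if prometheus.node in observability_nodes:
        problems.append("meta-monitoring Prometheus shares a node with the observability stack")
    if prometheus.node == alertmanager.node:
        problems.append("Prometheus and its paired Alertmanager share a node")
    if prometheus.zone == alertmanager.zone:
        problems.append("Prometheus and its paired Alertmanager share an availability zone")
    return problems

# Example: the pair is spread across distinct nodes and zones, so nothing fires.
assert shared_fault_domains(
    Placement(node="meta-node-1", zone="us-east-1a"),
    Placement(node="meta-node-2", zone="us-east-1b"),
    observability_nodes={"obs-node-1", "obs-node-2"},
) == []
```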
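A heartbeat-watchdog sketch of the dead-man's-switch evaluation on the receiving side: the stack pushes an always-firing alert to an external AWS SNS topic, and a watchdog pages when the message rate drops to zero. The post does not disclose the heartbeat cadence, window, or threshold, so the five-minute window below is an assumption; the real trigger is a CloudWatch alarm on SNS message rate, not Python.

```python
import time

def heartbeat_stopped(heartbeat_timestamps: list, now: float,
                      window_seconds: float = 300.0) -> bool:
    """True when no heartbeat message arrived within the evaluation window.

    Absence of signal is the signal: a silent channel means Prometheus is down,
    scraping has stalled, Alertmanager cannot send, or something in between broke.
    """
    return not any(now - t <= window_seconds for t in heartbeat_timestamps)

# Example: the last always-firing alert reached the channel ten minutes ago,
# so the watchdog pages on-call.
last_seen = time.time() - 600
if heartbeat_stopped([last_seen], now=time.time()):
    print("PAGE: meta-monitoring heartbeat missing; observability stack may be down")
```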
Architectural numbers¶
- Over 1,000 services at Airbnb, each mapped to its own tenant in a single global user space at the metrics platform.
- "Orders of magnitude more observability traffic than business traffic" — the asymmetry that forced the custom network tier.
- Scale anchor from prior Airbnb post (sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system): ~50M samples/sec, 1.3B active time series, and 2.5 PB of logical data across the storage plane that this networking tier fronts.
Systems and concepts extracted¶
- systems/airbnb-observability-platform — the overall in-house metrics platform; this post adds the reliability axis (dedicated compute cluster + custom L7 ingress + meta-monitoring with DMS) on top of the storage/ingestion axes previously canonicalised.
- systems/envoy — new role: telemetry-ingress L7 tier running independent of Istio, with header-based tenant routing for ~1,000 services; adds an eighth Envoy role to the wiki taxonomy.
- systems/istio — named and explicitly rejected as the carrier for observability traffic; the same service-mesh substrate remains fine for business traffic.
- systems/prometheus — canonical meta-monitoring deployment shape (HA pairs pinned to distinct nodes + AZs).
- systems/alertmanager — canonical meta-monitoring delivery tier (stub prior to this ingest); HA pair anti-affinity with its paired Prometheus.
- systems/aws-sns — used as the external termination channel for a dead-man's-switch heartbeat; SNS topic receives the always-firing alert continuously.
- systems/aws-cloudwatch — rate-based alarm on SNS message arrival is the actual trigger that pages on-call when the heartbeat stops; new role as observability-stack watchdog on an independent control plane.
- systems/kubernetes — substrate; dedicated clusters for observability + Cloud-team-maintained as the middle ground.
- concepts/circular-dependency — extended from the deployment-context instance to the observability-substrate instance; three concrete circular-dependency shapes enumerated (compute / networking / meta-monitoring).
- concepts/meta-monitoring — new first-class concept: monitor the monitors.
- concepts/dead-mans-switch — new first-class concept: the anchor primitive for meta-monitoring without infinite regress.
- concepts/dedicated-but-managed-infrastructure — new first-class concept: the middle option between shared production clusters and self-run Kubernetes.
- concepts/observability-traffic-volume-asymmetry — new first-class concept: why telemetry has different networking-layer requirements than business traffic at scale.
- concepts/tenant-header-routing — new first-class concept: service-name → cluster-backend mapping via a required request header; client doesn't know topology.
- patterns/custom-l7-proxy-for-telemetry-over-service-mesh — new pattern: when telemetry volume + circular-dependency risk + routing-requirements exceed what a generic service mesh can cleanly deliver, build your own L7 tier.
- patterns/heartbeat-absence-as-alert-trigger — new pattern: always-firing alert → external channel → rate alarm; absence of signal is the signal.
- patterns/ha-set-anti-affinity-across-shared-infra — new pattern: HA pairs (Prometheus-Alertmanager, etc.) must not land on the same node / AZ / shared-fault-domain.
- patterns/dedicated-observability-kubernetes-clusters — new pattern: run observability workloads on dedicated clusters administered by the platform team, not on shared production clusters.
Caveats¶
- No latency / p99 / resource-consumption numbers on the custom Envoy tier. The post claims "strict prioritization, isolation, and custom routing for telemetry" without disclosing the prioritisation mechanism's specifics or measured results.
- No disclosure of the Dead Man's Switch firing cadence (every second? every minute?) or the CloudWatch alarm's exact threshold / evaluation window.
- No disclosure of Prometheus–Alertmanager HA pair sizing (2-of-2? 3-of-3?) or the specific "different availability zones" count.
- Meta-monitoring coverage gap not named. The post does not state what happens if AWS SNS or CloudWatch itself is degraded — the DMS's external channel is AWS, which becomes the new trust anchor. Correlation with AWS outages is not discussed.
- Single author, no incident retrospective. Framed as design narrative rather than an incident-driven post-mortem; no specific outage is cited as the motivating event.
- Size of the Observability team named only as "small but mighty" — no headcount disclosed.
- Multi-region / cross-region failover of the observability stack not addressed. Per-AZ isolation is discussed; cross-region topology is not.
Source¶
- Original: https://medium.com/airbnb-engineering/monitoring-reliably-at-scale-ca6483040930?source=rss----53c7c27702d5---4
- Raw markdown: raw/airbnb/2026-05-05-monitoring-reliably-at-scale-07d3d0c6.md
Related¶
- companies/airbnb
- systems/airbnb-observability-platform
- systems/airbnb-metrics-storage
- systems/envoy
- systems/istio
- systems/prometheus
- systems/alertmanager
- systems/aws-sns
- systems/aws-cloudwatch
- systems/kubernetes
- concepts/circular-dependency
- concepts/observability
- concepts/meta-monitoring
- concepts/dead-mans-switch
- concepts/dedicated-but-managed-infrastructure
- concepts/observability-traffic-volume-asymmetry
- concepts/tenant-header-routing
- patterns/custom-l7-proxy-for-telemetry-over-service-mesh
- patterns/heartbeat-absence-as-alert-trigger
- patterns/ha-set-anti-affinity-across-shared-infra
- patterns/dedicated-observability-kubernetes-clusters