Airbnb observability platform

Airbnb's in-house metrics platform, built on Prometheus and PromQL, replacing a third-party vendor-managed system after a ~5-year migration. Scope covers instrumentation → collection → storage → visualization → alerting end to end, so Airbnb owns the full lifecycle of metrics.

Scale (at migration completion)

~1,000 services
~300M timeseries
~3,100 dashboards
~300,000 alerts

Components

Instrumentation: OTLP preferred for internal services, Prometheus for OSS workloads, StatsD (DogStatsD format) as legacy fallback. Migration from StatsD went through a shared metrics library that dual-wrote both protocols at once (see patterns/dual-write-migration). On JVM services, CPU spent on metrics dropped from 10% to <1% of samples after the OTLP cutover. (Source: sources/2026-04-16-airbnb-statsd-to-otel-metrics-pipeline)
Temporality: cumulative by default; select high-volume emitters use delta temporality to bound SDK memory (see concepts/metric-temporality).
Collection: OpenTelemetry Collector as the vendor-neutral ingress; replaced the previous StatsD/Veneur-sidecar path.
Streaming-aggregation tier: two-tier vmagent deployment. Stateless routers consistent-hash on the non-aggregated labels and shard to stateful aggregators (StatefulSet) that keep running totals. Scales to hundreds of aggregators / 100M+ samples/sec per cluster with ~10× cost reduction vs. storing raw metrics. See systems/vmagent and concepts/streaming-aggregation. The tier also functions as a metric-wide control point: drop bad metrics, dual-emit raw metrics on demand, inject zero seeds for sparse counters (see patterns/zero-injection-counter).
Storage / query engine: Prometheus-based. PromQL as the user-facing query language.
Translation layer: automated translators that moved legacy dashboards and alerts into the new system during migration. Translates intent (e.g., canonical histogram query for any p95 request) rather than doing a literal query port — see patterns/intent-preserving-query-translation.
Metadata engine (inside the translation layer): periodically scans all metrics and maintains a reliable metric → type mapping (counter / histogram / gauge) using an internal _otel_metric_type_ label, since Airbnb kept legacy metric names instead of renaming to Prometheus conventions. See concepts/metric-type-metadata.
AI tooling: in-house LLM skills seeded with the metadata engine's type/unit info, so agents can generate correct PromQL with minimal manual effort. Used for incident diagnosis and dashboard bootstrapping.
New alert-authoring framework (from the Reliability XP team): treats alerts as code — autocomplete, builder-style query help, historical backtesting ("when would this have fired?"), diffing of changes before deploy. Replaces a legacy sparsely-documented config-file style. See patterns/alerts-as-code.
Change Report + bulk-backtest backend (detail from sources/2026-03-04-airbnb-alert-backtesting-change-reports): hooks directly into Prometheus's rules/manager.go rule-evaluation engine (compatibility-over-novelty), writes backtest results as Prometheus time series blocks exposed via the standard range-query API. Each backtest runs in its own Kubernetes pod with autoscaling; concurrency limits + error thresholds + multiple circuit breakers prevent cascading failures. Runs at full-diff granularity (hundreds to thousands of alerts per PR), typical window 30 days, surfaces a computed "noisiness" metric + firing-count timeline sortable in the Change Report UI. Modified recording-rule dependencies are highlighted in the UI as a guided two-step flow rather than resolved by the simulator. Change Reports post automatically on every PR (via CLI or CI). See patterns/alert-backtesting.
New visualization tool: replaced the vendor UI as part of the migration (specific tool not named in the post).
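
A minimal Python sketch of the router / aggregator split described under "Streaming-aggregation tier" above. It illustrates the idea only and is not vmagent's implementation. The `Router` and `Aggregator` classes, the metric and label names, the choice of `pod` as the label being aggregated away, and the treatment of samples as delta increments are all assumptions made for the example. The point it shows: because the router hashes only the labels that survive aggregation, every input series that feeds the same output series lands on the same aggregator, which can then keep a running total locally.

```python
import bisect
import hashlib
from collections import defaultdict


def stable_hash(key: str) -> int:
    """Process-independent 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class Router:
    """Stateless router: picks an aggregator from the labels that survive aggregation."""

    def __init__(self, aggregators, dropped_labels, vnodes=64):
        self.dropped = frozenset(dropped_labels)
        # Hash ring with virtual nodes, so adding or removing an aggregator
        # only remaps a fraction of the keys.
        ring = sorted(
            (stable_hash(f"{name}#{i}"), name)
            for name in aggregators
            for i in range(vnodes)
        )
        self.points = [point for point, _ in ring]
        self.owners = [name for _, name in ring]

    def output_key(self, metric, labels):
        """Identity of the aggregated output series: metric name plus kept labels."""
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in self.dropped))
        return (metric, kept)

    def route(self, metric, labels):
        point = stable_hash(repr(self.output_key(metric, labels)))
        idx = bisect.bisect(self.points, point) % len(self.points)
        return self.owners[idx]


class Aggregator:
    """Stateful aggregator: keeps a running total per output series."""

    def __init__(self):
        self.totals = defaultdict(float)

    def add(self, key, delta):
        self.totals[key] += delta


# Hypothetical aggregator names and samples, for illustration only.
names = ["agg-0", "agg-1", "agg-2"]
router = Router(names, dropped_labels={"pod"})
aggregators = {name: Aggregator() for name in names}

samples = [
    ("http_requests_total", {"service": "search", "pod": "a"}, 3.0),
    ("http_requests_total", {"service": "search", "pod": "b"}, 4.0),
    ("http_requests_total", {"service": "payments", "pod": "c"}, 1.0),
]
for metric, labels, delta in samples:
    target = router.route(metric, labels)
    aggregators[target].add(router.output_key(metric, labels), delta)
```

Both `search` samples differ only in `pod`, so they route to the same aggregator and collapse into one running total there. Hashing the full label set instead would scatter them across aggregators and leave partial sums that no single node could finish.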

Storage plane (2026-04-21 deep-dive)

The storage tier under the interaction layer gets its own post on 2026-04-21 (Rishabh Kumar) — see systems/airbnb-metrics-storage for the dedicated system page. Scale numbers and key properties:

50M samples/sec ingestion; 1.3B active time series; 2.5 PB logical data; 10K dashboards / 500K alerts; p99 query <30s SLO; >99.9% availability.
Tenant-per-application (~1,000 services = ~1,000 tenants) with exposed series limits and derived ingest-rate / burst-size limits; consolidated control plane auto-onboards new services as tenants. See patterns/tenant-per-application.
concepts/shuffle-sharding isolates read and write paths — each tenant hashes to a subset of ingesters and a subset of query workers. A DDoS or runaway query from one tenant saturates that tenant's shuffle set; others are unaffected.
Single-cluster reliability investments: per-replica benchmarking and limits, query sharding, compaction sharding (8M series/worker for large tenants), three-zone stateful deployments, separated evaluation and ad-hoc query paths, autoscaled read tier.
Multi-cluster federation for blast-radius reduction: workload-segregated clusters (compute / mesh / application tiers) connected via a custom systems/promxy build with native-histogram support and query-fanout optimization. See concepts/active-multi-cluster-blast-radius and concepts/cross-cluster-federated-query-cost (5–10× cost tax).
Progressive cluster rollout from test → internal → application → infrastructure clusters, sequenced by criticality. See patterns/progressive-cluster-rollout.
Grafana Kubernetes rollout operators (customised for Airbnb's pod-disruption-budget requirements) replaced multi-day manual StatefulSet deploys, enabling a "clusters as cattle, not pets" operational philosophy.
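
The shuffle-sharding idea in the list above (each tenant hashes to a subset of ingesters and a subset of query workers) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the storage system's actual algorithm: the function name `shuffle_shard`, the node names, the shard sizes, and the `/write` and `/read` suffixes used to derive separate seeds are all hypothetical.

```python
import hashlib
import random


def shuffle_shard(tenant: str, nodes: list[str], shard_size: int) -> list[str]:
    """Deterministically pick `shard_size` nodes for a tenant.

    The tenant ID seeds a private random generator, so the same tenant
    always gets the same subset while different tenants get different,
    mostly non-overlapping subsets.
    """
    seed = int.from_bytes(hashlib.sha256(tenant.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    # Sort first so the result does not depend on the caller's node ordering.
    return sorted(rng.sample(sorted(nodes), shard_size))


# Hypothetical fleet sizes and shard sizes, for illustration only.
ingesters = [f"ingester-{i}" for i in range(12)]
queriers = [f"querier-{i}" for i in range(8)]

# Separate seeds for the write and read paths, so the two subsets are independent.
search_write = shuffle_shard("search/write", ingesters, 3)
search_read = shuffle_shard("search/read", queriers, 2)
payments_write = shuffle_shard("payments/write", ingesters, 3)
```

With 12 ingesters and a shard size of 3 there are 220 possible subsets, so two tenants rarely share their whole shard and a tenant that saturates its own shard leaves most other tenants with healthy nodes. A production implementation would also need to keep the selection stable as nodes join and leave, which this sketch does not attempt.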

Reliability plane (2026-05-05 deep-dive)

The reliability plane of the platform — how the stack stays up when the systems it monitors go down — is the focus of Abdurrahman J. Allawala's 2026-05-05 post "Monitoring reliably at scale" (sources/2026-05-05-airbnb-monitoring-reliably-at-scale). Three distinct remedies against circular dependencies:

Compute: dedicated-but-managed Kubernetes clusters

The observability stack no longer runs on shared product / infrastructure clusters. Verbatim: "We isolate our workloads onto dedicated Kubernetes clusters. These clusters aren't shared with product or infrastructure applications, but they're still administered and maintained by the Cloud team." This is the canonical realisation of concepts/dedicated-but-managed-infrastructure — isolation without the operational cost of self-running K8s. See patterns/dedicated-observability-kubernetes-clusters. Operational discipline: "we coordinate changes with the Cloud team so that only one major change lands at a time, and so that changes are validated on lower-priority clusters before reaching operational clusters."

Networking: custom Envoy L7 ingress (not Istio)

Observability traffic no longer rides the shared Istio service mesh. The team built a custom Envoy-based L7 ingress tier independent of the mesh, with header-based tenant routing mapping ~1,000 Airbnb services → cluster backends. Motivating asymmetry: "orders of magnitude more observability traffic than business traffic", and carrying it on the shared mesh created (1) a circular dependency for mesh-plane metrics, (2) congestion-induced blindness on telemetry, and (3) telemetry spikes as a noisy-neighbour that could "consume shared capacity and degrade or disrupt application traffic, directly impacting Airbnb.com availability." Extensibility hooks the custom tier provides: metric mirroring to alternate destinations for testing, fine-grained ACLs for external vendors. Canonicalises an eighth Envoy role on the wiki (telemetry-ingress). See patterns/custom-l7-proxy-for-telemetry-over-service-mesh.
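
The header-based tenant routing described above amounts to a lookup from a tenant header to a cluster backend, optionally with mirror destinations. The Python sketch below shows that lookup only; it is not Envoy configuration and not Airbnb's implementation. The header name `x-tenant-id`, the `Route` type, the `resolve` function, and the cluster names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

TENANT_HEADER = "x-tenant-id"  # hypothetical header name; the source does not name it


@dataclass(frozen=True)
class Route:
    backend: str                    # cluster that receives the tenant's telemetry
    mirrors: Tuple[str, ...] = ()   # optional extra destinations, e.g. for testing


def resolve(headers: Mapping[str, str], routes: Mapping[str, Route]) -> Optional[Route]:
    """Return the route for the request's tenant, or None if it cannot be routed."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    tenant = lowered.get(TENANT_HEADER)
    if tenant is None:
        return None
    return routes.get(tenant)


# Hypothetical routing table: one entry per service (tenant).
routes = {
    "search": Route(backend="application-cluster"),
    "service-mesh": Route(backend="mesh-cluster", mirrors=("test-cluster",)),
}

search_route = resolve({"X-Tenant-Id": "search"}, routes)
mesh_route = resolve({"x-tenant-id": "service-mesh"}, routes)
unknown_route = resolve({"x-tenant-id": "not-onboarded"}, routes)
missing_header = resolve({}, routes)
```

Keeping the table keyed by tenant means onboarding a service or moving it to another cluster is a data change rather than a code change, which fits the ~1,000-entry mapping the section describes.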

Meta-monitoring: HA Prometheus / Alertmanager + Dead Man's Switch

A separate Prometheus fleet watches the observability stack itself (concepts/meta-monitoring). HA Prometheus + Alertmanager pairs run on K8s nodes isolated from the main stack and in different AZs, and no Prometheus–Alertmanager pair has "shared infrastructure" in common with another pair: three levels of anti-affinity. To terminate the "who watches the watcher" regress, the meta-tier emits a continuous dead-man's-switch heartbeat: an always-firing Prometheus rule → Alertmanager → external AWS SNS topic → CloudWatch rate alarm → on-call page when the heartbeat stops. The external AWS control plane is distinct from the K8s-hosted observability stack, so a cluster-wide incident cannot silence the watchdog. See patterns/heartbeat-absence-as-alert-trigger.
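
The receiving end of the dead man's switch inverts the usual alert logic: the page fires on the absence of a signal. A minimal Python sketch of that check follows. It stands in for the rate alarm in the chain above and is not CloudWatch's API or Airbnb's configuration; the `HeartbeatWatchdog` class, the 300-second window, and the minimum of 3 heartbeats are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HeartbeatWatchdog:
    """Pages when too few heartbeats arrived in the trailing window."""

    window_seconds: float
    min_heartbeats: int
    _arrivals: List[float] = field(default_factory=list)

    def record(self, timestamp: float) -> None:
        """Called once per heartbeat delivered by the monitored stack."""
        self._arrivals.append(timestamp)

    def should_page(self, now: float) -> bool:
        """True when the heartbeat rate over the trailing window is too low."""
        cutoff = now - self.window_seconds
        # Forget heartbeats that have aged out of the window.
        self._arrivals = [t for t in self._arrivals if t > cutoff]
        return len(self._arrivals) < self.min_heartbeats


# Hypothetical thresholds: expect at least 3 heartbeats in any 5-minute window.
watchdog = HeartbeatWatchdog(window_seconds=300, min_heartbeats=3)
for ts in (0, 60, 120, 180):       # heartbeats arrive once a minute...
    watchdog.record(ts)

healthy = watchdog.should_page(now=200)    # ...so nothing fires while they flow
silenced = watchdog.should_page(now=600)   # no heartbeat since t=180: page on-call
```

The check only works if it runs somewhere that does not share fate with the stack it watches, which is why the section places it on an external control plane rather than inside the same clusters.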

Design bar

"Treat monitoring as a production system whose availability must exceed that of what it observes." Captured on the wiki under concepts/observability → "Reliability of the observability stack itself".

Design choices worth noting

Preserve legacy metric names rather than forcing renames on 1,000 services — forces the metadata engine to exist but keeps code and telemetry in sync.
Adopt PromQL wholesale rather than keep a permanent compatibility shim: short-term unfamiliarity, long-term ecosystem + LLM leverage.
Expand scope mid-flight to include the alert framework, once it became clear that the migration's value depended on fixing authoring UX, not just swapping storage.
Own the interaction layer. The team's top lesson: visualization, alert authoring, and dashboard workflows are where switching cost lives; owning them makes future backend swaps incremental instead of organization-wide.

Seen in

sources/2026-03-17-airbnb-observability-ownership-migration — migration retrospective, strategy lessons, and the metadata engine.
sources/2026-04-16-airbnb-statsd-to-otel-metrics-pipeline — collection + aggregation tier deep-dive (OTLP dual-write, vmagent routers/aggregators at 100M+ samples/sec, delta temporality for top emitters, zero injection for sparse counters).
sources/2026-03-04-airbnb-alert-backtesting-change-reports — deep-dive on the Change Report + bulk-backtest backend. Key new architectural facts: Prometheus rules/manager.go integration point, Prometheus-time-series-block output format, per-backtest Kubernetes pod isolation, full-diff scope. Impact: 300K alerts migrated from a vendor to Prometheus, 90% reduction in company-wide alert noise, iteration cycles month → afternoon.
sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system — storage-plane deep-dive. Shuffle-sharding for tenant isolation on both reads and writes; tenant-per-application mapping with consolidated control-plane onboarding; single-cluster reliability (per-replica limits, query sharding, 8M-series compaction workers, three-zone stateful deploys); multi-cluster federation via custom Promxy with native-histogram support + query-fanout optimization; progressive cluster rollout (test → internal → app → infra) for >99.9% availability; 5–10× federated-query cost amplification; Grafana K8s rollout operators replacing multi-day manual deploys.