
Airbnb observability platform

Airbnb's in-house metrics platform, built on Prometheus and PromQL, replacing a third-party vendor-managed system after a ~5-year migration. Scope covers instrumentation → collection → storage → visualization → alerting end to end, so Airbnb owns the full lifecycle of metrics.

Scale (at migration completion)

  • ~1,000 services
  • ~300M timeseries
  • ~3,100 dashboards
  • ~300,000 alerts

Components

  • Instrumentation: OTLP preferred for internal services, Prometheus for OSS workloads, StatsD (DogStatsD format) as legacy fallback. Migration from StatsD happened via a dual-write shared metrics library (both protocols at once) — see patterns/dual-write-migration. JVM metrics' share of CPU profile samples dropped from 10% to <1% after the OTLP cutover. (Source: sources/2026-04-16-airbnb-statsd-to-otel-metrics-pipeline)
  • Temporality: cumulative by default; select high-volume emitters use delta temporality to bound SDK memory (see concepts/metric-temporality).
  • Collection: OpenTelemetry Collector as the vendor-neutral ingress; replaced the previous StatsD/Veneur-sidecar path.
  • Streaming-aggregation tier: two-tier vmagent deployment — stateless routers consistent-hash on non-aggregated labels and shard to stateful aggregators (a StatefulSet) that keep running totals. Scales to hundreds of aggregators and 100M+ samples/sec per cluster, at roughly 10× lower cost than storing raw metrics. See systems/vmagent and concepts/streaming-aggregation. Also functions as a metric-wide control point: drop bad metrics, dual-emit raw metrics on demand, inject patterns/zero-injection-counter seeds for sparse counters.
  • Storage / query engine: Prometheus-based. PromQL as the user-facing query language.
  • Translation layer: automated translators that moved legacy dashboards and alerts into the new system during migration. Translates intent (e.g., canonical histogram query for any p95 request) rather than doing a literal query port — see patterns/intent-preserving-query-translation.
  • Metadata engine (inside the translation layer): periodically scans all metrics and maintains a reliable metric → type mapping (counter / histogram / gauge) using an internal _otel_metric_type_ label, since Airbnb kept legacy metric names instead of renaming to Prometheus conventions. See concepts/metric-type-metadata.
  • AI tooling: in-house LLM skills seeded with the metadata engine's type/unit info, so agents can generate correct PromQL with minimal manual effort. Used for incident diagnosis and dashboard bootstrapping.
  • New alert-authoring framework (from the Reliability XP team): treats alerts as code — autocomplete, builder-style query help, historical backtesting ("when would this have fired?"), diffing of changes before deploy. Replaces a legacy sparsely-documented config-file style. See patterns/alerts-as-code.
  • Change Report + bulk-backtest backend (detail from sources/2026-03-04-airbnb-alert-backtesting-change-reports): hooks directly into Prometheus's rules/manager.go rule-evaluation engine (compatibility-over-novelty), writes backtest results as Prometheus time series blocks exposed via the standard range-query API. Each backtest runs in its own Kubernetes pod with autoscaling; concurrency limits + error thresholds + multiple circuit breakers prevent cascading failures. Runs at full-diff granularity (hundreds to thousands of alerts per PR), typical window 30 days, surfaces a computed "noisiness" metric + firing-count timeline sortable in the Change Report UI. Modified recording-rule dependencies are highlighted in the UI as a guided two-step flow rather than resolved by the simulator. Change Reports post automatically on every PR (via CLI or CI). See patterns/alert-backtesting.
  • New visualization tool: replaced the vendor UI as part of the migration (specific tool not named in the post).
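The dual-write migration pattern from the Instrumentation bullet can be sketched as a facade whose single API call emits in both protocols at once. This is a minimal illustration, not Airbnb's actual library — the class name, backend interfaces, and DogStatsD line format shown are assumptions for the sketch:

```python
class DualWriteCounter:
    """Hypothetical dual-write counter: one increment() call emits both a
    legacy DogStatsD line and a structured OTLP-style data point."""

    def __init__(self, name, statsd_backend, otlp_backend):
        self.name = name
        self.statsd = statsd_backend  # legacy sink (list here; a socket in practice)
        self.otlp = otlp_backend      # new sink (list here; an OTLP exporter in practice)

    def increment(self, value=1, tags=None):
        tags = tags or {}
        # Legacy path: DogStatsD line protocol, e.g. "name:1|c|#k:v"
        tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        suffix = f"|#{tag_str}" if tag_str else ""
        self.statsd.append(f"{self.name}:{value}|c{suffix}")
        # New path: structured data point with attributes
        self.otlp.append({"name": self.name, "value": value, "attributes": tags})

statsd_lines, otlp_points = [], []
counter = DualWriteCounter("requests_total", statsd_lines, otlp_points)
counter.increment(tags={"route": "/home"})
```

Because both sinks receive every sample, the old and new pipelines can be compared side by side before the legacy path is switched off.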
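The temporality trade-off is easiest to see in code: a delta-emitting SDK only holds the change since the last export (bounding its memory), and the collector folds those deltas back into the cumulative series the backend stores. A toy conversion, not any specific collector's implementation:

```python
def to_cumulative(delta_points):
    """Fold (timestamp, delta) data points into a cumulative running total —
    the conversion a collector performs for SDKs exporting delta temporality."""
    total, out = 0, []
    for ts, delta in delta_points:
        total += delta
        out.append((ts, total))
    return out
```
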
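The router-to-aggregator sharding in the streaming-aggregation tier relies on hashing only the labels that survive aggregation, so every input series that will merge into the same output lands on the same stateful aggregator. A simplified modulo-hash sketch (a real deployment would use consistent or rendezvous hashing so that scaling events reshuffle only a fraction of series; `dropped_labels` is an illustrative parameter name):

```python
import hashlib

def shard_for(labels: dict, dropped_labels: set, num_shards: int) -> int:
    """Route a sample by hashing only the labels kept after aggregation,
    so all series collapsing to one output series share an aggregator."""
    kept = {k: v for k, v in labels.items() if k not in dropped_labels}
    # Stable key: sorted label pairs joined with a separator unlikely to
    # appear in label values.
    key = "\x1f".join(f"{k}={v}" for k, v in sorted(kept.items()))
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Two series differing only in a dropped label route to the same shard:
a = shard_for({"__name__": "rps", "pod": "a", "route": "/x"}, {"pod"}, 64)
b = shard_for({"__name__": "rps", "pod": "b", "route": "/x"}, {"pod"}, 64)
```
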
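The translation layer's "translate intent, not queries" idea can be sketched as a generator that, given the metadata engine's metric-type mapping, emits one canonical PromQL form per intent. The function name and type strings below are illustrative, not Airbnb's translator API:

```python
def canonical_p95(metric: str, metric_type: str, window: str = "5m") -> str:
    """Emit the canonical PromQL for 'p95 of this metric', chosen by the
    metric's type rather than by literally porting the legacy query."""
    if metric_type == "histogram":
        # Canonical histogram form: aggregate bucket rates, then take the quantile.
        return (f"histogram_quantile(0.95, "
                f"sum by (le) (rate({metric}_bucket[{window}])))")
    if metric_type == "summary":
        # Summaries ship pre-computed quantiles; just select the right one.
        return f'{metric}{{quantile="0.95"}}'
    raise ValueError(f"no p95 translation for metric type {metric_type!r}")
```

Keying the translation on type metadata is what lets one rule cover thousands of legacy dashboards: every histogram gets the same correct `histogram_quantile` shape regardless of how the vendor query was written.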
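The "when would this have fired?" backtest reduces, at its core, to replaying historical samples through an alert rule's threshold-plus-`for`-duration semantics. This toy replay ignores everything the real backend does (Prometheus rule engine integration, TSDB block output, per-backtest pods) and only illustrates the semantics:

```python
def backtest(samples, threshold, for_seconds):
    """Replay (timestamp, value) samples against a 'value > threshold for
    for_seconds' rule; return the timestamps at which it would have fired."""
    pending_since, firing, fired_at = None, False, []
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true: start pending
            if not firing and ts - pending_since >= for_seconds:
                firing = True       # pending long enough: the alert fires
                fired_at.append(ts)
        else:
            pending_since, firing = None, False  # condition cleared: reset
    return fired_at
```

Running this over a 30-day window per alert, and counting entries in `fired_at`, is one plausible basis for the "noisiness" metric and firing-count timeline the Change Report surfaces.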

Design choices worth noting

  • Preserve legacy metric names rather than forcing renames on 1,000 services — forces the metadata engine to exist but keeps code and telemetry in sync.
  • Adopt PromQL wholesale rather than keep a permanent compatibility shim: short-term unfamiliarity, long-term ecosystem + LLM leverage.
  • Expand scope mid-flight to include the alert framework, once it became clear that the migration's value depended on fixing authoring UX, not just swapping storage.
  • Own the interaction layer. The team's top lesson: visualization, alert authoring, and dashboard workflows are where switching cost lives; owning them makes future backend swaps incremental instead of organization-wide.
