SYSTEM · Cited by 3 sources
Airbnb observability platform¶
Airbnb's in-house metrics platform, built on Prometheus and PromQL, replacing a third-party vendor-managed system after a ~5-year migration. Scope covers instrumentation → collection → storage → visualization → alerting end to end, so Airbnb owns the full lifecycle of metrics.
Scale (at migration completion)¶
- ~1,000 services
- ~300M timeseries
- ~3,100 dashboards
- ~300,000 alerts
Components¶
- Instrumentation: OTLP preferred for internal services, Prometheus for OSS workloads, StatsD (DogStatsD format) as a legacy fallback. Migration off StatsD went through a shared metrics library that dual-wrote both protocols at once — see patterns/dual-write-migration. CPU spent on JVM metrics dropped from 10% to <1% of profiled samples after the OTLP cutover. (Source: sources/2026-04-16-airbnb-statsd-to-otel-metrics-pipeline)
- Temporality: cumulative by default; select high-volume emitters use delta temporality to bound SDK memory (see concepts/metric-temporality).
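The cumulative/delta trade-off can be sketched minimally (this is an illustration of the temporality concept, not Airbnb's SDK): a delta counter resets its state after each export, so high-volume emitters don't accumulate unbounded per-series state between scrapes.

```python
class Counter:
    """Minimal sketch of export temporality for a counter.

    cumulative: export the running total since process start.
    delta: export only what accumulated since the last export, then
    reset -- per-series SDK state stays bounded for high-volume emitters.
    """

    def __init__(self, temporality: str = "cumulative"):
        self.temporality = temporality
        self.value = 0.0

    def add(self, v: float) -> None:
        self.value += v

    def export(self) -> float:
        out = self.value
        if self.temporality == "delta":
            self.value = 0.0  # reset after export
        return out

c = Counter("delta")
c.add(3)
c.export()  # 3.0
c.add(2)
c.export()  # 2.0 under delta; a cumulative counter would report 5.0
```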
- Collection: OpenTelemetry Collector as the vendor-neutral ingress; replaced the previous StatsD/Veneur-sidecar path.
- Streaming-aggregation tier: two-tier vmagent deployment — stateless routers consistent-hash on non-aggregated labels and shard to stateful aggregators (a StatefulSet) that keep running totals. Scales to hundreds of aggregators and 100M+ samples/sec per cluster, at roughly 10× lower cost than storing raw metrics. See systems/vmagent and concepts/streaming-aggregation. Also functions as a metric-wide control point: drop bad metrics, dual-emit raw metrics on demand, inject patterns/zero-injection-counter seeds for sparse counters.
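The router/aggregator split above can be sketched as follows. All names, shard counts, and the hashing scheme are illustrative assumptions (a real deployment would use consistent hashing so resharding moves few keys; plain modulo is used here for brevity). The key idea: routers hash only the labels that survive aggregation, so every sample belonging to the same output series lands on the same stateful aggregator.

```python
import hashlib

NUM_AGGREGATORS = 4                     # hypothetical aggregator shard count
AGGREGATED_AWAY = {"pod", "instance"}   # labels dropped during aggregation

def shard_for(labels: dict) -> int:
    """Route on the labels that remain AFTER aggregation, so every input
    sample for one output series hits the same stateful aggregator."""
    keep = sorted((k, v) for k, v in labels.items() if k not in AGGREGATED_AWAY)
    digest = hashlib.sha256(repr(keep).encode()).hexdigest()
    return int(digest, 16) % NUM_AGGREGATORS

# Two pods emitting the same logical series shard identically:
a = shard_for({"service": "checkout", "status": "500", "pod": "a"})
b = shard_for({"service": "checkout", "status": "500", "pod": "b"})
assert a == b

# An aggregator keeps running totals keyed by the reduced label set:
totals: dict = {}

def aggregate(labels: dict, value: float) -> float:
    key = tuple(sorted((k, v) for k, v in labels.items()
                       if k not in AGGREGATED_AWAY))
    totals[key] = totals.get(key, 0.0) + value
    return totals[key]
```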
- Storage / query engine: Prometheus-based. PromQL as the user-facing query language.
- Translation layer: automated translators that moved legacy dashboards and alerts into the new system during migration. Translates intent (e.g., canonical histogram query for any p95 request) rather than doing a literal query port — see patterns/intent-preserving-query-translation.
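A minimal sketch of intent-preserving translation (the function name and legacy-query shape are assumptions, not the actual translator's API): instead of porting a vendor query token-by-token, the translator recognizes the intent — "p95 of this latency metric" — and emits the canonical PromQL histogram query for it.

```python
def canonical_quantile_query(metric: str, quantile: float,
                             window: str = "5m") -> str:
    """Emit the canonical PromQL histogram-quantile query for a metric,
    rather than literally transliterating the vendor's query syntax."""
    return (
        f"histogram_quantile({quantile}, "
        f"sum by (le) (rate({metric}_bucket[{window}])))"
    )

q = canonical_quantile_query("http_request_duration_seconds", 0.95)
# -> histogram_quantile(0.95,
#      sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```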
- Metadata engine (inside the translation layer): periodically scans all metrics and maintains a reliable metric → type mapping (counter / histogram / gauge) using an internal _otel_metric_type_label, since Airbnb kept legacy metric names instead of renaming to Prometheus conventions. See concepts/metric-type-metadata.
- AI tooling: in-house LLM skills seeded with the metadata engine's type/unit info, so agents can generate correct PromQL with minimal manual effort. Used for incident diagnosis and dashboard bootstrapping.
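A sketch of what the metadata engine's periodic scan produces. The label name `_otel_metric_type_label` is from the source; the series data and function are invented for illustration. The mapping matters because legacy names carry no Prometheus-style suffix (`_total`, `_bucket`) from which type could otherwise be inferred.

```python
# Hypothetical scan result: one sample label set per metric, carrying
# Airbnb's internal type label (series contents invented for illustration).
series = [
    {"__name__": "app.request.count",   "_otel_metric_type_label": "counter"},
    {"__name__": "app.request.latency", "_otel_metric_type_label": "histogram"},
    {"__name__": "app.queue.depth",     "_otel_metric_type_label": "gauge"},
]

def build_type_map(series: list) -> dict:
    """Periodic scan: maintain a reliable metric-name -> type mapping,
    needed because legacy names don't follow Prometheus conventions."""
    return {s["__name__"]: s["_otel_metric_type_label"] for s in series}

type_map = build_type_map(series)
# type_map["app.request.latency"] == "histogram"
```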
- New alert-authoring framework (from the Reliability XP team): treats alerts as code — autocomplete, builder-style query help, historical backtesting ("when would this have fired?"), diffing of changes before deploy. Replaces a legacy sparsely-documented config-file style. See patterns/alerts-as-code.
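The post doesn't show the framework's actual API, so the following is a generic alerts-as-code sketch under assumed names: an alert is a typed object in code that renders to a Prometheus-style rule, which makes it diffable in review and replayable by the backtester before deploy.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    expr: str            # PromQL; the real tool adds autocomplete/builder help
    for_: str = "5m"     # "for" is a Python keyword, hence the underscore
    severity: str = "page"

    def to_rule(self) -> dict:
        """Render as a Prometheus-style rule object: code-reviewable,
        diffable, and backtestable ("when would this have fired?")."""
        return {
            "alert": self.name,
            "expr": self.expr,
            "for": self.for_,
            "labels": {"severity": self.severity},
        }

rule = Alert(
    "HighErrorRate",
    "sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.05",
).to_rule()
```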
- Change Report + bulk-backtest backend (detail from sources/2026-03-04-airbnb-alert-backtesting-change-reports): hooks directly into Prometheus's rules/manager.go rule-evaluation engine (compatibility over novelty) and writes backtest results as Prometheus time-series blocks exposed via the standard range-query API. Each backtest runs in its own Kubernetes pod with autoscaling; concurrency limits, error thresholds, and multiple circuit breakers prevent cascading failures. Runs at full-diff granularity (hundreds to thousands of alerts per PR) over a typical 30-day window, and surfaces a computed "noisiness" metric plus a firing-count timeline, sortable in the Change Report UI. Modified recording-rule dependencies are highlighted in the UI as a guided two-step flow rather than resolved by the simulator. Change Reports post automatically on every PR (via CLI or CI). See patterns/alert-backtesting.
- New visualization tool: replaced the vendor UI as part of the migration (specific tool not named in the post).
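The backtester's "noisiness" formula isn't defined in the post; one plausible sketch, under that caveat: replay the rule over the backtest window to get a per-evaluation-interval "would this have fired?" timeline, then score distinct firing episodes and the fraction of time spent firing.

```python
def backtest_noisiness(would_fire: list) -> tuple:
    """Given a per-interval 'would this have fired?' timeline (the result
    of replaying a rule over, e.g., a 30-day window), return
    (distinct firing episodes, fraction of intervals spent firing)."""
    episodes = 0
    prev = False
    for firing in would_fire:
        if firing and not prev:   # a rising edge starts a new episode
            episodes += 1
        prev = firing
    duty_cycle = sum(would_fire) / len(would_fire) if would_fire else 0.0
    return episodes, duty_cycle

count, duty = backtest_noisiness([False, True, True, False, True, False])
# two distinct firing episodes; firing for half of the six intervals
```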
Design choices worth noting¶
- Preserve legacy metric names rather than forcing renames on 1,000 services — forces the metadata engine to exist but keeps code and telemetry in sync.
- Adopt PromQL wholesale rather than keep a permanent compatibility shim: short-term unfamiliarity, long-term ecosystem + LLM leverage.
- Expand scope mid-flight to include the alert framework, once it became clear that the migration's value depended on fixing authoring UX, not just swapping storage.
- Own the interaction layer. The team's top lesson: visualization, alert authoring, and dashboard workflows are where switching cost lives; owning them makes future backend swaps incremental instead of organization-wide.
Seen in¶
- sources/2026-03-17-airbnb-observability-ownership-migration — migration retrospective, strategy lessons, and the metadata engine.
- sources/2026-04-16-airbnb-statsd-to-otel-metrics-pipeline — collection + aggregation tier deep-dive (OTLP dual-write, vmagent routers/aggregators at 100M+ samples/sec, delta temporality for top emitters, zero injection for sparse counters).
- sources/2026-03-04-airbnb-alert-backtesting-change-reports — deep-dive on the Change Report + bulk-backtest backend. Key new architectural facts: Prometheus rules/manager.go integration point, Prometheus-time-series-block output format, per-backtest Kubernetes pod isolation, full-diff scope. Impact: 300K alerts migrated from a vendor to Prometheus, 90% reduction in company-wide alert noise, iteration cycles month → afternoon.
Related¶
- concepts/observability
- concepts/metric-type-metadata
- concepts/metric-temporality
- concepts/streaming-aggregation
- patterns/intent-preserving-query-translation
- patterns/alerts-as-code
- patterns/alert-backtesting
- patterns/achievable-target-first-migration
- patterns/dual-write-migration
- patterns/zero-injection-counter
- systems/vmagent