AIRBNB 2026-03-17 Tier 2

Airbnb: From vendors to vanguard — hard-won lessons in observability ownership

Summary

Airbnb describes a five-year migration from a third-party, vendor-managed metrics platform to an in-house observability platform built on Prometheus / PromQL, covering instrumentation, collection, storage, and visualization. They moved 300M timeseries, 3,100 dashboards, and 300,000+ alerts across 1,000 services. The article is primarily a migration-strategy retrospective: it contrasts a failed "v1" approach (start with the hardest service, preserve every legacy behavior, rely on documentation) with a successful "v2" (start with an easy, well-aligned service, migrate the intent of queries rather than the literal queries, invest in a new alert-authoring framework mid-migration, own the interaction layer). A secondary architectural contribution is a metadata engine layered into the translation system. Because Airbnb preserved legacy metric names instead of renaming to Prometheus conventions, the engine uses an internal label (_otel_metric_type_) to map each metric to its type (counter / histogram / gauge).

Key takeaways

  1. Motivation: vendor incentives misaligned with user needs. Vendors price on ingested data volume, so costs rise with telemetry growth but more data does not reduce MTTD/MTTR. Being outside the feedback loop of how observability data is consumed also blocked both UX improvements and cost optimizations. (Source: sources/2026-03-17-airbnb-observability-ownership-migration)
  2. Don't start migration with the hardest service. The instinct to tackle the biggest/most complex service first ("Everest on day one") to prove viability backfires: the team burns cycles on edge cases, hits false alarms, and ships dashboards that don't line up, before anyone trusts the new system. Start with a tractable, well-aligned service instead — enough to validate scale, tooling, and UX with real users but not so much that you drown in translation quirks.
  3. Migrate the intent of queries, not the literal queries. Over years, dashboards accumulate quietly-wrong metrics (averages instead of p95, sums of latency, etc.). A migration is the rare chance to fix these, but only by mapping intent: e.g., any query asking for a p95 becomes a canonical histogram query, regardless of the (often wrong) aggregations layered on top in the old dashboard.
  4. Metric-type metadata engine is required when names are preserved. In the Prometheus ecosystem, metric types are usually inferred from naming conventions (e.g., a _total suffix marks a counter). Airbnb preserved legacy metric names to keep code↔telemetry in sync, so naming-based inference is unreliable there. They built a metadata engine into the translation layer that periodically scans all metrics and uses _otel_metric_type_ (an internal label) to maintain a reliable metric→type map, which powers correct translation and AI-generated PromQL.
  5. Adopt PromQL, then pair with AI tooling seeded with metadata. Rather than keep a permanent compatibility shim that would prevent users from learning the new query language, they accepted the short-term cost of PromQL unfamiliarity and mitigated it with in-house AI skills that consume the metric-type/unit metadata to generate correct PromQL. Common tasks (incident diagnosis, dashboard creation) went from hours to minutes.
  6. Pull forward an alerts-as-code authoring framework mid-migration. The original plan was to preserve dashboards and alerts as-is. They changed course when it became clear the legacy alerting was holding people back. The new framework treats each alert as a development workflow: authored as code, autocomplete/builder-style query help, backtesting (when would this alert have fired historically?), diffing of changes before deploy. Centralizing alert authoring also reduced per-team migration work.
  7. Own the interaction layer, not just the backend. The single biggest lesson: visualization tool, alert authoring, and dashboard workflows are where switching cost lives. Had Airbnb already owned those touchpoints, the backend migration (storage engine + query language) would have been far simpler and more incremental. Even teams not migrating should invest in owning the frontend/authoring layer now to reduce future friction.
  8. Automation is necessary but not sufficient. Automated translators moved 300K alerts / 3.1K dashboards, but blind translation just transfers legacy tech debt into the new system. The high-leverage moves were the deliberate compatibility breaks — canonical histogram queries, new alert framework — accepted because they produced better defaults.
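
Takeaways 3 and 4 can be illustrated together with a minimal sketch of intent-based translation driven by a metric→type map. This is not Airbnb's translator: the metric names, the `translate_intent` function, the intent vocabulary, and the assumption that histograms are exposed as classic Prometheus histograms with a `_bucket` suffix are all hypothetical choices made for illustration.

```python
# Hypothetical metric -> type map, standing in for what a metadata engine
# would maintain. The names deliberately lack Prometheus suffixes such as
# _total, so the type cannot be guessed from the name alone.
METRIC_TYPES = {
    "checkout_latency_ms": "histogram",
    "checkout_requests": "counter",
    "checkout_queue_depth": "gauge",
}

# Which metric type each supported intent requires (illustrative vocabulary).
INTENT_REQUIRES = {"p95": "histogram", "rate": "counter", "current": "gauge"}


def translate_intent(metric: str, intent: str, window: str = "5m") -> str:
    """Return a canonical PromQL string for what the user wants to measure.

    Whatever aggregations the legacy query layered on top are ignored;
    only the intent and the metric's recorded type decide the output.
    """
    if intent not in INTENT_REQUIRES:
        raise ValueError(f"unsupported intent: {intent!r}")
    metric_type = METRIC_TYPES.get(metric)
    if metric_type is None:
        raise KeyError(f"no type metadata for {metric!r}")
    required = INTENT_REQUIRES[intent]
    if metric_type != required:
        raise ValueError(
            f"intent {intent!r} needs a {required}, but {metric!r} is a {metric_type}"
        )
    if intent == "p95":
        # Assumption: classic histogram series named <metric>_bucket with an `le` label.
        return (
            f"histogram_quantile(0.95, "
            f"sum by (le) (rate({metric}_bucket[{window}])))"
        )
    if intent == "rate":
        return f"sum(rate({metric}[{window}]))"
    return f"avg({metric})"


if __name__ == "__main__":
    print(translate_intent("checkout_latency_ms", "p95"))
```

Rejecting a p95 request against a non-histogram metric, rather than emitting some averaged approximation, is one way to avoid carrying a quietly-wrong legacy query into the new system.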

Architectural facts & numbers

  • Scale migrated: 1,000 services, 300M timeseries, 3,100 dashboards, 300,000+ alerts.
  • Duration: ~5 years; plans evolved significantly mid-flight.
  • Query language: PromQL (chosen for mature ecosystem + LLM familiarity).
  • Metric-type source of truth: internal label _otel_metric_type_ per metric, scanned periodically by a metadata engine in the translation layer.
  • Full stack replaced: instrumentation + collection + storage + visualization + alerting.
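
The periodic scan behind the metric-type source of truth might look roughly like the following sketch. It assumes each series arrives as a plain dict of labels that includes `__name__` and the `_otel_metric_type_` label. The `build_type_map` function and its conflict handling are hypothetical, since the article does not describe the engine's internals.

```python
from collections import defaultdict

TYPE_LABEL = "_otel_metric_type_"  # internal label named in the article


def build_type_map(series):
    """Build a metric -> type map from an iterable of series label sets.

    Returns (type_map, conflicts). A metric seen with more than one type
    is left out of type_map and reported in conflicts instead, so that a
    translator never silently picks one of several disagreeing types.
    """
    seen = defaultdict(set)
    for labels in series:
        name = labels.get("__name__")
        metric_type = labels.get(TYPE_LABEL)
        if name and metric_type:
            seen[name].add(metric_type)

    type_map = {}
    conflicts = {}
    for name, types in seen.items():
        if len(types) == 1:
            type_map[name] = next(iter(types))
        else:
            conflicts[name] = sorted(types)
    return type_map, conflicts


if __name__ == "__main__":
    # Hypothetical scan result; real label sets would come from the
    # metrics backend.
    scanned = [
        {"__name__": "checkout_requests", TYPE_LABEL: "counter", "pod": "a"},
        {"__name__": "checkout_requests", TYPE_LABEL: "counter", "pod": "b"},
        {"__name__": "checkout_latency_ms", TYPE_LABEL: "histogram"},
        {"__name__": "legacy_thing", TYPE_LABEL: "gauge"},
        {"__name__": "legacy_thing", TYPE_LABEL: "counter"},
    ]
    print(build_type_map(scanned))
```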

Systems / concepts / patterns extracted

  • Systems: Airbnb in-house observability platform (new), Prometheus (storage/query), PromQL (query language), OpenTelemetry (metadata labels), new alert-authoring framework from the Reliability XP team.
  • Concepts: observability (metrics/logs/traces triad), metric-type metadata (explicit type labels instead of naming conventions), observability ownership (own the interaction layer, not just the backend), feedback loop between platform team and users of telemetry.
  • Patterns: intent-preserving query translation (map what the user was trying to measure, not the literal query), alerts-as-code (author/review/diff/backtest alerts like code), achievable-target-first migration (start with a tractable well-aligned workload, not the hardest one), own-the-interaction-layer (migrations are cheaper when you already control the UX surfaces users touch).
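
The backtesting part of the alerts-as-code pattern (when would this alert have fired historically?) can be sketched as a replay over stored samples. The `backtest` function, the threshold-plus-duration rule shape, and the sample data below are illustrative assumptions. The logic is a simplification and does not reproduce the evaluation-interval semantics of any particular alerting engine.

```python
def backtest(samples, threshold, for_seconds):
    """Replay (timestamp, value) samples against a simple alert rule.

    The rule fires once the value has stayed above `threshold` for at
    least `for_seconds` without interruption. Returns the timestamps at
    which the alert would have transitioned to firing.
    """
    fired_at = []
    pending_since = None  # when the condition first became true
    firing = False
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts
            if not firing and ts - pending_since >= for_seconds:
                firing = True
                fired_at.append(ts)
        else:
            # Condition cleared: reset both the pending and firing state.
            pending_since = None
            firing = False
    return fired_at


if __name__ == "__main__":
    # Hypothetical history sampled every 60 seconds.
    history = [(0, 1), (60, 5), (120, 6), (180, 7), (240, 1), (300, 9), (360, 9)]
    print(backtest(history, threshold=4, for_seconds=120))
```

Running a proposed rule change through the same replay and comparing the two lists of firing times gives a rough preview of how the change would have behaved before it is deployed.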

Caveats

  • No quantitative pre/post comparison of MTTD, MTTR, cost, or alert quality; the article is a qualitative retrospective.
  • No detail on the storage engine beyond "Prometheus-based"; no discussion of cardinality limits, long-term storage, downsampling, or multi-region topology.
  • No detail on the data-collection or ingest pipeline architecture (sharding, replication, write path).
  • The article frames its migration lessons as universal, but Airbnb's scale and 5-year runway won't generalize to smaller orgs: a single-service, 3-month migration has different tradeoffs than a 1,000-service, 5-year one.
  • The new alert framework and AI PromQL-generation tooling are mentioned at a high level but not benchmarked.
