Airbnb: From vendors to vanguard - hard-won lessons in observability ownership
Summary
Airbnb describes a five-year migration from a third-party, vendor-managed
metrics platform to an in-house observability platform built on Prometheus /
PromQL, covering instrumentation, collection, storage, and visualization.
They moved 300M timeseries, 3,100 dashboards, and 300,000+ alerts across
1,000 services. The article is primarily a migration strategy retrospective
— contrasting a failed "v1" approach (start with the hardest service, preserve
every legacy behavior, rely on documentation) against a successful "v2"
(start with an easy well-aligned service, migrate intent of queries not
literal queries, invest in a new alert-authoring framework mid-migration,
own the interaction layer). A side architectural contribution is a metadata
engine layered into the translation system that uses an internal label
(`_otel_metric_type_`) to map each metric to its type (counter / histogram /
gauge) since they preserved legacy metric names instead of renaming to
Prometheus conventions.
Key takeaways
- Motivation: vendor incentives misaligned with user needs. Vendors price on ingested data volume, so costs rise with telemetry growth but more data does not reduce MTTD/MTTR. Being outside the feedback loop of how observability data is consumed also blocked both UX improvements and cost optimizations. (Source: sources/2026-03-17-airbnb-observability-ownership-migration)
- Don't start migration with the hardest service. The instinct to tackle the biggest/most complex service first ("Everest on day one") to prove viability backfires: the team burns cycles on edge cases, hits false alarms, and ships dashboards that don't line up, before anyone trusts the new system. Start with a tractable, well-aligned service instead — enough to validate scale, tooling, and UX with real users but not so much that you drown in translation quirks.
- Migrate the intent of queries, not the literal queries. Over years, dashboards accumulate quietly-wrong metrics (averages instead of p95, sums of latency, etc.). A migration is the rare chance to fix these, but only by mapping intent: e.g., any query asking for a p95 becomes a canonical histogram query, regardless of the (often wrong) aggregations layered on top in the old dashboard.
- Metric-type metadata engine is required when names are preserved. In the Prometheus ecosystem, metric type is conventionally signalled by naming (e.g., a `_total` suffix marks a counter). Airbnb preserved legacy metric names to keep code↔telemetry in sync, so naming-based inference is unreliable. They built a metadata engine into the translation layer that periodically scans all metrics and uses `_otel_metric_type_` (internal label) to maintain a reliable metric→type map, which powers correct translation and AI-generated PromQL.
- Adopt PromQL, then pair with AI tooling seeded with metadata. Rather than keep a permanent compatibility shim that would prevent users from learning the new query language, they accepted the short-term cost of PromQL unfamiliarity and mitigated it with in-house AI skills that consume the metric-type/unit metadata to generate correct PromQL. Common tasks (incident diagnosis, dashboard creation) went from hours to minutes.
- Pull forward an alerts-as-code authoring framework mid-migration. The original plan was to preserve dashboards and alerts as-is. They changed course when it became clear the legacy alerting was holding people back. The new framework treats each alert as a development workflow: authored as code, autocomplete/builder-style query help, backtesting (when would this alert have fired historically?), diffing of changes before deploy. Centralizing alert authoring also reduced per-team migration work.
- Own the interaction layer, not just the backend. The single biggest lesson: visualization tool, alert authoring, and dashboard workflows are where switching cost lives. Had Airbnb already owned those touchpoints, the backend migration (storage engine + query language) would have been far simpler and more incremental. Even teams not migrating should invest in owning the frontend/authoring layer now to reduce future friction.
- Automation is necessary but not sufficient. Automated translators moved 300K alerts / 3.1K dashboards, but blind translation just transfers legacy tech debt into the new system. The high-leverage moves were the deliberate compatibility breaks — canonical histogram queries, new alert framework — accepted because they produced better defaults.
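The "migrate intent, not literal queries" takeaway can be sketched as follows. This is a minimal illustration under stated assumptions, not Airbnb's translator: `LegacyQuery`, `detect_percentile`, and `translate_intent` are hypothetical names, and the `_bucket` suffix and `le` label assume classic Prometheus histogram series exist for the metric.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LegacyQuery:
    """Hypothetical parsed form of a legacy dashboard query."""
    metric: str       # legacy metric name, preserved as-is
    aggregation: str  # what the old panel computed, e.g. "avg", "sum", "p95"
    title: str        # panel title, used as a hint about what was intended


# Matches percentile markers such as "p95" or "P99".
_PERCENTILE = re.compile(r"\bp(\d{2})\b", re.IGNORECASE)


def detect_percentile(query: LegacyQuery) -> Optional[float]:
    """Return the quantile (0-1) the panel was meant to show, if any."""
    for text in (query.aggregation, query.title):
        match = _PERCENTILE.search(text)
        if match:
            return int(match.group(1)) / 100
    return None


def translate_intent(query: LegacyQuery, window: str = "5m") -> str:
    """Emit a canonical PromQL query for the intent, ignoring legacy aggregations."""
    quantile = detect_percentile(query)
    if quantile is not None:
        # Any percentile intent becomes the same canonical histogram query,
        # whatever (possibly wrong) aggregation the old panel layered on top.
        return (
            f"histogram_quantile({quantile}, "
            f"sum by (le) (rate({query.metric}_bucket[{window}])))"
        )
    # No recognizable intent: fall back to a literal translation.
    return f"{query.aggregation}({query.metric})"
```

For example, a panel titled "Checkout latency p95" that actually averaged the metric would translate to a `histogram_quantile(0.95, ...)` query rather than to `avg(...)`, which is the deliberate compatibility break the takeaways describe.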
Architectural facts & numbers
- Scale migrated: 1,000 services, 300M timeseries, 3,100 dashboards, 300,000+ alerts.
- Duration: ~5 years; plans evolved significantly mid-flight.
- Query language: PromQL (chosen for mature ecosystem + LLM familiarity).
- Metric-type source of truth: internal label `_otel_metric_type_` per metric, scanned periodically by a metadata engine in the translation layer.
- Full stack replaced: instrumentation + collection + storage + visualization + alerting.
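The metadata engine (a periodic scan that maintains a metric-to-type map) might look roughly like the sketch below. It is an assumption-laden illustration: the article does not describe the implementation, so `build_type_map` and `default_query` are hypothetical, the scan is modelled as an in-memory list of label sets instead of a real storage query, and the label values `counter` and `gauge` are assumed. Histograms are omitted for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Set

TYPE_LABEL = "_otel_metric_type_"  # internal label name, as given in the summary


def build_type_map(series: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Scan series label sets and return metric name -> type.

    Metrics whose series disagree about the type are left out, so
    downstream translation never relies on an ambiguous entry.
    """
    seen: Dict[str, Set[str]] = defaultdict(set)
    for labels in series:
        name = labels.get("__name__")
        metric_type = labels.get(TYPE_LABEL)
        if name and metric_type:
            seen[name].add(metric_type)
    return {
        name: next(iter(types))
        for name, types in seen.items()
        if len(types) == 1
    }


def default_query(metric: str, type_map: Mapping[str, str], window: str = "5m") -> str:
    """Pick a type-appropriate PromQL query using the map, not the metric name."""
    metric_type = type_map.get(metric)
    if metric_type == "counter":
        # Counters only make sense as rates.
        return f"sum(rate({metric}[{window}]))"
    if metric_type == "gauge":
        # Gauges are read directly.
        return f"avg({metric})"
    raise ValueError(f"no unambiguous type known for metric {metric!r}")
```

The design point is that a legacy name such as `requests` carries no `_total` suffix, yet the map still identifies it as a counter, so a translator or an AI assistant can wrap it in `rate()` instead of guessing from the name.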
Systems / concepts / patterns extracted
- Systems: Airbnb in-house observability platform (new), Prometheus (storage/query), PromQL (query language), OpenTelemetry (metadata labels), new alert-authoring framework from the Reliability XP team.
- Concepts: observability (metrics/logs/traces triad), metric-type metadata (explicit type labels instead of naming conventions), observability ownership (own the interaction layer, not just the backend), feedback loop between platform team and users of telemetry.
- Patterns: intent-preserving query translation (map what the user was trying to measure, not the literal query), alerts-as-code (author/review/diff/backtest alerts like code), achievable-target-first migration (start with a tractable well-aligned workload, not the hardest one), own-the-interaction-layer (migrations are cheaper when you already control the UX surfaces users touch).
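The alerts-as-code pattern (author, diff, and backtest alerts like code) can be illustrated with a small sketch. Everything here is hypothetical and is not the Reliability XP team's framework: `AlertRule`, `backtest`, and `diff_rules` are invented names, the rule fires when a value stays above a threshold for a number of consecutive samples, and history is passed in as `(timestamp, value)` pairs instead of being fetched from storage.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert definition that lives in version control."""
    name: str
    expr: str          # PromQL expression the alert evaluates
    threshold: float   # fire when the value is strictly above this
    for_samples: int   # consecutive breaching samples required before firing


def backtest(rule: AlertRule, samples: Sequence[Tuple[int, float]]) -> List[int]:
    """Return the timestamps at which the rule would have started firing."""
    fired: List[int] = []
    streak = 0
    for timestamp, value in samples:
        if value > rule.threshold:
            streak += 1
            # Record only the moment the streak first reaches the requirement.
            if streak == rule.for_samples:
                fired.append(timestamp)
        else:
            streak = 0
    return fired


def diff_rules(old: AlertRule, new: AlertRule) -> Dict[str, Tuple[object, object]]:
    """Return the fields that changed between two versions as (old, new) pairs."""
    old_fields, new_fields = asdict(old), asdict(new)
    return {
        key: (old_fields[key], new_fields[key])
        for key in old_fields
        if old_fields[key] != new_fields[key]
    }
```

Running `backtest` for both the current and the proposed version of a rule against the same history, alongside `diff_rules`, gives a reviewer the field-level change and its effect on historical firings before anything is deployed.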
Caveats
- No quantitative before/after comparison of MTTD, MTTR, cost, or alert quality; the article is a qualitative retrospective.
- No detail on the storage engine beyond "Prometheus-based"; no discussion of cardinality limits, long-term storage, downsampling, or multi-region topology.
- No detail on the data-collection or ingest pipeline architecture (sharding, replication, write path).
- The article frames its migration lessons as universal, but Airbnb's scale and 5-year runway won't generalize to smaller orgs: a single-service, 3-month migration has different tradeoffs than a 1,000-service, 5-year one.
- The new alert framework and AI PromQL-generation tooling are mentioned at a high level but not benchmarked.
Links
- Raw: raw/airbnb/2026-03-17-from-vendors-to-vanguard-airbnbs-hard-won-lessons-in-observa-fd94d44b.md
- Original: https://medium.com/airbnb-engineering/from-vendors-to-vanguard-airbnbs-hard-won-lessons-in-observability-ownership-3811bf6c1ac3