Airbnb: It wasn't a culture problem — upleveling alert development at Airbnb¶
Summary¶
Follow-up to the 2026-03-17 observability-migration post by Airbnb's Reliability Experience team, going deep on the alert-authoring platform itself. Reframes what looked like an engineering-culture problem ("nobody cares about alert hygiene") as a workflow problem: the legacy deploy-wait-iterate cycle made changing alerts so painful that engineers rationally avoided it. The platform's three pillars — local-first development, Change Reports, and bulk backtesting on historical data — collapsed that cycle. Measured outcomes: 300,000 alerts migrated from a vendor to Prometheus, month-long iteration cycles compressed to an afternoon, and company-wide alert noise reduced by 90%.
Key takeaways¶
- Alert backtesting as the load-bearing primitive. Proposed alerts are evaluated against historical metric data to answer "how would this alert have behaved in production?" — before it ships. Implemented by hooking directly into Prometheus's rule manager and writing results out as Prometheus time series blocks queryable via the standard range-query API. (Source: sources/2026-03-04-airbnb-alert-backtesting-change-reports)
- Bulk-at-diff-granularity + a noisiness metric. Every PR change to alerts-as-code gets a Change Report that backtests the entire diff — hundreds or thousands of related alerts at once — and surfaces a computed "noisiness" metric + firing timelines in a table view, letting reviewers sort and focus on the problematic ones instead of reading thousands of YAML lines. This is how a platform-team template change affecting thousands of services becomes reviewable.
- Compatibility over novelty. A deliberate architectural call: take Prometheus rule groups as input (standard format), reuse Prometheus's rule-evaluation engine instead of reimplementing it, write output as Prometheus time series blocks, expose via the standard query API. Building on existing abstractions meant the backtest results were queryable with the same tools as production metrics — analysis UI built once — and the system was portable into every developer's existing workflow.
- Guardrails aren't optional for a backtest service. Simulating thousands of alerts over 30 days quickly, without destabilizing production, requires: per-backtest Kubernetes pod with autoscaling (resource isolation), concurrency limits, error thresholds, and multiple circuit breakers. "A backtesting system that can destabilize production is worse than no system at all."
- Perfect is the enemy of shipped. The simulator doesn't resolve recording-rule dependencies. The fix wasn't a sophisticated resolver — it was a UI that highlights modified dependencies and prompts resolution, turning a technical limitation into a guided two-step workflow (modify recording rule first, then backtest dependent alerts). Ships the 80% solution with UX closing the gap.
- Own the full surface area. "Abstractions only simplify things when you own all the touchpoints: the input language engineers write, the generation process, the UI that displays results, and the validation tools that provide feedback. Partial ownership creates leaky abstractions." Extends the observability-migration post's "own-the-interaction-layer" lesson with a second frame — partial ownership of an abstraction surface is worse than no abstraction because users then debug through all the layers you don't control.
- 300K-alert vendor → Prometheus migration unlocked. Rewriting 300K alerts manually was structurally impossible under the old workflow. It became tractable with the Change Report UI + bulk backtesting + a vendor-specific import integration — described as "a structured, confident migration" replacing what had been expected to be a "multi-year slog."
- Culture transformation followed the workflow fix, not vice versa. 90% reduction in company-wide alert noise; engineers "started competing to improve alerts"; platform teams resumed iterating on shared alert templates. The team's frame: it looked like a culture problem, but a workflow problem that made alert-tuning cost-prohibitive was driving the learned helplessness.
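The load-bearing primitive from the first takeaway can be sketched minimally. This is an illustrative toy, not Airbnb's implementation (which hooks into Prometheus's actual rule manager): replay a threshold-plus-duration rule over historical samples and record when it would have fired. `AlertRule`, `backtest`, and the field names are invented for the sketch.

```python
from dataclasses import dataclass

# Toy version of alert backtesting: given historical (timestamp, value)
# samples, determine at which evaluation steps a proposed alert would
# have been firing. Real systems evaluate arbitrary PromQL; this sketch
# hard-codes a "value > threshold, sustained for N seconds" rule.

@dataclass
class AlertRule:
    name: str
    threshold: float   # fire when value > threshold...
    for_seconds: int   # ...sustained at least this long

def backtest(rule: AlertRule, samples: list[tuple[int, float]]) -> list[int]:
    """Return timestamps at which the alert would be in the firing state."""
    firing, breach_start = [], None
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts                  # breach begins
            if ts - breach_start >= rule.for_seconds:
                firing.append(ts)                  # sustained: firing
        else:
            breach_start = None                    # breach resets
    return firing
```

Counting entries in the returned timeline over the backtest window is the crudest possible noisiness proxy; the post does not disclose how Airbnb actually computes its noisiness metric.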
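Two of the guardrails named in the takeaways — concurrency limits and error thresholds feeding a circuit breaker — can be illustrated in a few lines. The class, limits, and failure behavior below are assumptions for the sketch; the post only names the mechanisms.

```python
import threading

# Sketch of backtest-service guardrails: a semaphore caps concurrent
# backtests, and a circuit breaker stops admitting new work once too
# many backtests have failed. BacktestGuard is a hypothetical name.

class BacktestGuard:
    def __init__(self, max_concurrent: int = 4, max_errors: int = 3):
        self._slots = threading.Semaphore(max_concurrent)
        self._errors = 0
        self._max_errors = max_errors
        self._lock = threading.Lock()

    def run(self, backtest_fn):
        with self._lock:
            if self._errors >= self._max_errors:
                raise RuntimeError("circuit open: too many failed backtests")
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("concurrency limit reached")  # shed load
        try:
            return backtest_fn()
        except Exception:
            with self._lock:
                self._errors += 1      # count toward the error threshold
            raise
        finally:
            self._slots.release()
```

The production system adds resource isolation on top of this (one autoscaled Kubernetes pod per backtest), which a process-local sketch cannot show.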
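The "guided two-step workflow" for recording-rule dependencies amounts to flagging, not resolving. A hedged sketch of the flagging half: given the recording rules a diff modified, surface every alert whose expression references one of them. Rule shapes are simplified to name-to-expression maps here; real inputs are Prometheus rule-group YAML.

```python
import re

# Sketch of dependency flagging for a Change Report UI: the simulator
# does not resolve recording-rule dependencies, so instead highlight
# alerts that depend on a modified recording rule and prompt the author
# to re-record first, then backtest the dependents.

def dependent_alerts(alert_rules: dict[str, str],
                     modified_recording_rules: list[str]) -> list[str]:
    """alert_rules maps alert name -> PromQL expression. Return alert
    names whose expression mentions a modified recording rule."""
    flagged = []
    for name, expr in alert_rules.items():
        for rec in modified_recording_rules:
            # Whole-token match so 'job:errors' doesn't hit 'job:errors5m'.
            if re.search(rf"\b{re.escape(rec)}\b", expr):
                flagged.append(name)
                break
    return sorted(flagged)
```

This is the "80% solution with UX closing the gap": a regex-level scan plus a prompt, rather than a full PromQL dependency resolver.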
Architectural facts & numbers¶
- Alerts migrated from a vendor to Prometheus: 300,000.
- Company-wide alert-noise reduction: 90%.
- Iteration-cycle compression: ~1 month → ~1 afternoon for making and validating alert changes within a single PR.
- Backtest-engine integration point: Prometheus rules/manager.go (specific commit linked in the post).
- Backtest-output format: Prometheus time series blocks, queryable via the standard range-query API.
- Backtest execution environment: per-backtest Kubernetes pod with autoscaling; concurrency limits + error thresholds + circuit breakers.
- Backtest scope: full-diff, hundreds to thousands of related alerts per Change Report.
- Typical backtest window: 30 days of historical data (one-week backtest shown in the walkthrough screenshots).
- Signals surfaced per alert: computed "noisiness" metric; firing-count timeline in the table view; dive-in inspection per alert; override-value sandbox (try a new threshold, see simulated firing).
- Change Report delivery: CLI, CI, and auto-posted on PRs.
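Because backtest output is ordinary Prometheus time series, it is readable through the stock range-query endpoint (/api/v1/query_range is standard Prometheus). A minimal client sketch; the metric name backtest_alert_firing and the base URL are invented for illustration, and the response shape is the documented range-query JSON.

```python
from urllib.parse import urlencode

# Sketch of "compatibility over novelty" from the consumer side: since
# backtest results are written as Prometheus time series blocks, any
# tool that speaks the standard range-query API can read them.

def range_query_url(base: str, query: str, start: int, end: int, step: str) -> str:
    """Build a standard Prometheus range-query URL."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base}/api/v1/query_range?{params}"

def firing_timestamps(response_json: dict) -> list[int]:
    """Extract timestamps where a simulated alert series equals 1
    (firing) from a standard range-query response body."""
    out = []
    for series in response_json["data"]["result"]:
        out.extend(ts for ts, v in series["values"] if float(v) == 1.0)
    return sorted(out)
```

The same two functions would work unchanged against production metrics, which is the point: the analysis UI only had to be built once.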
Systems / concepts / patterns extracted¶
- Systems: systems/airbnb-observability-platform — extends the platform story with the Change Report + bulk-backtest backend; systems/prometheus — reuses its rule-evaluation engine unchanged.
- Concepts: concepts/observability — own-the-interaction-layer sharpened with "partial ownership creates leaky abstractions".
- Patterns: patterns/alerts-as-code — deepened with the three-pillar stack (local-first / Change Reports / bulk backtesting), the compatibility-over-novelty architectural note, the guardrail discipline; patterns/alert-backtesting — new, first-class pattern for replaying proposed alerts against historical metric data at PR-diff granularity, with noisiness scoring + per-alert inspection.
Caveats¶
- Raw article capture is truncated at the top — starts mid-sentence inside the backtesting-system paragraph, so earlier sections (motivation, "local-first development" detail, "Change Report" design detail) are referenced only via recap. The pillar naming ("Local-first development, Change Reports, and bulk backtesting") is explicit in the conclusion.
- No quantitative latency numbers for the backtest engine itself (how long a 30-day × 1,000-alert diff takes to run).
- No recording-rule-dependency-resolver architecture (the team chose a UI affordance over a resolver, no plans disclosed).
- No discussion of how alerts-as-code source lives in git (repo layout, review gates, lint tooling).
- No discussion of how the noisiness metric is computed (firing rate? time-weighted? distribution over the window?).
- Quote from Gregory Szorc (Senior Staff Software Engineer) is the only cited user voice; no broader developer-survey data.
Links¶
- Raw: raw/airbnb/2026-03-04-it-wasnt-a-culture-problem-upleveling-alert-development-at-a-b220dde7.md
- Original: https://medium.com/airbnb-engineering/it-wasnt-a-culture-problem-upleveling-alert-development-at-airbnb-01e2290eb0f5