Skip to content

PATTERN Cited by 2 sources

Alerts as code

Treat each alert as a first-class software artifact: authored with IDE-style tooling, validated against historical data before deploy, diffed on change, and reviewed like code. Replaces the common legacy pattern of alerts as sparsely documented config files that nobody owns.

Required capabilities

  • Authoring UX. Autocomplete on metric names, labels, and operators; builder-style assistance for query construction.
  • Backtesting. Replay the alert expression against historical metric data to answer "when would this alert have fired in the last N weeks?". Catches over/under-sensitive thresholds before deploy.
  • Diffing. Show the behavioral change between two versions of an alert (what fires now that didn't before, and vice versa) so reviewers can reason about impact.
  • Deployable as code. Alert definitions live in version control or a structured store, are subject to review, and deploy via a CD pipeline — see patterns/git-based-config-workflow.
  • Centralized authoring surface. One framework across the org rather than per-team conventions; reduces migration/maintenance cost.

Why it matters

  • Legacy alerting (sparsely documented YAML / vendor web UI) leads to silent rot: nobody knows why an alert exists, who owns it, or what would happen if they changed it. Backtesting and diffing make alert changes reviewable, which is the precondition for them being owned.
  • During a platform migration, a centralized authoring surface reduces per-team migration work because teams aren't carrying custom per-team alert tooling across the cutover.

Tradeoffs

  • Upfront investment in authoring tooling, backtest engine, and diff UX.
  • Requires a stable metric catalog and type metadata to power autocomplete and backtests.
  • Some cultural friction from teams used to "just edit the alert in the UI" workflows.

Seen in

  • sources/2026-03-17-airbnb-observability-ownership-migration — Airbnb pulled forward a new alert-authoring framework (from their Reliability XP team) mid-migration. Framework provides code-style authoring, autocomplete, backtesting, and diffing. Initially they had planned to preserve alerts as-is; the switch materially increased the migration's perceived value and centralized authoring reduced per-team effort.
  • sources/2026-03-04-airbnb-alert-backtesting-change-reports — deep-dive on the same framework's three pillars: local-first development, Change Reports (auto-posted on every PR, bulk diff view with sortable "noisiness" + firing timelines), and bulk alert backtesting. Architectural discipline: compatibility over novelty (reuse Prometheus rule-group input format + rule-evaluation engine + time-series-block output rather than reimplementing), guardrails aren't optional (per-backtest K8s pod with autoscaling + concurrency limits + error thresholds + circuit breakers), perfect is the enemy of shipped (no recording-rule dependency resolver — UI affordance closes the gap), own the full surface area ("partial ownership creates leaky abstractions"). Unlocked a 300K-alert vendor → Prometheus migration, cut iteration cycles from month to afternoon, and drove a 90% reduction in company-wide alert noise.
Last updated · 200 distilled / 1,178 read