

Alert backtesting

Replay a proposed alert expression against historical metric data to answer: "when would this alert have fired, and how often, if it had existed the whole time?" Applied at PR-diff granularity (hundreds or thousands of alerts per change) with per-alert quality signals ("noisiness", firing-count timeline), alert backtesting turns alert authoring from deploy-and-wait into a pre-merge workflow.

Shape

  • Input: the alert expression in the platform's native format (e.g. Prometheus rule groups), plus a time window (a one-week or 30-day lookback is typical) and historical metrics.
  • Execution: reuses the platform's own rule-evaluation engine unchanged — simulation = "what if this rule existed at t" for every t in the window. Each run isolated in its own execution unit (Kubernetes pod with autoscaling) to prevent resource contention with production.
  • Output: the simulated firings written back in the platform's native time-series format, queryable with the platform's standard query API.
  • Scoring: a "noisiness" metric per alert + a firing-count timeline; surfaced in the PR's Change Report so reviewers can sort by noisiness and focus on the problematic ones.
  • Dive-in UX: per-alert panel with the underlying metric graph + an override sandbox ("what if I changed the threshold from 1.5 to 1.14?"), which re-runs the simulation inline.
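The replay loop described above can be sketched minimally. This is an illustrative assumption, not Airbnb's implementation — their system reuses Prometheus's own rule-evaluation engine rather than reimplementing it (see Required capabilities). The `Sample`, `backtest`, and `noisiness` names and the fixed-threshold rule shape are hypothetical simplifications:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    ts: int        # unix seconds
    value: float

def backtest(samples, threshold, for_duration=0):
    """Replay a threshold alert over historical samples.

    Returns the timestamps at which the simulated alert would have
    been firing -- "what if this rule existed at t" for every sample t.
    """
    firings = []
    pending_since = None
    for s in samples:
        if s.value > threshold:
            if pending_since is None:
                pending_since = s.ts
            # fire only once the condition has held for `for_duration`
            if s.ts - pending_since >= for_duration:
                firings.append(s.ts)
        else:
            pending_since = None
    return firings

def noisiness(firings, window_seconds):
    """Crude noisiness score: simulated firings per day over the window."""
    return len(firings) / (window_seconds / 86400)
```

The override sandbox is then just a re-run of `backtest` with a different `threshold` against the same cached samples, which is why it can be inline and fast.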

Required capabilities

  • Alerts-as-code substrate (see patterns/alerts-as-code) so the PR diff is the input to the backtest.
  • Historical metric retention long enough to cover a meaningful window (~30 days typical).
  • Reusable rule-evaluation engine — preferably the same one running production, so simulation ≡ production semantics. Implementing an independent simulator is an anti-pattern: divergence is silent and the signal becomes unreliable.
  • Isolation + guardrails: per-backtest compute pod, concurrency limits, error thresholds, circuit breakers. "A backtesting system that can destabilize production is worse than no system at all."
  • Diff-level batching: scale the UI from one alert to thousands per run, because platform-template changes fan out to many services.
  • Modified-dependency detection: when a PR touches a recording rule other alerts depend on, the UI must highlight the dependency and prompt a two-step flow (modify recording rule first, then backtest dependent alerts) — otherwise silent incorrect results.
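The modified-dependency check in the last bullet can be sketched as name matching between a PR's modified recording rules and the expressions of its alerts. The regex extraction below is a hypothetical simplification (a real implementation would use the PromQL parser, and the regex over-matches function names like `rate`), but it illustrates the flow that drives the two-step prompt:

```python
import re

def referenced_metrics(expr):
    """Rough extraction of metric-like identifiers from a PromQL
    expression. Over-matches (e.g. picks up function names), which is
    acceptable for a must-not-miss dependency check."""
    return set(re.findall(r"[a-zA-Z_:][a-zA-Z0-9_:]*", expr))

def alerts_needing_two_step(modified_recording_rules, alert_rules):
    """Flag alerts whose expression references a recording rule that
    the same PR modifies, so the UI can highlight the dependency and
    prompt the two-step flow (modify the recording rule first, then
    backtest the dependent alerts)."""
    modified = set(modified_recording_rules)
    return [name for name, expr in alert_rules.items()
            if referenced_metrics(expr) & modified]
```

Without a check like this, a backtest of the dependent alert silently runs against the *old* recorded series — exactly the incorrect-results failure mode the bullet warns about.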

Why it matters

  • Collapses the previous deploy-side-by-side-and-wait iteration loop from a month to an afternoon.
  • Makes platform-team alert template changes (fanning out to thousands of services) reviewable for the first time — without backtesting, reviewers have no basis to reason about behavior-at-rollout.
  • Unlocks fleet-wide migrations: Airbnb attributes the tractability of its 300K-alert vendor → Prometheus migration primarily to the Change Report + bulk-backtest capability.
  • Aggregate behavioral effect: alert-hygiene debt becomes visible at author-time, so it gets paid down continuously instead of accumulating.

Tradeoffs

  • Build cost is real. Rule-evaluation-engine integration + pod-scoped execution + autoscaling + UI + noisiness scoring + override sandbox is a platform-team project, not a weekend hack.
  • Recording-rule dependencies: strict correctness would require a dependency resolver. Airbnb explicitly chose a UI affordance (highlight + prompt) over a resolver as a "perfect-is-enemy-of-shipped" simplification. This leaves a correctness edge case to operator discipline.
  • Historical data cost: long-retention metrics are expensive. Window length and cardinality directly drive backtest storage cost.
  • Not a substitute for staging. Backtesting validates behavior against past reality; it can't foresee new regimes (traffic shifts, new incident classes).
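The guardrail layer that drives much of the build cost (concurrency limits, error thresholds, circuit breakers) is conceptually simple. A minimal sketch, with hypothetical thresholds — the source describes multiple circuit breakers but not their exact logic:

```python
class BacktestCircuitBreaker:
    """Refuses new backtests once too many are running or too many
    recent runs have failed, so a bad batch can't keep hammering the
    shared metric store -- "a backtesting system that can destabilize
    production is worse than no system at all"."""

    def __init__(self, max_errors=5, max_concurrent=10):
        self.max_errors = max_errors
        self.max_concurrent = max_concurrent
        self.errors = 0
        self.running = 0
        self.tripped = False

    def try_acquire(self):
        """Admit a new backtest run, or refuse if tripped/saturated."""
        if self.tripped or self.running >= self.max_concurrent:
            return False
        self.running += 1
        return True

    def release(self, ok):
        """Record a finished run; trip the breaker past the error budget."""
        self.running -= 1
        if not ok:
            self.errors += 1
            if self.errors >= self.max_errors:
                self.tripped = True
```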

Seen in

  • sources/2026-03-04-airbnb-alert-backtesting-change-reports — Airbnb's Reliability Experience team built backtesting into the Change Report UI by hooking directly into Prometheus's rules/manager.go rule-evaluation engine. Backtest results persisted as Prometheus time series blocks, queryable via the standard range-query API. Each run in its own Kubernetes pod with autoscaling + concurrency limits + error thresholds + multiple circuit breakers. Bulk-at-diff with a "noisiness" metric + firing-timeline table. Typical window 30 days (one week shown in walkthrough screenshots). 300K alerts migrated from a vendor to Prometheus, 90% reduction in company-wide alert noise, iteration cycle month → afternoon.