PATTERN
Alert backtesting¶
Replay a proposed alert expression against historical metric data to answer: "when would this alert have fired, and how often, if it had existed the whole time?" Applied at PR-diff granularity (hundreds or thousands of alerts per change) with per-alert quality signals ("noisiness", firing-count timeline), alert backtesting turns alert authoring from deploy-and-wait into a pre-merge workflow.
Shape¶
- Input: the alert expression in the platform's native format (e.g. Prometheus rule groups), plus a time window (a one-week or 30-day lookback is typical) and historical metrics.
- Execution: reuses the platform's own rule-evaluation engine unchanged, so the simulation answers "what if this rule had existed at time t" for every t in the window. Each run is isolated in its own execution unit (a Kubernetes pod with autoscaling) to prevent resource contention with production.
- Output: the simulated firings, written back in the platform's native time-series format and queryable with the platform's standard query API.
- Scoring: a "noisiness" metric per alert + a firing-count timeline; surfaced in the PR's Change Report so reviewers can sort by noisiness and focus on the problematic ones.
- Dive-in UX: per-alert panel with the underlying metric graph + an override sandbox ("what if I changed the threshold from 1.5 to 1.14?"), which re-runs the simulation inline.
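The execution and scoring shape above can be sketched end-to-end. This is a deliberately toy model, not Airbnb's implementation (which reuses Prometheus's actual rule engine): the `AlertRule` shape, the `for_steps` semantics, and the noisiness formula (fraction of the window spent firing) are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    # Hypothetical simplified rule: fire when the metric exceeds
    # `threshold` for `for_steps` consecutive evaluation intervals
    # (a stand-in for a PromQL expr + `for:` clause).
    name: str
    threshold: float
    for_steps: int

def backtest(rule: AlertRule, samples: list[float]) -> list[int]:
    """Replay `rule` against historical samples (one value per
    evaluation interval); return the indices at which it fires."""
    firings, pending = [], 0
    for i, value in enumerate(samples):
        if value > rule.threshold:
            pending += 1
            if pending >= rule.for_steps:
                firings.append(i)  # alert is in the firing state at step i
        else:
            pending = 0  # condition broken: the `for` window resets
    return firings

def noisiness(firings: list[int], total_steps: int) -> float:
    """One possible noisiness score: fraction of the window spent firing."""
    return len(firings) / total_steps if total_steps else 0.0
```

The override sandbox in the dive-in UX is then just this loop re-run with a different `threshold`, which is why it can be inline and fast.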
Required capabilities¶
- Alerts-as-code substrate (see patterns/alerts-as-code) so the PR diff is the input to the backtest.
- Historical metric retention long enough to cover a meaningful window (~30 days typical).
- Reusable rule-evaluation engine — preferably the same one running production, so simulation ≡ production semantics. Implementing an independent simulator is an anti-pattern: divergence is silent and the signal becomes unreliable.
- Isolation + guardrails: per-backtest compute pod, concurrency limits, error thresholds, circuit breakers. "A backtesting system that can destabilize production is worse than no system at all."
- Diff-level batching: scale the UI from one alert to thousands per run, because platform-template changes fan out to many services.
- Modified-dependency detection: when a PR touches a recording rule that other alerts depend on, the UI must highlight the dependency and prompt a two-step flow (modify the recording rule first, then backtest the dependent alerts); otherwise the backtest silently produces incorrect results.
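The modified-dependency check can be sketched as a set intersection between the recording-rule names a PR touches and the names referenced by each alert expression. Everything here is hypothetical: a regex over identifiers stands in for a real PromQL parser, which is what a production implementation should use.

```python
import re

# PromQL function/keyword names to exclude from the identifier scan
# (a small illustrative subset, not exhaustive).
PROMQL_KEYWORDS = {"rate", "irate", "sum", "avg", "max", "min", "by", "on"}

def referenced_names(expr: str) -> set[str]:
    """Crude extraction of series/recording-rule names from a PromQL
    expression; colons are legal in recording-rule names."""
    idents = set(re.findall(r"[a-zA-Z_][a-zA-Z0-9_:]*", expr))
    return idents - PROMQL_KEYWORDS

def alerts_needing_two_step_flow(
    modified_recording_rules: set[str], alerts: dict[str, str]
) -> dict[str, list[str]]:
    """Map alert name -> the modified recording rules it depends on,
    i.e. the alerts whose backtest would be silently wrong unless the
    recording rule is re-materialized first."""
    flagged = {}
    for name, expr in alerts.items():
        deps = referenced_names(expr) & modified_recording_rules
        if deps:
            flagged[name] = sorted(deps)
    return flagged
```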
Why it matters¶
- Collapses the historical deploy-side-by-side-and-wait iteration loop from a month to an afternoon.
- Makes platform-team alert template changes (fanning out to thousands of services) reviewable for the first time — without backtesting, reviewers have no basis to reason about behavior-at-rollout.
- Unlocks fleet-wide migrations: Airbnb attributes their 300K-alert vendor → Prometheus migration becoming tractable primarily to the Change Report + bulk-backtest capability.
- Aggregate behavioral effect: alert-hygiene debt becomes visible at author-time, so it gets paid down continuously instead of accumulating.
Tradeoffs¶
- Build cost is real. Rule-evaluation-engine integration + pod-scoped execution + autoscaling + UI + noisiness scoring + override sandbox is a platform-team project, not a weekend hack.
- Recording-rule dependencies: strict correctness would require a dependency resolver. Airbnb explicitly chose a UI affordance (highlight + prompt) over a resolver as a "perfect-is-enemy-of-shipped" simplification, which leaves a correctness edge case to operator discipline.
- Historical data cost: long-retention metrics are expensive. Window length and cardinality directly drive backtest storage cost.
- Not a substitute for staging. Backtesting validates behavior against past reality; it can't foresee new regimes (traffic shifts, new incident classes).
Seen in¶
- sources/2026-03-04-airbnb-alert-backtesting-change-reports — Airbnb's Reliability Experience team built backtesting into the Change Report UI by hooking directly into Prometheus's rules/manager.go rule-evaluation engine. Backtest results are persisted as Prometheus time-series blocks, queryable via the standard range-query API. Each run executes in its own Kubernetes pod with autoscaling, concurrency limits, error thresholds, and multiple circuit breakers. Backtests run in bulk at diff granularity with a "noisiness" metric and a firing-timeline table; the typical window is 30 days (one week shown in walkthrough screenshots). Results: 300K alerts migrated from a vendor to Prometheus, a 90% reduction in company-wide alert noise, and an iteration cycle shortened from a month to an afternoon.
Related¶
- patterns/alerts-as-code — alert backtesting is a capability within the broader alerts-as-code discipline; an alerts-as-code platform without backtesting is the thinner version.
- patterns/ci-regression-budget-gate — same shape at a different surface: compute a cost/quality proxy from the PR diff and surface it to reviewers at pre-merge time.
- systems/airbnb-observability-platform — canonical production instance.
- concepts/observability — "own the interaction layer" sharpened: "partial ownership creates leaky abstractions."