

Alert backtesting

Replay a proposed alert expression against historical metric data to answer: "when would this alert have fired, and how often, if it had existed the whole time?" Applied at PR-diff granularity (hundreds or thousands of alerts per change) with per-alert quality signals ("noisiness", firing-count timeline), alert backtesting turns alert authoring from deploy-and-wait into a pre-merge workflow.

Shape

  • Input: the alert expression in the platform's native format (e.g. Prometheus rule groups), plus a time window (a one-week or 30-day lookback is typical) and historical metrics.
  • Execution: reuses the platform's own rule-evaluation engine unchanged — simulation = "what if this rule existed at t" for every t in the window. Each run isolated in its own execution unit (Kubernetes pod with autoscaling) to prevent resource contention with production.
  • Output: the simulated firings written back in the platform's native time-series format, queryable with the platform's standard query API.
  • Scoring: a "noisiness" metric per alert + a firing-count timeline; surfaced in the PR's Change Report so reviewers can sort by noisiness and focus on the problematic ones.
  • Dive-in UX: per-alert panel with the underlying metric graph + an override sandbox ("what if I changed the threshold from 1.5 to 1.14?"), which re-runs the simulation inline.
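The replay loop described above can be sketched minimally. This is an illustrative assumption, not Airbnb's implementation — their system reuses Prometheus's own rule-evaluation engine rather than reimplementing it (see Required capabilities). The `Sample`, `backtest`, and `noisiness` names and the fixed-threshold rule shape are hypothetical simplifications:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    ts: int        # unix seconds
    value: float

def backtest(samples, threshold, for_duration=0):
    """Replay a threshold alert over historical samples.

    Returns the timestamps at which the simulated alert would have
    been firing -- "what if this rule existed at t" for every sample t.
    """
    firings = []
    pending_since = None
    for s in samples:
        if s.value > threshold:
            if pending_since is None:
                pending_since = s.ts
            # fire only once the condition has held for `for_duration`
            if s.ts - pending_since >= for_duration:
                firings.append(s.ts)
        else:
            pending_since = None
    return firings

def noisiness(firings, window_seconds):
    """Crude noisiness score: simulated firings per day over the window."""
    return len(firings) / (window_seconds / 86400)
```

The override sandbox is then just a re-run of `backtest` with a different `threshold` against the same cached samples, which is why it can be inline and fast.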

Required capabilities

  • Alerts-as-code substrate (see patterns/alerts-as-code) so the PR diff is the input to the backtest.
  • Historical metric retention long enough to cover a meaningful window (~30 days typical).
  • Reusable rule-evaluation engine — preferably the same one running production, so simulation ≡ production semantics. Implementing an independent simulator is an anti-pattern: divergence is silent and the signal becomes unreliable.
  • Isolation + guardrails: per-backtest compute pod, concurrency limits, error thresholds, circuit breakers. "A backtesting system that can destabilize production is worse than no system at all."
  • Diff-level batching: scale the UI from one alert to thousands per run, because platform-template changes fan out to many services.
  • Modified-dependency detection: when a PR touches a recording rule other alerts depend on, the UI must highlight the dependency and prompt a two-step flow (modify recording rule first, then backtest dependent alerts) — otherwise silent incorrect results.
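The modified-dependency check in the last bullet can be sketched as name matching between a PR's modified recording rules and the expressions of its alerts. The regex extraction below is a hypothetical simplification (a real implementation would use the PromQL parser, and the regex over-matches function names like `rate`), but it illustrates the flow that drives the two-step prompt:

```python
import re

def referenced_metrics(expr):
    """Rough extraction of metric-like identifiers from a PromQL
    expression. Over-matches (e.g. picks up function names), which is
    acceptable for a must-not-miss dependency check."""
    return set(re.findall(r"[a-zA-Z_:][a-zA-Z0-9_:]*", expr))

def alerts_needing_two_step(modified_recording_rules, alert_rules):
    """Flag alerts whose expression references a recording rule that
    the same PR modifies, so the UI can highlight the dependency and
    prompt the two-step flow (modify the recording rule first, then
    backtest the dependent alerts)."""
    modified = set(modified_recording_rules)
    return [name for name, expr in alert_rules.items()
            if referenced_metrics(expr) & modified]
```

Without a check like this, a backtest of the dependent alert silently runs against the *old* recorded series — exactly the incorrect-results failure mode the bullet warns about.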

Why it matters

  • Collapses the previous deploy-side-by-side-and-wait iteration loop from a month to an afternoon.
  • Makes platform-team alert template changes (fanning out to thousands of services) reviewable for the first time — without backtesting, reviewers have no basis to reason about behavior-at-rollout.
  • Unlocks fleet-wide migrations: Airbnb attributes the tractability of its 300K-alert vendor → Prometheus migration primarily to the Change Report + bulk-backtest capability.
  • Aggregate behavioral effect: alert-hygiene debt becomes visible at author-time, so it gets paid down continuously instead of accumulating.

Tradeoffs

  • Build cost is real. Rule-evaluation-engine integration + pod-scoped execution + autoscaling + UI + noisiness scoring + override sandbox is a platform-team project, not a weekend hack.
  • Recording-rule dependencies: strict correctness would require a dependency resolver. Airbnb explicitly chose a UI affordance (highlight + prompt) over a resolver as a "perfect-is-enemy-of-shipped" simplification. This leaves a correctness edge case to operator discipline.
  • Historical data cost: long-retention metrics are expensive. Window length and cardinality directly drive backtest storage cost.
  • Not a substitute for staging. Backtesting validates behavior against past reality; it can't foresee new regimes (traffic shifts, new incident classes).
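The guardrail layer that drives much of the build cost (concurrency limits, error thresholds, circuit breakers) is conceptually simple. A minimal sketch, with hypothetical thresholds — the source describes multiple circuit breakers but not their exact logic:

```python
class BacktestCircuitBreaker:
    """Refuses new backtests once too many are running or too many
    recent runs have failed, so a bad batch can't keep hammering the
    shared metric store -- "a backtesting system that can destabilize
    production is worse than no system at all"."""

    def __init__(self, max_errors=5, max_concurrent=10):
        self.max_errors = max_errors
        self.max_concurrent = max_concurrent
        self.errors = 0
        self.running = 0
        self.tripped = False

    def try_acquire(self):
        """Admit a new backtest run, or refuse if tripped/saturated."""
        if self.tripped or self.running >= self.max_concurrent:
            return False
        self.running += 1
        return True

    def release(self, ok):
        """Record a finished run; trip the breaker past the error budget."""
        self.running -= 1
        if not ok:
            self.errors += 1
            if self.errors >= self.max_errors:
                self.tripped = True
```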

Seen in

  • sources/2026-03-04-airbnb-alert-backtesting-change-reports — Airbnb's Reliability Experience team built backtesting into the Change Report UI by hooking directly into Prometheus's rules/manager.go rule-evaluation engine. Backtest results persisted as Prometheus time series blocks, queryable via the standard range-query API. Each run in its own Kubernetes pod with autoscaling + concurrency limits + error thresholds + multiple circuit breakers. Bulk-at-diff with a "noisiness" metric + firing-timeline table. Typical window 30 days (one week shown in walkthrough screenshots). 300K alerts migrated from a vendor to Prometheus, 90% reduction in company-wide alert noise, iteration cycle month → afternoon.