CONCEPT Cited by 1 source
Multi-window multi-burn-rate (MWMBR) alerting¶
Multi-Window Multi-Burn-Rate (MWMBR) alerting is the Google SRE Workbook alerting strategy in which an SLO-backed alert fires only when the error-budget burn rate exceeds a threshold in two time windows simultaneously: a long one (e.g. 1 hour) to confirm that a material share of the budget has been consumed, and a short one (e.g. 5 minutes) to confirm that the budget is still being burned right now. Multiple such (short, long, burn_rate) tuples cover different severities (page on fast burn; ticket on slow burn).
Definition¶
From the Google SRE Workbook, with Zalando's operational instantiation:
"We soon evolved our Adaptive Paging alert handler to use the Multi Window Multi Burn Rate strategy which uses burn rates to define alert thresholds. The alerts went from being triggered whenever the error rate breached the SLO, to having the decision of whether a page should go out or not based on the rate we are burning the error budget for an operation." — sources/2022-04-27-zalando-operation-based-slos
Three defining properties:
- Fire on error-budget burn rate, not raw error rate. Burn rate n× means "at the current rate, we'd exhaust our budget in 1/n of the SLO window". A 99.9% SLO with 1× burn is sustainable; 14× means we'll burn the whole 28-day budget in 2 days.
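- Two windows simultaneously. The long window filters out transient spikes: the burn has to persist long enough to consume a material share of the budget. The short window confirms the burn is still ongoing, so the alert stops firing soon after the problem ends instead of lingering until the long-window average drains.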
- Multiple (short, long, burn_rate) tuples per alert. These cover different severities: fast burn → page; medium burn → page at lower urgency; slow burn → ticket. Each tuple is one alert rule on the same SLO.
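The burn-rate definition in the first property can be written down directly. The following is a minimal sketch; the function names `burn_rate` and `days_to_exhaustion` are illustrative and not taken from the Workbook or from Zalando's implementation, and the 28-day default simply mirrors the window used on this page.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    if allowed <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / allowed


def days_to_exhaustion(rate: float, slo_window_days: float = 28.0) -> float:
    """Days until the whole budget is gone if `rate` is sustained."""
    if rate <= 0.0:
        return float("inf")  # no errors, the budget is never exhausted
    return slo_window_days / rate


if __name__ == "__main__":
    # A 1.4% error ratio against a 99.9% SLO is roughly a 14x burn.
    rate = burn_rate(error_ratio=0.014, slo_target=0.999)
    print(f"burn rate: {rate:.1f}x")
    print(f"days to exhaustion: {days_to_exhaustion(rate):.1f}")
```

Expressing the alert in terms of `burn_rate` rather than `error_ratio` is what lets one set of thresholds be reused across SLOs with different targets.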
Why MWMBR over raw-rate¶
Raw-error-rate alerting has two classical failure modes MWMBR targets directly:
- False positives on transient spikes. A 30-second blip pushes the error rate above the SLO threshold and pages on-call, even though there is no lasting impact and the budget is barely touched. MWMBR requires the burn-rate threshold to be exceeded in both the short and the long window, so a blip that barely moves the long-window average does not fire. Zalando's post narrates this exact failure: "multiple occasions of short lived error spikes that resulted in pages."
- False negatives on slow regressions. A silent bug holds the error rate modestly above the level the SLO allows for days. It never reaches a raw-rate threshold that has been tuned high enough to ignore blips, yet it steadily drains the budget: a sustained 2× burn exhausts a 28-day budget in 14 days. Raw-rate alerting misses it entirely. MWMBR's slow-burn tuple (e.g. 1× over 3 days) catches it.
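To make the transient-spike case concrete, here is a rough sketch of how a short total outage shows up in a 5-minute and a 1-hour window. It assumes constant traffic, a 99.9% SLO, a 14× fast-burn threshold, and evaluation at the moment the outage ends; the helper names are hypothetical.

```python
def window_burn(bad_seconds: float, window_seconds: float, slo_target: float) -> float:
    """Burn rate over a window in which `bad_seconds` were a total outage.

    Assumes constant traffic, so the error ratio equals the share of bad time.
    """
    error_ratio = min(bad_seconds, window_seconds) / window_seconds
    return error_ratio / (1.0 - slo_target)


def fast_burn_fires(bad_seconds: float, slo_target: float = 0.999,
                    threshold: float = 14.0) -> bool:
    """Both the 5-minute and the 1-hour window must exceed the threshold."""
    short_burn = window_burn(bad_seconds, 5 * 60, slo_target)
    long_burn = window_burn(bad_seconds, 60 * 60, slo_target)
    return short_burn > threshold and long_burn > threshold


if __name__ == "__main__":
    for seconds in (30, 120):
        print(seconds, "s outage -> fast-burn alert fires:", fast_burn_fires(seconds))
```

Under these assumptions a 30-second outage is a very high burn in the 5-minute window but stays below 14× over the hour, so no page goes out; a 2-minute outage exceeds the threshold in both windows.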
Canonical Google SRE Workbook thresholds¶
The Workbook recommends these tuples as defaults for a 30-day SLO window (subject to per-SLO tuning):
| Severity | Long window | Short window | Burn rate | Alert destination |
|---|---|---|---|---|
| Fast burn | 1h | 5min | 14.4× | Page (critical) |
| Medium burn | 6h | 30min | 6× | Page (urgent) |
| Slow burn | 3d | 6h | 1× | Ticket |
Each threshold corresponds to a share of the budget consumed within the long window: 14.4× × 1h / 30d = 2% of the budget in 1h (clearly outage-scale), 6× × 6h / 30d = 5%, and 1× × 3d / 30d = 10%. On a 28-day window, a 14× burn consumes about 2.1% of the budget per hour and exhausts it in 2 days.
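The arithmetic that links a burn rate to the share of budget consumed can be sketched as below. The function names are illustrative; the only inputs are the burn rate, the alert's long window, and the SLO window length.

```python
def budget_consumed(burn: float, window_hours: float, slo_window_days: float) -> float:
    """Fraction of the error budget used if `burn` is sustained for `window_hours`."""
    return burn * window_hours / (slo_window_days * 24.0)


def burn_for_budget(fraction: float, window_hours: float, slo_window_days: float) -> float:
    """Burn rate that uses up `fraction` of the budget within `window_hours`."""
    return fraction * slo_window_days * 24.0 / window_hours


if __name__ == "__main__":
    # 14x sustained for one hour on a 28-day window: roughly 2% of the budget.
    print(f"{budget_consumed(14.0, 1.0, 28.0):.2%}")
    # Thresholds for 2% in 1h, 5% in 6h and 10% in 3d on a 30-day window.
    for fraction, hours in ((0.02, 1.0), (0.05, 6.0), (0.10, 72.0)):
        print(f"{fraction:.0%} in {hours:g}h -> {burn_for_budget(fraction, hours, 30.0):.1f}x")
```

Working backwards from "what share of budget justifies waking someone up" to a burn-rate threshold is how the Workbook's 30-day defaults of 14.4×, 6× and 1× are obtained; a different SLO window shifts the thresholds proportionally.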
Prerequisites¶
- An explicit error budget. Derived from an SLO over a definite window (Google canon: 28 days). No budget = no burn-rate = no MWMBR alerting.
- An SLI time-series with enough resolution. 5-minute and 1-hour windows both need to be computable; the raw SLI typically needs ≤1-minute resolution with at least 28-day retention.
- An alerting evaluator that can combine multiple expressions. The alert's boolean combines "burn rate in short window > X AND burn rate in long window > X"; many classic alerting backends (Nagios-era) cannot express this cleanly. Most modern systems (Prometheus, Datadog, CloudWatch) can.
- A policy on what to do when the alert fires. Fast burn pages; slow burn tickets; out-of-the-box MWMBR without a routing policy doesn't reduce fatigue by itself.
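The last two prerequisites, a combined boolean and a routing policy, fit together roughly as follows. This is a sketch under assumptions: `BurnRule`, `RULES` and `route` are hypothetical names, the window labels are arbitrary dictionary keys, and the per-window burn rates are assumed to be computed elsewhere by the metrics backend.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BurnRule:
    severity: str
    long_window: str
    short_window: str
    threshold: float
    destination: str


# Ordered from most to least severe. Thresholds are the Workbook defaults
# for a 30-day SLO window and should be tuned per SLO.
RULES = (
    BurnRule("fast", "1h", "5m", 14.4, "page"),
    BurnRule("medium", "6h", "30m", 6.0, "page"),
    BurnRule("slow", "3d", "6h", 1.0, "ticket"),
)


def route(burn_by_window: Mapping[str, float]) -> Optional[BurnRule]:
    """Return the most severe rule whose two windows both exceed its threshold."""
    for rule in RULES:
        long_burn = burn_by_window.get(rule.long_window, 0.0)
        short_burn = burn_by_window.get(rule.short_window, 0.0)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            return rule
    return None  # nothing fires


if __name__ == "__main__":
    steady_regression = {"5m": 2.0, "30m": 2.0, "1h": 2.0, "6h": 2.0, "3d": 2.0}
    fired = route(steady_regression)
    print(fired.destination if fired else "no alert")
```

Returning only the most severe matching rule is one possible policy; it keeps a fast burn from also opening a ticket for the same event.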
Interaction with symptom-based alerting¶
MWMBR closes one of the main gaps in symptom-based alerting: "if we alert on the symptom, how do we avoid paging on every 5-second blip?" Before MWMBR, symptom-alerts either had to carry per-alert fine-tuning (time-of-day gates, throughput guards, persistence durations — Zalando's pre-MWMBR mess) or accept false positives. MWMBR's structure handles this at the alert-rule altitude, freeing each SLO from per-alert tuning: "engineering teams required no effort to set up and manage these alerts."
Interaction with adaptive paging¶
Zalando's Adaptive Paging alert handler evolved from triggering on raw CBO error rate (2019) to triggering on CBO budget burn-rate (2022) via MWMBR. The routing logic (trace-graph traversal → team closest to root cause) is unchanged; only the trigger-predicate became MWMBR. This is the canonical composition: MWMBR for when to page, adaptive paging for whom.
Anti-patterns¶
- Single-window burn-rate alerting. With only the short window, the alert fires on every transient spike. With only the long window, the alert keeps firing long after the burn has stopped, because the long-window average takes time to drain.
- Budget without burn-rate alerting. Tracking budget consumption daily (e.g. in a dashboard) but alerting on raw rate: budget becomes a post-hoc metric, not a trigger.
- Too many severity tiers. Six tuples per SLO generate overlapping alerts that all fire on the same event. Stick to ≤3 severities.
- Not tuning thresholds to traffic. On very-low-traffic services, the short window may contain few or no requests. At 0.1 RPS a 5-minute window holds about 30 requests, so a single failure already reads as an outage-scale burn rate. Extend the short window, or switch to availability-ratio alerting once there is enough data.
- Per-team burn rate customisation. Each team picking its own (short, long, burn_rate) tuples recreates the fine-tuning problem. Standardise at the platform layer.
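For the low-traffic anti-pattern above, a quick sanity check is to ask how large a burn rate a single failed request would produce in the short window. The sketch below assumes steady traffic at a known request rate; the function names and the "one stray failure must not fire the alert" criterion are illustrative choices, not part of the Workbook.

```python
def single_failure_burn(rps: float, window_seconds: float, slo_target: float) -> float:
    """Burn rate implied by exactly one failed request in the window."""
    # With fewer than one expected request, a lone failure is a 100% error ratio.
    expected_requests = max(rps * window_seconds, 1.0)
    return (1.0 / expected_requests) / (1.0 - slo_target)


def short_window_is_usable(rps: float, window_seconds: float,
                           slo_target: float, threshold: float) -> bool:
    """True if one stray failure alone cannot push the window over the threshold."""
    return single_failure_burn(rps, window_seconds, slo_target) < threshold


if __name__ == "__main__":
    for rps in (0.1, 100.0):
        usable = short_window_is_usable(rps, 5 * 60, 0.999, 14.4)
        print(f"{rps} RPS, 5-minute window usable: {usable}")
```

A platform team could run a check like this once per SLO and lengthen the short window automatically, which keeps the tuples standardised while still accounting for traffic.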
Seen in¶
- sources/2022-04-27-zalando-operation-based-slos — canonical Zalando instantiation. Explicit narrative of moving from raw-rate (adaptive paging v1, 2019) to MWMBR (v2, 2022); the structural reason given is false-positive elimination "without fine tuning." Direct reference to the Google SRE Workbook chapter as the threshold source.
Related¶
- concepts/error-budget — the load-bearing primitive MWMBR operationalises.
- concepts/service-level-objective — from which both the budget and the windows are derived.
- concepts/operation-based-slo — the SLO altitude at which Zalando applies MWMBR (CBOs, not services).
- concepts/critical-business-operation — the alertable primitive whose MWMBR burn Zalando monitors.
- concepts/symptom-based-alerting — the alerting strategy MWMBR makes fine-tuning-free.
- concepts/adaptive-paging — the routing primitive MWMBR composes with.
- concepts/alert-fatigue — the problem MWMBR targets.
- systems/google-sre-book