SLO-derived alert rule generation¶
SLO-derived alert rule generation is the pattern of deriving alert window lengths and burn-rate thresholds automatically from an operation's SLO, so that defining an SLO yields a working alert rule without any per-operation tuning. It is the structural reason concepts/multi-window-multi-burn-rate alerting beat Zalando's original SLO-threshold paging.
Problem¶
The first iteration of Zalando's Adaptive Paging alert handler used the SLO target as the paging threshold: if the operation's error rate exceeded 1 - SLO, the handler paged. The promise was "engineers define SLOs, alert rules come for free."
That promise broke immediately:
"Our initial approach was to make the SLO the threshold upon which we would page the on-call responder. What we soon discovered is that it made our alerts too sensitive to occasional short lived spikes, similar to any other non-Adaptive Paging alert. We mitigated this by providing additional criteria that engineers could use to more granularly control the alert itself (time of day, throughput, length of the error rate)." — sources/2021-10-14-zalando-tracing-sres-journey-part-iii
The mitigation re-created the problem: "Engineers were back at defining alerting rules because the target set by the SLO was not enough." Hands-off alerting became hands-on alerting.
Why SLO-as-threshold paging fails¶
An SLO target (e.g. 99.9% over 28 days) is a long-horizon contract, not an instantaneous threshold. Over a 1-minute window, the error rate almost always violates 99.9% (because the denominator is small), while the long-horizon budget is fine; the sketch after this list makes the mismatch concrete. Paging on momentary breaches:
- Generates false positives on routine spikes and cold starts.
- Can miss slow burns that don't spike over the threshold but exhaust the budget over days.
- Forces engineers to add per-operation hacks (only page during business hours, only page if throughput > X, only page if error rate is elevated for Y minutes) — the exact tuning the SLO-target-as-threshold was supposed to avoid.
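A minimal back-of-the-envelope sketch of that mismatch; the traffic and error counts are illustrative, not from the source:

```python
# Illustrative numbers (not from the source): a 99.9% SLO over 28 days
# and an operation serving about 100 requests per minute.
slo = 0.999
error_budget = 1 - slo                        # 0.1% of requests may fail

requests_per_minute = 100
errors_in_one_minute = 1                      # a single failed request

one_minute_error_rate = errors_in_one_minute / requests_per_minute
print(one_minute_error_rate > error_budget)   # True: 1% >> 0.1%, so SLO-as-threshold pages

# The 28-day budget, by contrast, is barely touched.
requests_per_slo_window = requests_per_minute * 60 * 24 * 28
budget_in_requests = error_budget * requests_per_slo_window   # ~4,032 allowed errors
print(errors_in_one_minute / budget_in_requests)              # ~0.00025 of the budget consumed
```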
The pattern¶
Replace the threshold-matching handler with a handler that:
- Takes the SLO target as input. (e.g. 99.9% over 28 days ⇒ 0.1% error budget.)
- Chooses (long-window, short-window, burn-rate) tuples from a standard table (the Google SRE Workbook's multiwindow, multi-burn-rate recipe pairs a 1-hour long window with a 5-minute short window at 14.4× burn for fast burns, plus 6-hour/30-minute at 6× and 3-day/6-hour at 1× for slower burns).
- Computes the alert condition: the error-rate threshold for each tuple is burn rate × (1 - SLO), and the budget fraction a window consumes is burn rate × (window length / SLO period); fire when the measured error rate exceeds the threshold over both windows.
- Pages only when the error budget is actually at risk, not when the SLO is momentarily breached.
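A minimal sketch of that derivation, using the Workbook's standard combinations; the function and class names are illustrative, not Zalando's actual implementation:

```python
from dataclasses import dataclass

# Standard window/burn-rate combinations from the Google SRE Workbook's
# multiwindow, multi-burn-rate recipe: (long window, short window, burn rate, action).
STANDARD_TUPLES = [
    ("1h", "5m", 14.4, "page"),    # fast burn: ~2% of a 30-day budget in 1 hour
    ("6h", "30m", 6.0, "page"),    # medium burn: ~5% of the budget in 6 hours
    ("3d", "6h", 1.0, "ticket"),   # slow burn: on pace to spend the whole budget
]

@dataclass
class AlertRule:
    long_window: str
    short_window: str
    burn_rate: float
    error_rate_threshold: float   # error rate that must be exceeded in BOTH windows
    action: str

def rules_from_slo(slo_target: float) -> list[AlertRule]:
    """Derive MWMBR alert rules from an SLO target (e.g. 0.999) with no per-operation tuning."""
    error_budget = 1.0 - slo_target
    return [
        AlertRule(long_w, short_w, burn, burn * error_budget, action)
        for long_w, short_w, burn, action in STANDARD_TUPLES
    ]

# Example: a 99.9% SLO yields a fast-burn page at a 1.44% error rate
# sustained over both the 1-hour and 5-minute windows.
for rule in rules_from_slo(0.999):
    print(rule)
```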
Zalando's outcome:
"We made it possible for the operations guarded by our alert handler to have their respective rules (length of the sliding windows and the alarm threshold) derived automatically from the SLO without any effort from the engineering teams, which was usually done through trial and error." — sources/2021-10-14-zalando-tracing-sres-journey-part-iii
Implementation shape (Zalando)¶
- systems/zalando-adaptive-paging consumes an operation's SLO definition (from the systems/zalando-service-level-management-tool) and generates MWMBR alert rules automatically.
- No per-operation threshold configuration for the common case. Operations with unusual characteristics (very low traffic, highly seasonal) still get overrides, but the default is zero config.
- Error budget becomes the paging primitive. A semantic shift: the handler pages on budget burn rate, not raw SLO breach. See concepts/error-budget.
Why it works¶
- Burn rate is the right quantity for alerting. Burn rate = (error rate) / (1 - SLO). A burn rate of N× means "at this rate, we'd exhaust the budget in 1/N of the SLO window." Because it is normalized by the error budget, a burn rate measured over the same window is cleanly comparable across operations with different SLO targets (worked through in the sketch after this list).
- Two windows eliminate spike false positives. A short window alone alerts on every transient spike; a long window alone is slower to reach significance and keeps alerting after the problem is fixed. Requiring both to breach filters spikes and keeps the page tied to errors that are still happening.
- Threshold library covers severity tiers. Fast-burn threshold pages immediately; slow-burn threshold opens tickets. Both derived from the same SLO.
- Zero per-operation tuning scales. Defining an SLO is a product-manager conversation; defining an alert rule is an on-call-engineer conversation. Unifying them reduces the second conversation to zero.
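A worked illustration of the burn-rate arithmetic and the two-window gate; the numbers and the should_page helper are illustrative sketches, not Zalando's code:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Burn rate = error rate / (1 - SLO); a rate of N means the budget lasts 1/N of the SLO window."""
    return error_rate / (1.0 - slo_target)

slo = 0.999
# A sustained 1% error rate against a 99.9% SLO is a 10x burn:
# a 28-day error budget would be gone in 2.8 days.
print(burn_rate(0.01, slo))                    # 10.0

def should_page(long_window_error_rate: float, short_window_error_rate: float,
                slo_target: float, burn_threshold: float = 14.4) -> bool:
    """Fast-burn page only when BOTH windows exceed the burn-rate threshold."""
    return (burn_rate(long_window_error_rate, slo_target) >= burn_threshold
            and burn_rate(short_window_error_rate, slo_target) >= burn_threshold)

print(should_page(0.02, 0.03, slo))    # True: 20x and 30x burn, page
print(should_page(0.0005, 0.05, slo))  # False: short spike only, long window healthy
```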
Preconditions¶
- SLO tooling must exist and cover the operation. Without a Service Level Management tool where SLOs are first-class data, the alert handler has nothing to read.
- Operation-based SLOs, not service-based. Per-operation SLOs map cleanly to user journeys; per-service SLOs don't, and produce noisier alerts.
- Consistent error attribution. The handler must be able to compute error rate per operation — typically via distributed tracing spans with success/failure attributes.
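A minimal sketch of that precondition, assuming spans carry an operation name and an error flag; the field names are illustrative, and a real pipeline would read status tags from the tracing backend:

```python
from collections import defaultdict

# Illustrative span records; a real handler would read these from the tracing backend.
spans = [
    {"operation": "GET /orders/{id}", "error": False},
    {"operation": "GET /orders/{id}", "error": True},
    {"operation": "POST /checkout",   "error": False},
]

def error_rates_by_operation(spans):
    """Aggregate spans into per-operation error rates, the input the alert handler consumes."""
    totals, errors = defaultdict(int), defaultdict(int)
    for span in spans:
        totals[span["operation"]] += 1
        errors[span["operation"]] += int(span["error"])
    return {op: errors[op] / totals[op] for op in totals}

print(error_rates_by_operation(spans))  # {'GET /orders/{id}': 0.5, 'POST /checkout': 0.0}
```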
Caveats¶
- Low-traffic operations are noisy. Burn-rate denominators at low QPS produce wild fluctuations. Some operations need a minimum-volume guard (sketched after this list).
- Seasonal patterns may still need overrides. A weekend-idle service can technically burn 100× on a single 10 AM Saturday error without being an incident. Time-of-day overrides remain an escape hatch.
- Doesn't eliminate alert fatigue if SLOs are wrong. If the SLO target is unrealistically tight, burn-rate alerting fires constantly. Garbage-in-garbage-out.
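A minimal sketch of such a minimum-volume guard; the request-count threshold and function name are illustrative choices, not from the source:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    return error_rate / (1.0 - slo_target)

def should_page_guarded(long_rate: float, short_rate: float, slo_target: float,
                        long_window_requests: int,
                        burn_threshold: float = 14.4, min_requests: int = 300) -> bool:
    """Suppress burn-rate pages when the long window carries too little traffic to be meaningful."""
    if long_window_requests < min_requests:
        return False   # too few samples: one or two errors would dominate the rate
    return (burn_rate(long_rate, slo_target) >= burn_threshold
            and burn_rate(short_rate, slo_target) >= burn_threshold)

# 12 requests in the window: never page, however bad the rate looks.
print(should_page_guarded(0.25, 0.5, 0.999, long_window_requests=12))     # False
print(should_page_guarded(0.02, 0.03, 0.999, long_window_requests=5000))  # True
```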
Seen in¶
- sources/2021-10-14-zalando-tracing-sres-journey-part-iii — Zalando's 2020 upgrade of Adaptive Paging: SLO-threshold paging replaced with MWMBR threshold calculation derived automatically from the SLO. "The length of the sliding windows and the alarm threshold derived automatically from the SLO without any effort from the engineering teams."