CONCEPT Cited by 1 source
Error budget¶
The error budget is the complement of an SLO: if the SLO target is 99.9% over a 28-day window, the error budget is the 0.1% of traffic the service is allowed to fail. The budget is a first-class operational resource — consumed by outages, bugs, and bad deploys; preserved by reliability work — and its depletion drives both alerting decisions (see concepts/multi-window-multi-burn-rate) and feature-velocity decisions (feature freezes when the budget is exhausted).
Definition¶
The Google SRE book's original framing: "The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow." Zalando operationalises this at the Critical Business Operation (CBO) altitude:
"The Error Budget became much more relevant with this change. The alerts went from being triggered whenever the error rate breached the SLO, to having the decision of whether a page should go out or not based on the rate we are burning the error budget for an operation." — sources/2022-04-27-zalando-operation-based-slos
Three properties:
- Derived, not chosen. Budget = (1 − SLO_target) × total traffic over the window. Once the SLO is agreed, the budget is arithmetic; no separate negotiation.
- Consumable. Every failed request debits the budget. Outages consume a lot quickly; slow regressions consume a little over time.
- Decision-driving. Budget state (healthy / burning / exhausted) is the primary input to reliability-vs-velocity decisions and to multi-window multi-burn-rate (MWMBR) alert thresholds.
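The first two properties can be sketched as a small ledger. This is a minimal illustration, not Zalando's or Google's implementation: the `ErrorBudget` class and its field names are hypothetical, and it assumes the window's total request count is already known.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical ledger for a single SLO window."""

    slo_target: float        # e.g. 0.999 for a 99.9% SLO
    total_requests: int      # requests in the window (assumed known)
    failed_requests: int = 0

    @property
    def allowed_failures(self) -> float:
        # Derived, not chosen: (1 - SLO_target) x total traffic.
        return (1.0 - self.slo_target) * self.total_requests

    def record_failures(self, count: int) -> None:
        # Consumable: every failed request debits the budget.
        self.failed_requests += count

    @property
    def remaining_fraction(self) -> float:
        # 1.0 = untouched, 0.0 = exhausted, negative = overspent.
        return 1.0 - self.failed_requests / self.allowed_failures
```

For example, a 99.9% SLO over 1,000,000 requests allows about 1,000 failures; after 250 failures, roughly 75% of the budget remains.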
Why the error budget matters¶
Without it, SLOs are metrics that get reported but don't drive behaviour:
- Feature-velocity vs reliability becomes political. Teams argue over "is this outage bad enough to stop shipping?" without an objective cutoff. With a budget, exhaustion is the cutoff.
- Alert thresholds become intuition-driven. Without a budget, engineers pick "alert when error rate > 0.5%" by gut. With the budget, the alert fires when the current burn rate would exhaust the budget within a set fraction of the SLO window, which is a principled threshold.
- Short-term spikes and long-term regressions aren't distinguishable. Raw-rate alerting catches both or neither. Budget-burn-rate alerting can distinguish them — fast burn = outage = page; slow burn = silent regression = ticket.
Error-budget policies¶
The policy is the rule for what happens when the budget is low / exhausted. Common shapes:
- Feature freeze on exhaustion. The team stops shipping net-new features and spends 100% of its effort on reliability work until the budget recovers. Canonical Google policy.
- Deploy-gating. Risky deploys are delayed; lower-risk fixes and reliability work proceed.
- Cross-team escalation. For multi-service / multi-team operations (e.g. a CBO), budget depletion triggers a cross-team response rather than just the nearest team's freeze.
Zalando's instantiation at the CBO altitude means the policy is inherited by every team touching the operation — so a single CBO's exhaustion can gate feature velocity across 15+ services whose changes could further hurt the CBO.
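One way to picture a policy is as a mapping from budget state to actions. The sketch below is illustrative only: the 25% "burning" cutoff, the function names, and the action strings are assumptions made for the example, not values from Zalando's post or the Google SRE book.

```python
from enum import Enum


class BudgetState(Enum):
    HEALTHY = "healthy"
    BURNING = "burning"
    EXHAUSTED = "exhausted"


# Hypothetical cutoff: less than 25% of the budget left counts as "burning".
BURNING_BELOW = 0.25


def budget_state(remaining_fraction: float) -> BudgetState:
    """Classify the budget from the fraction still unspent."""
    if remaining_fraction <= 0.0:
        return BudgetState.EXHAUSTED
    if remaining_fraction < BURNING_BELOW:
        return BudgetState.BURNING
    return BudgetState.HEALTHY


def policy_actions(state: BudgetState, multi_team: bool) -> list[str]:
    """Map a budget state to the policy shapes listed above."""
    if state is BudgetState.HEALTHY:
        return ["ship normally"]
    if state is BudgetState.BURNING:
        return ["gate risky deploys"]
    actions = ["freeze net-new features", "prioritise reliability work"]
    if multi_team:
        # For a CBO-style budget, exhaustion escalates beyond one team.
        actions.append("escalate across teams on the operation's path")
    return actions
```

Encoding the policy as a pure function keeps the decision auditable: the same budget state always yields the same actions, which is the point of taking the negotiation out of the loop.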
Budget consumption over multiple windows¶
Most orgs track the budget over multiple simultaneous windows:
- Long window (28 days): the authoritative SLO-compliance window; matches the SLO's definition.
- Short windows (1h, 6h): detect fast burn — an ongoing outage is exhausting this hour's slice of budget at 100× the sustainable rate.
- Medium windows (3 days, 7 days): detect slowly accelerating regressions.
Combining these is exactly the concepts/multi-window-multi-burn-rate alerting strategy Zalando adopted.
Typical error-budget calculations¶
For a 99.9% SLO over 28 days on a 1,000 RPS service:
- Total requests in window ≈ 1000 × 86,400 × 28 ≈ 2.42 × 10⁹.
- Budget = 0.1% × total ≈ 2.42 × 10⁶ failed requests permitted.
- Burn rate 1× = consuming the budget exactly over 28 days. Burn rate 14× = consuming the budget in 2 days.
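The arithmetic above can be reproduced in a few lines. This is a sketch that assumes a constant request rate across the window; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def window_budget(rps: int, slo_target: float, window_days: int):
    """Return (total requests, permitted failures) for the window.

    Assumes traffic is a constant `rps` for the whole window.
    """
    total = rps * SECONDS_PER_DAY * window_days
    return total, (1.0 - slo_target) * total


def days_to_exhaustion(burn_rate: float, window_days: int) -> float:
    """Days until a full budget is spent at a constant burn rate."""
    return window_days / burn_rate


total, budget = window_budget(rps=1_000, slo_target=0.999, window_days=28)
print(f"{total:.3e}")              # 2.419e+09 requests in the window
print(f"{budget:.3e}")             # 2.419e+06 failures permitted
print(days_to_exhaustion(1, 28))   # 28.0 days at burn rate 1x
print(days_to_exhaustion(14, 28))  # 2.0 days at burn rate 14x
```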
MWMBR alerts fire on combinations like:
- Burn rate > 14× over 1h AND burn rate > 14× over 5min → page immediately (fast burn).
- Burn rate > 6× over 6h AND burn rate > 6× over 30min → page (medium burn).
- Burn rate > 1× over 3 days AND burn rate > 1× over 6h → open ticket (slow burn).
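Zalando's post links the Google SRE Workbook chapter as the canonical threshold reference. The Workbook's table uses 14.4× for the fastest rule, computed for a 30-day window; the 14× above is a rounded figure.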
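The three rules can be sketched as a small evaluator. It assumes the per-window error rates have already been measured and are passed in as a dictionary; the `BurnRule` type, the window labels, and the `evaluate` function are hypothetical names for illustration. The thresholds are the ones listed above. Burn rate is taken as the observed error rate divided by the budgeted error rate (1 − SLO target).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BurnRule:
    long_window: str
    short_window: str
    threshold: float
    action: str


# Ordered from fastest to slowest burn; thresholds as listed above.
RULES = (
    BurnRule("1h", "5m", 14.0, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("3d", "6h", 1.0, "ticket"),
)


def burn_rate(error_rate: float, slo_target: float) -> float:
    # 1.0 means the budget is spent exactly over the SLO window.
    return error_rate / (1.0 - slo_target)


def evaluate(error_rates: dict[str, float], slo_target: float) -> Optional[str]:
    """Return the action of the first rule whose two windows both exceed
    the threshold, or None if no rule fires."""
    for rule in RULES:
        long_burn = burn_rate(error_rates[rule.long_window], slo_target)
        short_burn = burn_rate(error_rates[rule.short_window], slo_target)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            return rule.action
    return None
```

Requiring the short window to agree with the long one means the alert stops firing soon after the error rate recovers, rather than staying active until the long window drains.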
Interaction with CBOs¶
In an operation-based SLO model, each CBO has its own error budget — not each service. This is structurally different from the Google-book framing (which is service-keyed by default) and has two consequences:
- Exhaustion is felt by all teams on the CBO's path. A CBO budget exhausted on "checkout" forces feature freezes on inventory, pricing, payments, fraud, notifications — any team whose change could further hurt the CBO.
- Per-service budget becomes a debugging metric, not a decision metric. Per-service error rates still exist for diagnosis but stop driving ship-vs-freeze decisions.
Anti-patterns¶
- Budgets defined but never enforced. "We have a freeze policy but have never triggered it" means the budget is a metric, not a decision. The policy needs teeth — engineering leadership that actually pulls the cord.
- Budgets reset arbitrarily after an outage. Resetting the budget "because the outage was a one-off" destroys the primitive. The budget is whatever window-over-window math says; edits are reliability theatre.
- Budget too generous to bind. A 99.5% SLO on a system with 99.95% actual availability means the budget is never exhausted. The freeze never fires; the discipline never takes hold. Set targets that realistically bind sometimes.
- No short-window fast-burn alerting. A 28-day-only budget alert fires too late to respond to a 2-hour outage. MWMBR addresses this; single-window budget tracking does not.
Seen in¶
- sources/2022-04-27-zalando-operation-based-slos — Zalando's canonicalisation of the error budget as the alert-threshold driver once CBO alerting moved from raw-rate to MWMBR. "Because the Error Budget is derived from the SLO, it is still the SLO that made it possible to derive the alert threshold automatically."
Related¶
- concepts/service-level-objective — the primitive the budget is derived from.
- concepts/service-level-indicator — the metric the SLO is measured over.
- concepts/multi-window-multi-burn-rate — the alerting strategy that operationalises budget-burn-rate.
- concepts/operation-based-slo — the SLO altitude at which CBO budgets replace service budgets.
- concepts/critical-business-operation
- concepts/symptom-based-alerting
- concepts/alert-fatigue
- systems/google-sre-book