CONCEPT Cited by 2 sources

Error budget

The error budget is the complement of an SLO: if the SLO target is 99.9% over a 28-day window, the error budget is the 0.1% of traffic the service is allowed to fail. The budget is a first-class operational resource — consumed by outages, bugs, and bad deploys; preserved by reliability work — and its depletion drives both alerting decisions (see concepts/multi-window-multi-burn-rate) and feature-velocity decisions (feature freezes when the budget is exhausted).

Definition

The Google SRE book's original framing: "The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow." Zalando operationalises this at the level of Critical Business Operations (CBOs):

"The Error Budget became much more relevant with this change. The alerts went from being triggered whenever the error rate breached the SLO, to having the decision of whether a page should go out or not based on the rate we are burning the error budget for an operation." (sources/2022-04-27-zalando-operation-based-slos)

Three properties:

  1. Derived, not chosen. Budget = (1 − SLO_target) × total traffic over the window. Once the SLO is agreed, the budget is arithmetic; no separate negotiation.
  2. Consumable. Every failed request debits the budget. Outages consume a lot quickly; slow regressions consume a little over time.
  3. Decision-driving. Budget state (healthy / burning / exhausted) is the primary input to reliability-vs-velocity decisions and to multi-window, multi-burn-rate (MWMBR) alert thresholds.
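
The three properties can be sketched as a small accounting function. This is a minimal illustration, not Zalando's or Google's implementation: the names `BudgetStatus` and `budget_status` and the `burning_below` cutoff are hypothetical, chosen only to show how the budget is derived from the SLO, debited by failures, and turned into a state.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float    # total budget for the window, in requests
    remaining_failures: float  # budget left; negative once overspent
    remaining_fraction: float  # remaining / allowed
    state: str                 # "healthy", "burning", or "exhausted"


def budget_status(slo_target: float, total_requests: int, failed_requests: int,
                  burning_below: float = 0.5) -> BudgetStatus:
    """Derive the budget from the SLO and debit it with observed failures.

    burning_below is a hypothetical cutoff: with less than this fraction of
    the budget left, the state is reported as "burning".
    """
    # Derived, not chosen: the budget follows from the SLO and the traffic.
    allowed = (1.0 - slo_target) * total_requests
    # Consumable: every failed request debits the budget.
    remaining = allowed - failed_requests
    fraction = remaining / allowed if allowed > 0 else 0.0
    # Decision-driving: collapse the numbers into a state.
    if remaining <= 0:
        state = "exhausted"
    elif fraction < burning_below:
        state = "burning"
    else:
        state = "healthy"
    return BudgetStatus(allowed, remaining, fraction, state)
```

For a 99.9% SLO and one million requests, the budget is about 1,000 failed requests; 250 failures leaves the state "healthy" under the assumed cutoff, while 1,200 failures reports "exhausted".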

Why the error budget matters

Without it, SLOs are metrics that get reported but don't drive behaviour:

  • Feature-velocity vs reliability becomes political. Teams argue over "is this outage bad enough to stop shipping?" without an objective cutoff. With a budget, exhaustion is the cutoff.
  • Alert thresholds become intuition-driven. Without a budget, engineers pick "alert when error rate > 0.5%" by gut. With the budget, the alert fires when the current burn rate would exhaust the budget within some fraction of the SLO window, which is a principled threshold.
  • Short-term spikes and long-term regressions aren't distinguishable. Raw-rate alerting catches both or neither. Budget-burn-rate alerting can distinguish them — fast burn = outage = page; slow burn = silent regression = ticket.

Error-budget policies

The policy is the rule for what happens when the budget is low / exhausted. Common shapes:

  • Feature freeze on exhaustion. The team stops shipping net-new features and spends 100% on reliability work until the budget recovers. Canonical Google policy.
  • Deploy-gating. Risky deploys are delayed; lower-risk fixes and reliability work proceed.
  • Cross-team escalation. For multi-service / multi-team operations (e.g. a CBO), budget depletion triggers a cross-team response rather than just the nearest team's freeze.

Zalando's instantiation at the CBO altitude means the policy is inherited by every team touching the operation — so a single CBO's exhaustion can gate feature velocity across 15+ services whose changes could further hurt the CBO.
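
The policy shapes above could be encoded roughly as follows. This is a sketch under stated assumptions: the change kinds ("feature", "risky", "reliability"), the `low_watermark` value, and the `CBO_TEAMS` mapping are all hypothetical; the team names are borrowed from the "checkout" example on this page.

```python
from typing import Dict, List

# Hypothetical mapping from a CBO to the teams on its path. In practice this
# would come from tracing or service ownership data, not a hard-coded dict.
CBO_TEAMS: Dict[str, List[str]] = {
    "checkout": ["inventory", "pricing", "payments", "fraud", "notifications"],
}


def gate_change(remaining_fraction: float, change_kind: str,
                low_watermark: float = 0.25) -> str:
    """Return "allow", "delay", or "block" for a proposed change.

    change_kind is one of "feature", "risky", "reliability" (assumed labels).
    low_watermark is an assumed cutoff for when the budget counts as low.
    """
    if change_kind == "reliability":
        return "allow"      # reliability work always proceeds
    if remaining_fraction <= 0:
        return "block"      # feature freeze on exhaustion
    if remaining_fraction < low_watermark and change_kind == "risky":
        return "delay"      # deploy-gating while the budget is low
    return "allow"


def frozen_teams(cbo: str, remaining_fraction: float) -> List[str]:
    """Cross-team escalation: every team on the CBO's path inherits the freeze."""
    if remaining_fraction <= 0:
        return list(CBO_TEAMS.get(cbo, []))
    return []
```

The point of writing the policy as a function is that the outcome depends only on budget state and change type, so the decision is not renegotiated per incident.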

Budget consumption over multiple windows

Most orgs track the budget over multiple simultaneous windows:

  • Long window (28 days): the authoritative SLO-compliance window; matches the SLO's definition.
  • Short windows (1h, 6h): detect fast burn, for example an ongoing outage consuming budget at 100× the sustainable rate.
  • Medium windows (3 days, 7 days): detect slowly accelerating regressions.

Combining these is exactly the concepts/multi-window-multi-burn-rate alerting strategy Zalando adopted.
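
A burn rate for each window can be computed from the same stream of counts. The sketch below assumes per-minute `(total, failed)` samples held in memory, oldest first; `window_burn_rate` is a hypothetical helper, and a real system would more likely issue the equivalent query against its metrics store.

```python
from typing import List, Tuple


def window_burn_rate(samples: List[Tuple[int, int]], window_minutes: int,
                     slo_target: float) -> float:
    """Burn rate over the trailing window.

    samples: per-minute (total_requests, failed_requests), oldest first.
    A burn rate of 1.0 means errors arrive at exactly the rate the SLO allows.
    """
    recent = samples[-window_minutes:] if window_minutes > 0 else []
    total = sum(t for t, _ in recent)
    failed = sum(f for _, f in recent)
    if total == 0:
        return 0.0  # no traffic in the window: nothing was burned
    error_rate = failed / total
    return error_rate / (1.0 - slo_target)


# Five healthy hours followed by one hour at a 10% error rate.
samples = [(1000, 0)] * 300 + [(1000, 100)] * 60
one_hour = window_burn_rate(samples, 60, 0.999)    # roughly 100
six_hours = window_burn_rate(samples, 360, 0.999)  # roughly 16.7
```

The same incident reads as a burn rate near 100 on the 1h window and near 17 on the 6h window, which is why the windows are tracked side by side rather than collapsed into one number.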

Typical error-budget calculations

For a 99.9% SLO over 28 days on a 1,000 RPS service:

  • Total requests in window ≈ 1000 × 86,400 × 28 ≈ 2.42 × 10⁹.
  • Budget = 0.1% × total ≈ 2.42 × 10⁶ failed requests permitted.
  • Burn rate 1× = consuming the budget exactly over 28 days. Burn rate 14× = consuming the budget in 2 days.
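
The arithmetic above can be reproduced in a few lines. This sketch assumes constant traffic over the window; the function names are illustrative only.

```python
from typing import Tuple

SECONDS_PER_DAY = 86_400


def window_budget(rps: int, slo_target: float, window_days: int) -> Tuple[int, float]:
    """Return (total requests, permitted failed requests) for the window."""
    total = rps * SECONDS_PER_DAY * window_days
    return total, (1.0 - slo_target) * total


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate as a multiple of the rate the SLO permits."""
    return error_rate / (1.0 - slo_target)


def days_to_exhaustion(rate: float, window_days: int) -> float:
    """Days until a full budget is gone at a constant burn rate."""
    return window_days / rate


total, budget = window_budget(1000, 0.999, 28)   # about 2.42e9 and 2.42e6
rate = burn_rate(0.014, 0.999)                   # a 1.4% error rate is about 14x
days = days_to_exhaustion(14.0, 28)              # 2 days
```

Expressing the threshold as a burn rate rather than a raw error rate keeps it valid if the SLO target changes: a 1.4% error rate is 14× for a 99.9% target but only 1.4× for a 99% target.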

MWMBR alerts fire on combinations like:

  • Burn rate > 14× over 1h AND burn rate > 14× over 5min → page immediately (fast burn).
  • Burn rate > 6× over 6h AND burn rate > 6× over 30min → page (medium burn).
  • Burn rate > 1× over 3 days AND burn rate > 1× over 6h → open ticket (slow burn).
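
The three rules can be evaluated with a small table-driven check. This is a sketch using the thresholds listed above; the `Rule` type, the window labels, and the `evaluate` function are hypothetical, and the burn rates per window are assumed to be computed elsewhere.

```python
from typing import Mapping, NamedTuple, Optional


class Rule(NamedTuple):
    threshold: float   # burn rate that both windows must exceed
    long_window: str   # window showing the burn is significant
    short_window: str  # window showing the burn is still happening
    action: str        # "page" or "ticket"


# Ordered from most to least urgent, mirroring the list above.
RULES = [
    Rule(14.0, "1h", "5m", "page"),
    Rule(6.0, "6h", "30m", "page"),
    Rule(1.0, "3d", "6h", "ticket"),
]


def evaluate(burn_rates: Mapping[str, float]) -> Optional[str]:
    """Return the action of the first rule whose two windows both exceed
    the threshold, or None if no rule fires."""
    for rule in RULES:
        if (burn_rates[rule.long_window] > rule.threshold
                and burn_rates[rule.short_window] > rule.threshold):
            return rule.action
    return None
```

Requiring both windows is what suppresses noise: a five-minute spike with a quiet hour behind it fires nothing, and an alert clears soon after the short window recovers.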

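Zalando's post links the Google SRE Workbook chapter as the canonical threshold reference. The Workbook's table assumes a 30-day window and uses 14.4× for the 1h threshold, which the 14× above approximates.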

Interaction with CBOs

In an operation-based SLO model, each CBO has its own error budget — not each service. This is structurally different from the Google-book framing (which is service-keyed by default) and has two consequences:

  • Exhaustion is felt by all teams on the CBO's path. A CBO budget exhausted on "checkout" forces feature freezes on inventory, pricing, payments, fraud, notifications — any team whose change could further hurt the CBO.
  • Per-service budget becomes a debugging metric, not a decision metric. Per-service error rates still exist for diagnosis but stop driving ship-vs-freeze decisions.

Anti-patterns

  • Budgets defined but never enforced. "We have a freeze policy but have never triggered it" means the budget is a metric, not a decision. The policy needs teeth — engineering leadership that actually pulls the cord.
  • Budgets reset arbitrarily after an outage. Resetting the budget "because the outage was a one-off" destroys the primitive. The budget is whatever window-over-window math says; edits are reliability theatre.
  • Budget too generous to bind. A 99.5% SLO on a system with 99.95% actual availability means the budget is never exhausted. The freeze never fires; the discipline never takes hold. Set targets that realistically bind sometimes.
  • No short-window fast-burn alerting. A 28-day-only budget alert fires too late to respond to a 2-hour outage. MWMBR addresses this; unidimensional budget tracking does not.
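
Two of these anti-patterns can be checked with quick arithmetic. The sketch below assumes uniform traffic and treats an outage as total (every request fails); both helper names are hypothetical.

```python
MINUTES_PER_DAY = 1_440


def steady_state_budget_use(slo_target: float, actual_availability: float) -> float:
    """Fraction of the budget consumed per window at the current availability.

    Values far below 1.0 suggest the target is too loose to ever bind.
    """
    return (1.0 - actual_availability) / (1.0 - slo_target)


def full_outage_minutes_allowed(slo_target: float, window_days: int = 28) -> float:
    """Minutes of total outage that consume the whole budget for the window."""
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


use = steady_state_budget_use(0.995, 0.9995)   # about 0.1 of the budget
allowed = full_outage_minutes_allowed(0.999)   # about 40 minutes
```

Under these assumptions, the 99.5% target on a 99.95% system spends only about a tenth of its budget per window, and a 99.9% target over 28 days absorbs roughly 40 minutes of total outage, so a 2-hour outage overspends the budget about threefold before a 28-day-only alert would be useful.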

Seen in

  • sources/2022-04-27-zalando-operation-based-slos: Zalando's canonicalisation of the error budget as the alert-threshold driver once CBO alerting moved from raw-rate to MWMBR. "Because the Error Budget is derived from the SLO, it is still the SLO that made it possible to derive the alert threshold automatically."
  • sources/2021-10-14-zalando-tracing-sres-journey-part-iii: documents two elevations of Error Budget to first-class primitive in 2020. (a) Paging primitive: "Deciding whether to page someone or not was no longer whether the SLO was breached or not, but rather whether the Error Budget was in risk of being depleted or not." Adaptive Paging pages on budget burn rate, not raw SLO breach. (b) UI primitive: the new Service Level Management tool adds a view of remaining Error Budget per operation, "to steer prioritization of development work." Without budget visibility, feature-velocity-vs-reliability trades can't happen.