

SRE KPI portfolio

An SRE KPI portfolio is the minimum measurable-outcome set an SRE function adopts to steer its own work. Without a portfolio, claims like "the incident process is better now" remain anecdotal; with one, program-level decisions (investment, scope, hiring, embedded teams) become justifiable.

The canonical portfolio

Zalando's 2020 SRE department, on formation, named its portfolio explicitly:

"One of the first things we did after creating the department was to define the KPIs that would guide our work, make sure they were being measured, and facilitate the reporting of those KPIs." (sources/2021-10-14-zalando-tracing-sres-journey-part-iii)

Four KPIs for the incident pipeline:

| KPI | Definition | Lever |
| --- | --- | --- |
| Incident count | Real user-impacting events per period | Reliability engineering, root-cause fixes |
| MTTR | Mean time to resolve per incident | Incident response tooling, runbooks, on-call skill |
| False positive rate | Non-incident pages / total pages | concepts/symptom-based-alerting, concepts/multi-window-multi-burn-rate |
| Customer impact | User-visible effect per incident (time × scope) | SLO coverage + graceful degradation |
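Under these definitions, a reporting pipeline can compute the four incident-pipeline KPIs from a log of classified pages. A minimal sketch, assuming an illustrative `Page` record and a time × scope unit for customer impact (neither is Zalando's published schema):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Page:
    is_incident: bool           # post-hoc classification: real user-impacting event?
    minutes_to_resolve: float   # mitigation time; meaningful for incidents only
    affected_users: int         # scope of the user-visible effect

def kpi_snapshot(pages: list[Page]) -> dict:
    """Compute the four incident-pipeline KPIs over one reporting window."""
    incidents = [p for p in pages if p.is_incident]
    return {
        "incident_count": len(incidents),
        "mttr_minutes": mean(p.minutes_to_resolve for p in incidents) if incidents else 0.0,
        # FPR = non-incident pages / total pages
        "false_positive_rate": 1 - len(incidents) / len(pages) if pages else 0.0,
        # customer impact as time x scope, summed over incidents (assumed unit)
        "customer_impact": sum(p.minutes_to_resolve * p.affected_users for p in incidents),
    }
```

Note that every field here presupposes the anomaly/incident classification discussed below; without it, `is_incident` has no defensible value.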

A fifth KPI, on-call health, is aimed at the team rather than the pipeline.

Why each KPI is load-bearing

  • Incident count alone rewards hiding events. A team that defines "incident" narrowly, or reclassifies real events as anomalies, looks great on this KPI while customer impact worsens. Pair with customer impact.
  • MTTR alone rewards racing the mitigation clock. A team that mitigates aggressively with fragile workarounds shortens MTTR but may grow incident count. Pair with incident count.
  • False positive rate alone rewards muting alerts. A team that turns alerts off scores perfectly on FPR while real incidents go unpaged. Pair with customer impact, which spikes when alerts stop firing on real events.
  • Customer impact alone is hard to measure accurately per event: it requires SLO coverage plus attribution of user traffic to failed operations.
  • On-call health alone rewards scheduling few pages. It is bounded from above by availability: if SLOs slip while on-call health falls silent, that isn't health, it's undermeasurement.

The portfolio shape (four incident-pipeline KPIs plus on-call health) is designed so that optimising any single one hurts another.

The anomaly/incident separation makes the portfolio measurable

Without concepts/anomaly-vs-incident-separation, these KPIs collapse into each other:

  • If a 30-second anomaly counts as an "incident", incident count balloons, MTTR drops (mitigation time ≈ 0), and the false positive rate becomes ambiguous.
  • If every page is classified after the fact as anomaly or incident, the four KPIs separate cleanly — FPR measures anomaly rate, incident count measures real events, MTTR measures mitigation speed on confirmed incidents, customer impact is measurable only on incidents.

The separation is a precondition for the portfolio to steer decisions.
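The post-hoc split can be sketched as a triage function plus the FPR it enables. This is a sketch under assumptions: the impact-duration criterion and the 5-minute threshold are illustrative, not Zalando's published rule.

```python
def classify_page(user_impact_seconds: float, threshold_seconds: float = 300) -> str:
    """Post-hoc triage of a page: an incident only if it caused sustained
    user-visible impact; otherwise an anomaly. Threshold is an assumption."""
    return "incident" if user_impact_seconds >= threshold_seconds else "anomaly"

def false_positive_rate(classifications: list[str]) -> float:
    """FPR = non-incident pages / total pages; measurable only once every
    page carries an after-the-fact anomaly/incident label."""
    return classifications.count("anomaly") / len(classifications)
```

Only after this classification do the remaining KPIs (incident count, MTTR, customer impact) have a well-defined population to range over.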

Interaction with SLOs and Error Budget

An operation-based SLO portfolio and an SRE KPI portfolio sit at different abstraction levels:

  • SLOs / Error Budget: per operation, per user journey. Measures whether the product is meeting its reliability contract with users. Drives alerting (via concepts/multi-window-multi-burn-rate) and feature prioritisation (via budget remainder).
  • SRE KPI portfolio: per SRE program. Measures whether the reliability function is improving. Drives investment and staffing decisions within SRE.

A program can have healthy SLOs while the SRE KPI portfolio is deteriorating (e.g. incident count is low but MTTR is rising because investigation tooling decayed), and vice versa. Both portfolios are needed.

Embedded SRE team alignment

Zalando's Embedded SRE team for Checkout aligns with product-area management on two KPIs:

  • Availability — drawn from the product area's operation-based SLOs.
  • On-Call Health — from the SRE KPI portfolio.

The portfolio provides the lingua franca for alignment across reporting chains; both the embedded team (which reports to the SRE department) and the product area agree on the same definitions.

Caveats

  • Lagging indicators. MTTR, FPR, and customer impact are computed over multi-week windows. Don't expect daily responsiveness.
  • Reporting infrastructure is load-bearing. Zalando names "facilitate the reporting of those KPIs" as part of the initial effort. Without dashboards and automation, KPI reporting decays into quarterly manual rollups.
  • Portfolio composition varies. Not every org measures all five cleanly. Organisations with per-service ownership may measure MTTR per service rather than per program.
  • Customer impact is ambiguous. The Zalando post names it but doesn't give a unit. Possible formulations: error minutes × affected users, or error-budget burn × traffic volume; both require SLO coverage.
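The two candidate formulations from the customer-impact caveat can be written down directly. Both are assumptions on my part, since the source gives no unit:

```python
def impact_error_minutes(incidents: list[tuple[float, int]]) -> float:
    """Formulation 1 (assumed): sum over incidents of
    error minutes x users affected during those minutes."""
    return sum(minutes * users for minutes, users in incidents)

def impact_budget_burn(burn_fraction: float, traffic_volume: int) -> float:
    """Formulation 2 (assumed): fraction of the error budget burned
    x request volume in the window. Requires SLO coverage so the
    burn fraction is defined per operation."""
    return burn_fraction * traffic_volume
```

Either way, the formulation only produces numbers where SLOs already attribute user traffic to failed operations, which is why SLO coverage is listed as the lever for this KPI.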
