CONCEPT Cited by 1 source
On-call health metric¶
On-call health is a first-class, measurable KPI for an engineering team's on-call program — not a soft concern. The standard definition composes paging rate (how many pages per on-call week) and individual on-call frequency (how often a given engineer is on the rotation) into a health signal that reliability programs own alongside Availability.
Definition¶
Zalando names the KPI directly while describing its 2020 Embedded SRE team's measurement contract:
"On-call Health will be measured taking into account paging alerts and how often an individual is on-call." — sources/2021-10-14-zalando-tracing-sres-journey-part-iii
Two core inputs:
- Paging rate: pages per on-call shift (or per week). It counts both real incidents and false positives, so it directly indicates whether alerting is noisy.
- Individual on-call frequency: how often a given engineer takes the pager. It is a function of rotation size: with equal shifts, each engineer in a 3-person rotation is on call 4× as often as each engineer in a 12-person rotation, and so absorbs 4× the pages for the same alert volume.
The composite matters because a high paging rate is tolerable in a large rotation and unbearable in a small one.
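The arithmetic behind the composite can be sketched in a few lines of Python. This is a minimal illustration, not Zalando's formula: the source only says the KPI takes both inputs "into account". The `Rotation` type, the equal-shift assumption, and the choice to multiply the two inputs are all assumptions made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rotation:
    """Hypothetical inputs for one on-call rotation."""
    engineers: int         # people sharing the pager
    pages_per_week: float  # paging alerts received by whoever is on call


def on_call_fraction(rotation: Rotation) -> float:
    """Share of weeks a given engineer holds the pager (assumes equal shifts)."""
    return 1 / rotation.engineers


def pages_per_engineer_per_week(rotation: Rotation) -> float:
    """Long-run average pages each engineer absorbs per calendar week."""
    return rotation.pages_per_week / rotation.engineers


# Same alert volume, different rotation sizes.
small = Rotation(engineers=3, pages_per_week=6)
large = Rotation(engineers=12, pages_per_week=6)

small_load = pages_per_engineer_per_week(small)  # 2.0
large_load = pages_per_engineer_per_week(large)  # 0.5
ratio = small_load / large_load                  # 4.0
```

Under these assumptions the same six pages a week cost each engineer four times as much in the 3-person rotation, which is the gap a paging-rate number alone would hide.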
Why it is a first-class KPI¶
Zalando names the motivation explicitly:
"Pager fatigue is something that should not be dismissed, and can hurt a team through lower productivity and employee attrition." — sources/2021-10-14-zalando-tracing-sres-journey-part-iii
Treating on-call health as measurable — rather than "the on-call person said it was rough this week" — makes reliability decisions quantitative:
- A feature launch that doubles the paging rate is visible in the metric the week after launch, not in anecdotes six months later.
- Reducing the false-positive rate via concepts/symptom-based-alerting and concepts/anomaly-vs-incident-separation shows up as a lower paging rate, which makes the investment justifiable.
- Rotation-size decisions (hire to expand the rotation? cross-train with an adjacent team?) have a clear signal they are expected to improve.
Relationship to other KPIs¶
On-Call Health is part of the concepts/sre-kpi-portfolio that Zalando established when creating the SRE department:
| KPI | Measures |
|---|---|
| Incident count | Volume of real incidents |
| MTTR | How fast the org resolves |
| False positive rate | Alert quality |
| Customer impact | Outcome on users |
| On-Call Health (this page) | Outcome on responders |
Customer impact and On-Call Health are the outcome-side KPIs — they measure what the program is for. The others measure the pipeline.
Interaction with Error Budget¶
An error-budget over-alert (for example, paging on a momentary SLO dip that has not materially depleted the budget) is an on-call-health-negative event. Adopting concepts/multi-window-multi-burn-rate alerting, which pages on budget burn rate rather than on raw SLO breach, was, for Zalando, partly a noise reducer (fewer pages on spikes) and partly an on-call-health investment (pages are now signal-dense by construction).
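A rough sketch of why burn-rate paging suppresses momentary dips follows. It assumes the commonly cited pairing of a long window and a short window that must both exceed the same burn-rate threshold. The function names and the 14.4 threshold (a value often quoted for a 1-hour/5-minute pair against a 30-day SLO) are illustrative assumptions, not Zalando's actual configuration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent.

    error_ratio: fraction of failed requests in the window (0.0 to 1.0).
    slo_target:  e.g. 0.999 for a 99.9% SLO; the error budget is 1 - slo_target.
    """
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if BOTH windows burn faster than the threshold.

    The long window shows the budget is materially depleting; the short
    window shows the problem is still happening right now.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


SLO = 0.999

# Momentary spike: the short window is bad, the long window barely moved.
spike_pages = should_page(long_window_error_ratio=0.002,
                          short_window_error_ratio=0.05,
                          slo_target=SLO)

# Sustained burn: both windows are well above the threshold.
sustained_pages = should_page(long_window_error_ratio=0.02,
                              short_window_error_ratio=0.02,
                              slo_target=SLO)
```

In this sketch the spike does not page while the sustained burn does, so the pages that remain correspond to real budget depletion.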
Embedded-SRE alignment contract¶
The specific shape Zalando's Embedded SRE team uses: dual KPIs agreed between the SRE department and product area management:
- Availability — driven by the product area's SLOs.
- On-Call Health — paging rate + individual on-call frequency.
The two KPIs point in opposite directions under naïve design: chasing availability by adding alerts costs on-call health; coasting on on-call health by muting alerts costs availability. Holding the team accountable for both forces principled trade-offs (e.g. reducing noisy alerts, or fixing the root cause rather than suppressing the alert).
Failure modes¶
- Paging rate alone is not enough. A single paging-rate number doesn't distinguish 2 pages/week in a 4-person rotation (bearable) from 2 pages/week in a 12-person rotation (anomalously quiet, maybe under-alerting). Always read it together with rotation size.
- Lagging indicator for attrition. On-call-health drops may only translate to employee departures months later. Fast feedback requires leading signals (surveys, explicit complaints in retros).
- Measuring only the pager ignores the non-pager load. Tickets, async investigations, and PRR work can consume on-call engineers' time without firing pages. On-call health should be combined with time-spent metrics over multi-week windows.
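One way to make the last point concrete is to total pager and non-pager hours per engineer over a multi-week window. The sketch below is hypothetical throughout: the log format, the work categories, and the names `entries` and `load_by_engineer` are invented for illustration and do not come from the source.

```python
from collections import defaultdict

# Hypothetical time log: (week number, engineer, kind of work, hours spent).
entries = [
    (1, "alice", "page", 2.0),
    (1, "alice", "ticket", 6.0),
    (2, "alice", "investigation", 5.0),
    (2, "bob", "page", 1.0),
    (3, "bob", "ticket", 3.0),
    (5, "bob", "page", 4.0),  # outside the window queried below
]


def load_by_engineer(entries, first_week, last_week):
    """Sum pager and non-pager hours per engineer for weeks in [first_week, last_week]."""
    totals = defaultdict(lambda: {"pager": 0.0, "non_pager": 0.0})
    for week, engineer, kind, hours in entries:
        if first_week <= week <= last_week:
            bucket = "pager" if kind == "page" else "non_pager"
            totals[engineer][bucket] += hours
    return dict(totals)


window = load_by_engineer(entries, first_week=1, last_week=3)
# alice: 2.0 pager hours but 11.0 non-pager hours in the same three weeks
# bob:   1.0 pager hours and 3.0 non-pager hours
```

In this made-up data a pager-only view would rate alice's three weeks as light, while the combined view shows most of her on-call load never fired a page.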
Seen in¶
- sources/2021-10-14-zalando-tracing-sres-journey-part-iii: Zalando's Embedded SRE team gets On-Call Health as an explicit KPI alongside Availability, defined as "paging alerts and how often an individual is on-call". The source names pager fatigue as a cause of lower productivity and employee attrition.