Skip to content

CONCEPT Cited by 1 source

On-call health metric

On-call health is a first-class, measurable KPI for an engineering team's on-call program — not a soft concern. The standard definition composes paging rate (how many pages per on-call week) and individual on-call frequency (how often a given engineer is on the rotation) into a health signal that reliability programs own alongside Availability.

Definition

Zalando names the KPI directly while describing its 2020 Embedded SRE team's measurement contract:

"On-call Health will be measured taking into account paging alerts and how often an individual is on-call."sources/2021-10-14-zalando-tracing-sres-journey-part-iii

Two core inputs:

  1. Paging rate — pages per on-call shift (or per week). Counts both real incidents and false positives. Direct indicator of whether alerting is noisy.
  2. Individual on-call frequency — how often a given engineer takes the pager. Function of rotation size; a 3-person rotation pages each engineer 4× more often than a 12-person rotation for the same alert volume.

The composite matters because high paging rate is tolerable in a large rotation and unbearable in a small one.

Why it is a first-class KPI

Zalando names the motivation explicitly:

"Pager fatigue is something that should not be dismissed, and can hurt a team through lower productivity and employee attrition."sources/2021-10-14-zalando-tracing-sres-journey-part-iii

Treating on-call health as measurable — rather than "the on-call person said it was rough this week" — makes reliability decisions quantitative:

  • A new feature's launch increasing paging rate × 2 is visible in the metric the week after launch, not in anecdotes six months later.
  • Reducing false-positive rate via concepts/symptom-based-alerting and concepts/anomaly-vs-incident-separation shows up as paging-rate reduction. The investment is justifiable.
  • Rotation-size decisions (hire to expand rotation? cross- train with an adjacent team?) have a clear signal to improve.

Relationship to other KPIs

Part of the concepts/sre-kpi-portfolio Zalando established when creating the SRE department:

KPI Measures
Incident count Volume of real incidents
MTTR How fast the org resolves
False positive rate Alert quality
Customer impact Outcome on users
On-Call Health (this page) Outcome on responders

Customer impact and On-Call Health are the outcome-side KPIs — they measure what the program is for. The others measure the pipeline.

Interaction with Error Budget

An error budget over-alert — for example, paging on a momentary SLO dip that has not materially depleted the budget — is an on-call-health-negative event. Adopting concepts/multi-window-multi-burn-rate alerting on budget burn rate rather than raw SLO breach was, for Zalando, partly a noise reducer (fewer pages on spikes) and partly an on-call-health investment (pages are now signal- dense by construction).

Embedded-SRE alignment contract

The specific shape Zalando's Embedded SRE team uses: dual KPIs agreed between the SRE department and product area management:

  • Availability — driven by the product area's SLOs.
  • On-Call Health — paging rate + individual on-call frequency.

The two KPIs point in opposite directions under naïve design — chasing availability by adding alerts costs on-call health; coasting on on-call health by muting alerts costs availability. Holding both accountable forces the team to make principled trade-offs (e.g. reduce noisy alerts, fix the root cause rather than suppress the alert).

Failure modes

  • Paging rate alone is not enough. A single paging-rate number doesn't distinguish 2 pages/week in a 4-person rotation (bearable) from 2 pages/week in a 12-person rotation (anomalously quiet, maybe under-alerting). Composite with rotation size.
  • Lagging indicator for attrition. On-call-health drops may only translate to employee departures months later. Fast feedback requires leading signals (surveys, explicit complaints in retros).
  • Measuring only the pager ignores the non-pager load. Tickets, async investigations, PRR work can consume on- call engineers without firing pages. On-call health should be composed with time-spent metrics over multi- week windows.

Seen in

  • sources/2021-10-14-zalando-tracing-sres-journey-part-iii — Zalando's Embedded SRE team gets On-Call Health as an explicit KPI alongside Availability, defined as "paging alerts and how often an individual is on-call". Cited alongside pager-fatigue as a driver of productivity and employee attrition.
Last updated · 550 distilled / 1,221 read