CONCEPT Cited by 1 source
On-call rotation
An on-call rotation is the schedule that assigns a small set of engineers to be reachable for production incidents during a defined window — typically 24×7, rotating weekly or bi-weekly. Rotation size and membership determine sleep cost per engineer, response latency, and the breadth of system knowledge each responder must carry.
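As a rough illustration of the schedule described above, the sketch below assigns shifts round-robin and estimates how often each engineer is on call. The roster names, the start date, and the helper names `on_call_for` and `shifts_per_year` are hypothetical and do not come from the source. Real schedules also handle overrides, holidays, and follow-the-sun handoffs, which this sketch ignores.

```python
from datetime import date


def on_call_for(day: date, engineers: list[str], start: date, shift_days: int = 7) -> str:
    """Return who is on call on `day` in a simple round-robin rotation.

    `start` is the first day of the first shift; each shift lasts `shift_days`.
    """
    if not engineers:
        raise ValueError("rotation needs at least one engineer")
    if day < start:
        raise ValueError("day is before the rotation start")
    shift_index = (day - start).days // shift_days
    return engineers[shift_index % len(engineers)]


def shifts_per_year(rotation_size: int, shift_days: int = 7) -> float:
    """Average number of shifts each engineer carries per 365-day year."""
    return 365 / shift_days / rotation_size


# Hypothetical roster and start date, for illustration only.
roster = ["ana", "ben", "chen", "dee"]
start = date(2021, 9, 13)

print(on_call_for(date(2021, 9, 20), roster, start))  # second shift of the rotation
print(round(shifts_per_year(len(roster)), 1))         # weekly shifts per engineer per year
```

Halving the roster doubles `shifts_per_year` for everyone left on it. That is the sleep-cost trade-off in its simplest form.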
Scaling break point: monolith vs microservice era
Zalando describes this break point directly. Before the microservice migration, 5 on-call teams covered the entire Zalando stack because the services were monoliths with similar monitoring and operations, and each team had a large rotation whose members understood the domain well:
"Before we started the microservice migration, our service landscape was small enough that 5 on-call teams could cover the whole stack. Each team had a large enough rotation, and the domain was well understood by each team member. The monoliths were also quite similar in terms of monitoring and operations…" (Source: sources/2021-09-12-zalando-tracing-sres-journey-in-zalando-part-i)
Once the stack moved to microservices, the same 5 teams could no longer cover it. Service count grew faster than team count, and per-service standardisation was low: monitoring, alerting, and on-call runbooks differed from service to service. The bottleneck became the cognitive load on responders, not the raw workload.
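One hedged way to make "cognitive load" concrete is to count how many distinct operational profiles a single team's responders must know. The sketch below does that for two toy landscapes. The `Service` fields, service names, and tool names are hypothetical; the source gives only the 5-team figure, not service counts or tooling details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    team: str              # on-call team covering the service
    monitoring: str        # hypothetical label for the monitoring setup
    runbook_template: str  # hypothetical label for the runbook convention


def distinct_ops_profiles(services: list[Service], team: str) -> int:
    """Count the distinct (monitoring, runbook) combinations one team must know."""
    return len({(s.monitoring, s.runbook_template) for s in services if s.team == team})


# Toy monolith era: few services, all operated the same way.
monolith_era = [
    Service("shop", "team-a", "nagios", "standard"),
    Service("backoffice", "team-a", "nagios", "standard"),
]

# Toy microservice era: more services, each with its own conventions.
microservice_era = [
    Service("cart", "team-a", "prometheus", "wiki-page"),
    Service("search", "team-a", "vendor-apm", "none"),
    Service("payments", "team-a", "prometheus", "pdf"),
    Service("catalog", "team-a", "custom-scripts", "wiki-page"),
]

print(distinct_ops_profiles(monolith_era, "team-a"))      # 1
print(distinct_ops_profiles(microservice_era, "team-a"))  # 4
```

In this toy model the profile count grows even when the number of incidents stays the same, which matches the claim that knowledge, not workload, became the limit.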
The break point forces a structural choice:
- Grow rotation breadth — more services per team → quickly hits the cognitive-load ceiling.
- Shift to team-owned on-call — each delivery team on-call for its own services → solves cognitive load but multiplies rotations. See concepts/you-build-it-you-run-it.
- Stand up an SRE function to own the primitives — shared observability, standardised alerting, runbook templates → lowers cognitive load per incident. See concepts/sre-organizational-evolution.
These options are not mutually exclusive; organisations often end up combining them over time.
Failure modes
- Rotation too small → burnout, single-point-of-failure, alert-fatigue-by-overload.
- Rotation too large → individual responders lose familiarity with the stack.
- Mixed-surface rotation → on-call covers services the responder has never touched; MTTR grows. This is the problem Zalando ran into once the stack moved to microservices.
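The three failure modes above can be sketched as a simple check over a rotation definition. The `Rotation` structure, the `failure_modes` helper, and the `min_size` and `max_size` thresholds are assumptions made for illustration. The source gives no numeric limits, so tune them to your own context.

```python
from dataclasses import dataclass, field


@dataclass
class Rotation:
    name: str
    engineers: list[str]
    services: list[str]
    # engineer -> services that engineer has actually worked on (hypothetical data)
    familiarity: dict[str, set[str]] = field(default_factory=dict)


def failure_modes(rotation: Rotation, min_size: int = 4, max_size: int = 12) -> list[str]:
    """Return the failure modes a rotation appears to exhibit.

    min_size and max_size are illustrative thresholds, not values from the source.
    """
    findings = []
    size = len(rotation.engineers)
    if size < min_size:
        findings.append("rotation too small")
    if size > max_size:
        findings.append("rotation too large")
    # Mixed surface: at least one responder covers a service they have never touched.
    for engineer in rotation.engineers:
        known = rotation.familiarity.get(engineer, set())
        if any(service not in known for service in rotation.services):
            findings.append("mixed-surface rotation")
            break
    return findings


example = Rotation(
    name="checkout-oncall",
    engineers=["ana", "ben"],
    services=["checkout", "search"],
    familiarity={"ana": {"checkout", "search"}, "ben": {"checkout"}},
)
print(failure_modes(example))  # ['rotation too small', 'mixed-surface rotation']
```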
Seen in
- sources/2021-09-12-zalando-tracing-sres-journey-in-zalando-part-i — 5-teams-pre-cloud data point; cognitive-load break point in the microservice migration.