Skip to content

CONCEPT Cited by 1 source

On-call rotation

An on-call rotation is the schedule that assigns a small set of engineers to be reachable for production incidents during a defined window — typically 24×7, rotating weekly or bi-weekly. Rotation size and membership determine sleep cost per engineer, response latency, and the breadth of system knowledge each responder must carry.

Scaling break point — monolith vs microservice era

Zalando names the canonical break point directly. Pre-cloud, 5 on-call teams covered the entire Zalando stack because services were monolithic, similar in shape, and each team had a large rotation with deep domain knowledge:

"Before we started the microservice migration, our service landscape was small enough that 5 on-call teams could cover the whole stack. Each team had a large enough rotation, and the domain was well understood by each team member. The monoliths were also quite similar in terms of monitoring and operations…" (Source: sources/2021-09-12-zalando-tracing-sres-journey-in-zalando-part-i)

When the stack became microservices, the same 5 teams could not scale. Service count grew faster than team count. Per- service standardisation was low — monitoring, alerting, and on-call runbooks differed across services — so the cognitive load for responders became the bottleneck, not the raw workload.

The break point forces a structural choice:

  • Grow rotation breadth — more services per team → quickly hits the cognitive-load ceiling.
  • Shift to team-owned on-call — each delivery team on-call for its own services → solves cognitive load but multiplies rotations. See concepts/you-build-it-you-run-it.
  • Stand up an SRE function to own the primitives — shared observability, standardised alerting, runbook templates → lowers cognitive load per incident. See concepts/sre-organizational-evolution.

Most orgs do all three over time.

Failure modes

  • Rotation too small → burnout, single-point-of-failure, alert-fatigue-by-overload.
  • Rotation too large → individual responders lose familiarity with the stack.
  • Mixed-surface rotation → on-call covers services the responder has never touched; MTTR grows. This was the Zalando pre-migration problem.

Seen in

Last updated · 476 distilled / 1,218 read