CONCEPT Cited by 1 source
SRE organizational evolution¶
SRE organizational evolution names the recurring three- phase pattern by which engineering organizations scale Site Reliability Engineering practice from a handful of on-call teams to hundreds: grassroots champions → embedded practice → dedicated department. Each phase solves the problem the prior phase couldn't, and each is load-bearing.
The three phases¶
Phase 1 — Grassroots: a volunteer SRE cohort educates the fleet¶
- Trigger: the number of on-call teams grows beyond what a single central reliability team can support. Reliability knowledge is thinly distributed; individual teams reinvent timeout / retry / overload-handling policy poorly.
- Shape: ~5–15 SRE-passionate engineers (not a formal team) perform production-readiness reviews across critical services, run workshops on reliability patterns, and identify "clusters of applications that required adjustments, so that the platform is stable in case of various failure types (e.g. failures of dependencies, overload, timeouts)" (sources/2020-10-07-zalando-how-zalando-prepares-for-cyber-week).
- Limitation: doesn't scale with the fleet. Volunteers burn out; cross-cutting observability infrastructure is nobody's day job.
Phase 2 — Embedded practice: shared primitives deployed fleet-wide¶
- Trigger: Phase 1 reveals that every on-call team needs the same observability primitive (distributed tracing, standardised logs, baseline metrics) and reinventing per team is wasteful.
- Shape: standardise on cross-cutting primitives — OpenTracing / OpenTelemetry instrumentation, shared dashboards, alert conventions — and roll them out tier-gated: first to tier-1 hot-path services, then tier-2, etc. Zalando's Cyber-Week-scoped OpenTracing rollout is the canonical example.
- Limitation: who owns the primitive long-term? Who runs the guild? Grassroots champions still can't hire or prioritise with authority.
Phase 3 — Dedicated department: SRE as an org function¶
- Trigger: the cross-cutting primitives and practices from Phase 2 need a owner with charter and budget. Hiring into the practice needs a job family.
- Shape: formal SRE department owning reliability engineering, observability, monitoring/logging/tracing infrastructure, and enablement (trainings, guild, best-practice formulation). Cross-team knowledge exchange runs as structured guilds rather than ad-hoc mentorship. Novel ops primitives become feasible here — e.g., adaptive paging as a department-owned platform capability.
- What it's not: not the team that holds all on-call. Service teams still run their own services. The department is a platform + enablement function.
Why the evolution tends to be unidirectional¶
Each phase requires the prior phase as a prerequisite:
- You can't land a dedicated SRE dept charter without demonstrated value from grassroots efforts + embedded practice — executives need proof the discipline matters.
- You can't run useful fleet-wide PRR without a cohort of people who've done it ad-hoc enough to know what the review should cover.
- Skipping to Phase 3 with an external hire tends to create an "ivory-tower SRE" pattern — the department ships policy; no service team adopts it because the social capital and context aren't there.
Forcing functions¶
Recurring high-stakes events (Cyber Week, annual gaming / streaming peaks, financial year-end) are the typical forcing function that drives phase transitions. "Thanks to the high priority of the Cyber Week preparations, every year we are able to invest in a key theme that helps us build up new capabilities that we did not have before." See patterns/annual-peak-event-as-capability-forcing-function.
Data points¶
- Zalando (2014 → 2020): ~6 on-call teams → ~100 on-call teams; grassroots (Phase 1) → OpenTracing rollout (Phase 2) → formal SRE department with guild + observability infra ownership + adaptive-paging platform (Phase 3) (sources/2020-10-07-zalando-how-zalando-prepares-for-cyber-week).
Seen in¶
- sources/2020-10-07-zalando-how-zalando-prepares-for-cyber-week — canonical three-phase evolution narrated directly.