CONCEPT Cited by 1 source

SRE organizational evolution

SRE organizational evolution names the recurring three-phase pattern by which engineering organizations scale Site Reliability Engineering practice from a handful of on-call teams to hundreds: grassroots champions → embedded practice → dedicated department. Each phase solves the problem the prior phase couldn't, and each is load-bearing.

The three phases

Phase 1 — Grassroots: a volunteer SRE cohort educates the fleet

  • Trigger: the number of on-call teams grows beyond what a single central reliability team can support. Reliability knowledge is thinly distributed; individual teams reinvent timeout / retry / overload-handling policy poorly.
  • Shape: ~5–15 SRE-passionate engineers (not a formal team) perform production-readiness reviews across critical services, run workshops on reliability patterns, and identify "clusters of applications that required adjustments, so that the platform is stable in case of various failure types (e.g. failures of dependencies, overload, timeouts)" ().
  • Limitation: doesn't scale with the fleet. Volunteers burn out; cross-cutting observability infrastructure is nobody's day job.

Phase 2 — Embedded practice: shared primitives deployed fleet-wide

  • Trigger: Phase 1 reveals that every on-call team needs the same observability primitives (distributed tracing, standardised logs, baseline metrics) and that reinventing them per team is wasteful.
  • Shape: standardise on cross-cutting primitives — OpenTracing / OpenTelemetry instrumentation, shared dashboards, alert conventions — and roll them out tier-gated: first to tier-1 hot-path services, then tier-2, etc. Zalando's Cyber-Week-scoped OpenTracing rollout is the canonical example.
  • Limitation: who owns the primitive long-term? Who runs the guild? Grassroots champions still can't hire or prioritise with authority.
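The tier-gated rollout described in Phase 2 can be sketched as a simple wave planner. This is a minimal illustration, not Zalando's tooling: the `Service` record, the `rollout_waves` helper, and the example fleet are all hypothetical, and the only assumption carried over from the text is that tier-1 services are instrumented before tier-2, and so on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical service record; field names are illustrative only."""
    name: str
    tier: int                  # 1 = hot path, larger numbers = less critical
    instrumented: bool = False


def rollout_waves(services):
    """Group not-yet-instrumented services into waves, most critical tier first."""
    waves = {}
    for svc in services:
        if not svc.instrumented:
            waves.setdefault(svc.tier, []).append(svc.name)
    # Sort by tier number so tier-1 is handled before tier-2, etc.
    return [(tier, sorted(names)) for tier, names in sorted(waves.items())]


# Made-up fleet for illustration.
fleet = [
    Service("checkout", tier=1),
    Service("catalog", tier=1, instrumented=True),
    Service("wishlist", tier=2),
    Service("cart", tier=1),
    Service("newsletter", tier=3),
]

for tier, names in rollout_waves(fleet):
    print(f"wave tier-{tier}: {', '.join(names)}")
```

Gating by tier keeps the early rollout effort on the services whose failures matter most during a peak event, which is why a deadline such as Cyber Week pairs naturally with this ordering.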

Phase 3 — Dedicated department: SRE as an org function

  • Trigger: the cross-cutting primitives and practices from Phase 2 need an owner with charter and budget. Hiring into the practice needs a job family.
  • Shape: formal SRE department owning reliability engineering, observability, monitoring/logging/tracing infrastructure, and enablement (trainings, guild, best-practice formulation). Cross-team knowledge exchange runs as structured guilds rather than ad-hoc mentorship. Novel ops primitives become feasible here — e.g., adaptive paging as a department-owned platform capability.
  • What it's not: not the team that holds all on-call. Service teams still run their own services. The department is a platform + enablement function.

Why the evolution tends to be unidirectional

Each phase requires the prior phase as a prerequisite:

  • You can't land a dedicated SRE department charter without demonstrated value from grassroots efforts + embedded practice — executives need proof the discipline matters.
  • You can't run useful fleet-wide production-readiness reviews (PRR) without a cohort of people who've done them ad hoc often enough to know what the review should cover.
  • Skipping to Phase 3 with an external hire tends to create an "ivory-tower SRE" pattern — the department ships policy; no service team adopts it because the social capital and context aren't there.

Forcing functions

Recurring high-stakes events (Cyber Week, annual gaming / streaming peaks, financial year-end) are the typical forcing function that drives phase transitions. "Thanks to the high priority of the Cyber Week preparations, every year we are able to invest in a key theme that helps us build up new capabilities that we did not have before." See patterns/annual-peak-event-as-capability-forcing-function.

Data points

  • Zalando (2014 → 2020): ~6 on-call teams → ~100 on-call teams; grassroots (Phase 1) → OpenTracing rollout (Phase 2) → formal SRE department with guild + observability infra ownership + adaptive-paging platform (Phase 3) ().
  • Zalando (2018 → 2019): Phase-2 → Phase-3 transition narrated directly. In Q1 2018, two sister two-engineer SRE teams bootstrap (SRE Enablement in DF + Digital Experience SRE in DX), coordinated via a 6-person SRE Program (1 Lead + 1 PM + 4 Engineers). In 2019 they merge into a unified 7-person DF team under the patterns/unified-sre-team-over-federated pattern; the driver was inconsistent guidance when internal customers couldn't tell the two teams apart ().
  • Zalando (2019 → 2020): Phase-3 completion via reorg — the late-2019 Central Functions reorg folds the SRE Enablement team together with the monitoring services / infrastructure teams and Incident Management into a single SRE department. The 2020 SRE Strategy is published, anchored on Observability as the standardisation target (language-specific SDKs per Tech Radar); four process/product moves follow: a new incident process separating anomalies from incidents, MWMBR-derived alert rules in Adaptive Paging, an Error-Budget-aware Service Level Management tool, and the SRE Curriculum in onboarding. It also spawns Zalando's first embedded SRE team, for Checkout, via customer pull. The reorg — not the grassroots movement — was the Phase-3 trigger (sources/2021-10-14-zalando-tracing-sres-journey-part-iii).
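For readers unfamiliar with the MWMBR alert rules mentioned above, the sketch below shows the general multi-window, multi-burn-rate idea: page only when both a long and a short window are burning error budget faster than a threshold. It is a generic illustration, not Zalando's Adaptive Paging rules. The window pairs and thresholds (14.4 for 1h/5m, 6 for 6h/30m) are the commonly cited defaults for a 30-day SLO period and are assumptions here, as are the function names and the sample error ratios.

```python
# Assumed (long window, short window, burn-rate threshold) pairs.
# 14.4 = spending 2% of a 30-day budget in 1 hour; 6 = spending 5% in 6 hours.
WINDOW_PAIRS = [("1h", "5m", 14.4), ("6h", "30m", 6.0)]


def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target          # allowed error ratio, e.g. 0.001 for 99.9%
    return error_ratio / budget


def firing_rules(error_ratios, slo_target):
    """Return the window pairs whose long AND short windows both exceed the threshold."""
    fired = []
    for long_w, short_w, threshold in WINDOW_PAIRS:
        long_hot = burn_rate(error_ratios[long_w], slo_target) >= threshold
        short_hot = burn_rate(error_ratios[short_w], slo_target) >= threshold
        if long_hot and short_hot:
            fired.append((long_w, short_w))
    return fired


# Made-up error ratios per window for a 99.9% SLO.
ongoing = {"1h": 0.02, "5m": 0.03, "6h": 0.004, "30m": 0.02}
recovered = {"1h": 0.02, "5m": 0.0005, "6h": 0.004, "30m": 0.02}

print(firing_rules(ongoing, 0.999))
print(firing_rules(recovered, 0.999))
```

The short window is what lets an alert stop firing soon after recovery: in the `recovered` sample the one-hour window is still elevated, but the five-minute window has dropped, so no rule fires.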

Counterexample — Phase 1 can fail and rewind

Zalando's first SRE attempt, in 2016 (), shows that the three-phase model is the usual trajectory, not the only one. A grassroots coalition pitched SRE, secured management buy-in, chose a model of one SRE team per Product Cluster, ran the SRE-interest survey (software engineering + operational mindset + systems engineering + software architecture + troubleshooting), rolled out SLOs / SLIs, built an in-house SLO Reporting tool (SLR), and delivered Reliability Workshops on retries / circuit breakers / fallbacks. The attempt stalled — SLOs never became a product-management primitive, and instead of ramping up SRE teams, senior management pivoted to "you build it, you run it" (every delivery team on-call for its own critical services). SRE as an organisational function had to re-emerge later under the new constraint (Part II). Lesson: Phase 1 can produce durable artefacts (SLR, workshops, ownership culture) even when the program itself rewinds.

Seen in

  • — canonical three-phase evolution narrated directly.
  • — Phase-2 → Phase-3 transition detailed: sister teams → SRE Program → unified team. Canonicalises both the SRE Program as a Phase-2 bridging structure and unified-over-federated as the Phase-3 landing pattern.
  • negative Phase-1 data point. Documents the 2016 grassroots attempt that stalled; "you build it, you run it" adoption was a consequence of the SRE rollout stalling, not an ideological choice.
  • sources/2021-10-14-zalando-tracing-sres-journey-part-iii: Phase-3 completion via top-down reorg. Monitoring teams + Incident Management + SRE Enablement fold into one SRE department (late 2019); 2020 SRE Strategy + Embedded SRE team for Checkout + SRE Curriculum. Notable qualifier: Phase 3 arrived via reorg, not organic grassroots continuation.