
PATTERN Cited by 2 sources

Grassroots SRE rollout

Grassroots SRE rollout is the bottom-up pattern for introducing Site Reliability Engineering in an organisation: a small coalition of SRE-interested engineers pitches the discipline to management, wins a charter, and drives initial adoption without a top-down mandate, a pre-existing SRE department, or an outside hire.

Shape

  1. Coalition forms — a handful of engineers who have read the Google SRE book (or equivalent) recognise the org's pain matches the book's problem statements.
  2. Pitch management — present the pain points (on-call overload, inconsistent reliability, no SLOs) and SRE as a solution. Emphasise concrete wins (e.g. standardised on-call tooling) over ideology.
  3. Structural debate — decide the SRE team shape. Options typically include a central team, per-team embeds, or a per-product-cluster team (Zalando's choice).
  4. Baseline primitives rollout — SLOs, SLIs, an SLO reporting tool, and workshops on reliability patterns (retries, circuit breakers, fallbacks) applied to critical services first.
  5. Forcing function — pair the rollout with a visible deadline (peak event, annual planning cycle) to drive adoption urgency. In the Zalando case, the deadline was Cyber Week preparation.
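
The SLI/SLO primitives in step 4 can be illustrated with a minimal sketch of what an SLO report computes. This is not Zalando's tool. The names `SloReport` and `build_report`, the good/total event counts, and the error-budget figure (a standard SRE concept that the sources above do not describe) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    """Hypothetical report row for one service over one reporting window."""
    service: str
    sli: float               # measured fraction of good events
    slo_target: float        # target fraction, e.g. 0.999
    budget_remaining: float  # share of the error budget still unspent

    @property
    def met(self) -> bool:
        # The SLO is met when the measured SLI reaches the target.
        return self.sli >= self.slo_target


def build_report(service: str, good: int, total: int, slo_target: float) -> SloReport:
    """Compare a ratio SLI (good / total events) against an SLO target."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good / total
    # Error budget: the number of bad events the target tolerates in this window.
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    # Goes negative once the budget is overspent.
    budget_remaining = 1.0 - actual_bad / allowed_bad
    return SloReport(service, sli, slo_target, budget_remaining)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any source.
    row = build_report("checkout", good=99_950, total=100_000, slo_target=0.999)
    print(f"{row.service}: SLI={row.sli:.4%} met={row.met} "
          f"budget left={row.budget_remaining:.0%}")
```

With the illustrative numbers above, 50 bad events against roughly 100 allowed leaves about half of the budget unspent. A single figure of this kind is easier to bring to a planning discussion than raw error counts.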

When it works

  • Culture is open to change, so the coalition can pitch upward without friction. Zalando explicitly names this as the enabler: "Zalando is a company that does not shy away from change. It's a core part of the company's DNA." (Source: sources/2021-09-12-zalando-tracing-sres-journey-in-zalando-part-i)
  • Pain is acute enough that engineering managers will sponsor upward without mandating.
  • The coalition has credibility — engineers with incident scar tissue, not newcomers.

When it fails

  • Product management stays uninvolved. SLOs get defined by engineers and ignored by PMs. Service-level targets don't influence roadmap decisions, so the discipline doesn't change behaviour. Zalando's 2016 attempt failed on exactly this axis.
  • No owner for cross-cutting primitives. Without a chartered team, observability infrastructure, SLO tooling, and training materials live in nobody's day job and decay.
  • Coalition burnout. Volunteers run the program alongside their regular work. Attrition from the coalition rolls back the rollout.

Known failure case

Zalando 2016 → 2017 — coalition formed, management bought in, SRE structure debated, SLOs rolled out, Reliability Workshops held. Attempt stalled because SLOs never became a PM primitive and senior management preferred team-owned on-call. Resolution was a pivot to concepts/you-build-it-you-run-it; SRE re-emerged differently in Parts II & III of the retrospective. (Source: sources/2021-09-12-zalando-tracing-sres-journey-in-zalando-part-i)

Known success case

Zalando 2017 → 2020 (second attempt, retrospected in the 2020 Cyber Week post) — after the 2016 attempt stalled and the ownership model shifted, grassroots production-readiness reviews and OpenTracing rollout seeded Phase 2 of the three-phase evolution. The pattern can fail once and still seed a later success. (Source: sources/2020-10-07-zalando-how-zalando-prepares-for-cyber-week)
