CONCEPT Cited by 1 source
SRE program
An SRE Program is a time-bounded coordination structure that unites multiple SRE teams in different reporting chains under a shared set of goals, a shared roadmap, and a shared meeting cadence, without requiring a formal org merger. It addresses the coordination problem that arises when SRE work is spread across departments but no single reporting chain has authority over all of it.
Definition
Zalando's 2018 instantiation:
"Both SRE teams had very limited resources (only 2 engineers each), and they obviously shared the same goals. To better align the efforts of both teams, an SRE Program is kicked-off that unites them around common goals. [...] The SRE Program was not aiming at significant organizational changes. This gave some degree of freedom regarding the projects the Program would tackle." — sources/2021-09-20-zalando-tracing-sres-journey-part-ii
Three defining properties:
- Shared goals + backlog — the teams work from one project list even though they belong to different departments.
- Lightweight authority structure — typically a Program Manager + a Program Lead (Zalando: 1 Lead + 1 PM + 4 Engineers = 6 people), not a formal re-org.
- Time-bounded charter — the Zalando SRE Program ran for 2018 and ended after Cyber Week; by design, not permanent. Teams revert to their departmental backlogs after the program year.
Why programs are useful
- Small SRE teams in separate reporting chains. Two teams of 2 engineers each cannot individually cover observability, incident response, production readiness reviews (PRR), and Cyber Week prep for an enterprise. Pooling all 4 engineers under a shared backlog increases their effective capacity.
- Shared goals, different scopes. One team's scope is company-wide (DF = Digital Foundation in Zalando); the other's is a single department (DX = Digital Experience). A program provides a unifying backlog without forcing the company-wide team to also own the DX-only problems or vice versa.
- Organisational friction avoidance. Merging two teams across reporting chains is expensive; a program gets most of the benefit without the political cost.
- Forcing-function compatibility. Programs align well with annual peak events (Cyber Week, Black Friday). The program ends when the event ends; the teams know they have a finite horizon and pick initiatives that fit.
When programs fail
Zalando's 2018 program ran for a year and hit three structural limits that drove the 2019 merger into a single team:
- Inconsistent guidance to teams seeking help. "Teams in Zalando would seek out guidance from SRE, not knowing which team to reach out to, or even that there were 2 separate teams." DX could ship department-specific answers; DF had to ship answers that generalised across the company. For the same question, the two teams would sometimes arrive at different answers.
- Different priorities from different reporting chains. Program alignment doesn't override each team's departmental leadership. When the chains disagree, the program stalls.
- Returning to departmental backlogs after the program year. "Following that plan, after Cyber Week was over the program ended and each team went back to work on projects relevant to their respective departments." If the cross-cutting capability requires continuing investment, the program's end date becomes a de facto investment cut-off.
The transition Zalando made was to merge into a single unified team; see [[patterns/unified-sre-team-over-federated]] for the explicit pattern name.
Program vs merged team vs embedded SRE
| Structure | Best for | Failure mode |
|---|---|---|
| Federated teams + program | Short-term coordination across org boundaries | Inconsistent guidance, priority drift when chains diverge |
| Single unified team | Stable long-term capability with one voice | Department-local needs neglected if the team's charter is company-wide |
| Embedded SRE per team | Deep domain knowledge, team-specific reliability | Cross-cutting standards, shared tooling neglected |
Organisations tend to rotate through these structures over time — the program is a transitional structure.
Phase in SRE organizational evolution
In the concepts/sre-organizational-evolution model, an SRE Program is a Phase-2 bridging mechanism — it lets the org deliver embedded-practice rollouts (shared primitives deployed fleet-wide, e.g. OpenTracing) before the Phase-3 dedicated-department transition completes. Zalando's 2018 Program → 2019 merger trajectory is the canonical example.
Implementation shape (Zalando)
- 6 people total: 1 Lead, 1 Program Manager, 4 Engineers.
- Explicit backlog curation: "when we were done, the size of the list [of SRE-relevant topics] was considerable [...] we had to drop many of the topics we wanted to work on."
- Three-dimension prioritisation: (1) likelihood of success, (2) company priorities, (3) enablement value (what others can learn from the engagement, versus SRE simply doing the work itself).
- 2018 initiatives: Distributed Tracing rollout, Page Load Time, Incident Commander role staffing, Cyber Week Load Tests, efficiency / cost-savings projects.
- 2019 handover: teams merge into a unified DF SRE team of 7; program effectively becomes the team.
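The backlog curation and three-dimension prioritisation above can be sketched as a simple scoring pass. This is a minimal illustration, not Zalando's method: the source does not say the team used numeric scores, so the 1 to 5 scale, the equal weights, every score value, and the "Hypothetical low-priority topic" entry are all assumptions made for the example. Only the three dimension names and the initiative names come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    name: str
    # Assumed scale for all three dimensions: 1 (low) to 5 (high).
    likelihood_of_success: int
    company_priority: int
    enablement_value: int


def score(topic: Topic, weights: tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum over the three dimensions (equal weights are an assumption)."""
    w_success, w_priority, w_enablement = weights
    return (
        w_success * topic.likelihood_of_success
        + w_priority * topic.company_priority
        + w_enablement * topic.enablement_value
    )


def curate(topics: list[Topic], capacity: int) -> tuple[list[Topic], list[Topic]]:
    """Rank topics by score and split them into (kept, dropped) at the capacity limit."""
    ranked = sorted(topics, key=score, reverse=True)
    return ranked[:capacity], ranked[capacity:]


# Score values below are invented purely for illustration.
BACKLOG = [
    Topic("Page Load Time", 3, 4, 3),
    Topic("Hypothetical low-priority topic", 2, 2, 2),
    Topic("Distributed Tracing rollout", 4, 5, 5),
    Topic("Cyber Week Load Tests", 5, 5, 3),
]

kept, dropped = curate(BACKLOG, capacity=3)

if __name__ == "__main__":
    print("kept:   ", [t.name for t in kept])
    print("dropped:", [t.name for t in dropped])
```

A weighted sum is only one way to combine the dimensions. A program could equally treat company priority as a hard filter and rank the remainder on the other two, which is why `weights` is left as a parameter here.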
Seen in
- sources/2021-09-20-zalando-tracing-sres-journey-part-ii — 2018 SRE Program covering DX SRE + SRE Enablement with a Lead + PM + 4 engineers.
Related
- concepts/sre-organizational-evolution — SRE Program as a Phase-2 structure.
- patterns/unified-sre-team-over-federated — the successor pattern when programs stop working.
- concepts/production-readiness-review
- concepts/observability