SRE curriculum
An SRE curriculum is a structured, async-consumable body of training material on reliability practices — typically video + quiz per topic, reviewed by subject-matter experts — that scales enablement to hundreds of engineering teams in a way that ad-hoc in-person workshops cannot.
Definition
Zalando named and launched an SRE Curriculum in late 2020:
"Late in 2020 we began developing what we called the SRE Curriculum. This was an initiative that aimed at scaling the educational benefits of SRE. [...] the deliverables of this new format were a video and a quiz for each topic, with the content of each training being created and reviewed by subject matter experts to ensure a common understanding and a high quality training." — sources/2021-10-14-zalando-tracing-sres-journey-part-iii
Core properties:
- Video + quiz per topic. Not a single course; a library of self-contained modules.
- SME-reviewed content. Subject-matter experts (usually the SRE Enablement engineers who built the underlying primitive) review the material to prevent drift from the canonical practice.
- Folded into onboarding. "by having those training sessions part of the onboarding process, any engineer joining Zalando would get an introduction to some of the SRE practices we were rolling out."
- Async and self-paced. "consumed by anyone in the company at any given time and different pace." Doesn't require an instructor.
Why it replaces in-person workshops
Zalando's pre-2020 pattern was ad-hoc in-person sessions on incident response, distributed tracing, and alerting strategies — requested by individual teams, delivered by SRE Enablement engineers. Two problems at scale:
- Doesn't cover new hires. An engineer hired six months after the last workshop never gets the training.
- Doesn't scale with fleet growth. With hundreds of teams each requesting repeated topic workshops, SRE Enablement engineers spend their time delivering training rather than building the platform.
The pandemic forced a remote format and accelerated the pivot ("we did try to do some via video conference, but it did not have quite the same result"), producing the async format as the structural answer.
Shape of the curriculum
Three topics explicitly named in the Zalando post:
- Incident response: how Zalando's incident process works (including the separation of anomalies from incidents), the Incident Commander role, and postmortem expectations.
- Distributed tracing: how OpenTracing instrumentation works, what the Zalando semantic conventions are, and how to use trace data to debug.
- Alerting strategies: concepts/symptom-based-alerting vs cause-based alerting, SLO-driven alert rules (concepts/multi-window-multi-burn-rate), and concepts/adaptive-paging behavior.
The three topics correspond to the three pillars of the 2020 SRE Strategy: Observability (tracing), Alerting (symptom-based alerting plus multi-window multi-burn-rate rules), and Incident Management (process). Each pillar is rendered as an onboardable module.
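To give a sense of what the SLO-driven alert rules in the alerting topic involve, here is a minimal Python sketch of a multi-window multi-burn-rate check. The Zalando post does not publish its thresholds or windows; the 14.4 threshold and the 1h/5m window pair are the commonly cited values from Google's SRE Workbook for a 30-day SLO period, and the names `BurnRateWindow`, `burn_rate`, and `should_alert` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateWindow:
    """A long/short window pair and its burn-rate threshold (illustrative values)."""
    long_window: str
    short_window: str
    threshold: float


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors are being spent.

    A burn rate of 1.0 consumes the whole error budget exactly at the end
    of the SLO period.
    """
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / error_budget


def should_alert(long_ratio: float, short_ratio: float,
                 slo_target: float, window: BurnRateWindow) -> bool:
    """Fire only if both windows exceed the threshold.

    The long window shows that the budget spend is significant; the short
    window shows that the problem is still happening right now.
    """
    return (burn_rate(long_ratio, slo_target) >= window.threshold
            and burn_rate(short_ratio, slo_target) >= window.threshold)


# Assumption: 14.4 means 2% of a 30-day budget spent in one hour (0.02 * 720 h).
PAGE = BurnRateWindow(long_window="1h", short_window="5m", threshold=14.4)

# 2% errors in both windows against a 99.9% SLO: still burning, so alert.
still_burning = should_alert(0.02, 0.02, slo_target=0.999, window=PAGE)

# The 1h window is still hot, but the last 5m have recovered: no alert.
recovered = should_alert(0.02, 0.0005, slo_target=0.999, window=PAGE)
```

Requiring both windows to exceed the threshold is what keeps the alert from continuing to fire long after an incident has been mitigated, since the short window drops below the threshold quickly once errors stop.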
Why a Tech Academy partnership matters
Zalando partnered with the company's Tech Academy to produce the curriculum. The value of that partnership:
- Production quality. Tech Academy owns recording infrastructure (Zalando's post includes a photo of "the studio where we recorded some of the training sessions"). SRE Enablement engineers are not videographers.
- Curriculum governance. Tech Academy handles the course catalog, enrollment tracking, and onboarding integration. SRE can ship content without owning delivery.
- Cross-function pattern. The Tech Academy format is shared with other internal curricula, so an engineer who has already taken other Tech Academy courses recognises the format immediately.
Relationship to SRE organizational evolution
An SRE curriculum is a Phase 3 artefact in concepts/sre-organizational-evolution:
- Phase 1 (grassroots): ad-hoc workshops, on request.
- Phase 2 (shared primitives): more structured workshops tied to annual events like Cyber Week.
- Phase 3 (dedicated department): async curriculum folded into onboarding. This only becomes feasible once the department owns enablement as a durable function and can partner with company-wide learning infrastructure.
Caveats
- Content drift. Video content decays when the underlying practice evolves. Zalando's post doesn't discuss refresh cadence; it names SME review at creation but says nothing about ongoing review.
- Doesn't replace hands-on. The curriculum introduces practices; deep competence still requires real on-call participation and incident experience.
- Onboarding inclusion is political. A curriculum that doesn't make it into the core onboarding track ends up consumed only by motivated engineers. Zalando succeeded on this axis (named as an explicit outcome) but the post doesn't name what organisational negotiation was needed.
- Async format is a fit for introductory content. Zalando names "an introduction to some of the SRE practices" — not deep competence. Deep material (e.g. burn-rate calculation math, alert-rule tuning) may need interactive formats.
Seen in
- sources/2021-10-14-zalando-tracing-sres-journey-part-iii — Zalando's 2020 SRE Curriculum, co-produced with the Tech Academy; video+quiz format; three topics (incident response, distributed tracing, alerting strategies); folded into engineer onboarding.