CONCEPT Cited by 1 source
Production readiness review¶
A Production Readiness Review (PRR) is a structured audit of a service's reliability posture, named in the Google SRE book and performed before the service takes on a new reliability risk: for example, before onboarding traffic, before being declared critical, or (in Zalando's framing) before the annual peak event. The canonical reference is the Google SRE book's chapter The Evolving SRE Engagement Model, which describes PRR as "the most typical initial step [for SRE engagement with] a service operating in production."
What it reviews¶
A PRR is not a code review; it's a reliability-properties review. Typical coverage:
- Dependency-failure behavior — what happens when each upstream / downstream returns errors / times out / is slow? Are timeouts set? Are retries capped? Is there circuit breaking?
- Overload behavior — what happens as load ramps past capacity? Does the service shed, queue unboundedly, or cascade?
- Capacity & scaling — is the service horizontally scalable? What's the scaling unit? What are the known bottlenecks?
- Observability — are metrics / logs / traces in place? Are alerts routed to an owner? Are dashboards actionable?
- Runbooks & on-call — who owns this on-call? Is there a runbook for common incidents?
- Rollback / blast radius — is there a safe rollback path? What's the blast radius of a bad deploy?
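The dependency-failure questions above (are retries capped, is there circuit breaking) can be made concrete with a small sketch. This is a minimal illustration, not taken from the SRE book or from Zalando: the names `CircuitBreaker`, `call_with_retries`, and all thresholds are hypothetical, and the per-call timeout is assumed to live inside the wrapped function (for example, a `timeout=` argument on the HTTP client call).

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching the dependency."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then fails fast
    until `reset_after` seconds have passed. Thresholds are illustrative."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def call_with_retries(fn, breaker, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry `fn` at most `max_attempts` times with exponential backoff.
    A breaker rejection is never retried, so an open circuit fails fast."""
    for attempt in range(1, max_attempts + 1):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A reviewer could use a sketch like this as a reference point: if a service's client code has no equivalent of `max_attempts` or of the fail-fast branch, the dependency-failure item in the review is likely to need remediation.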
The Zalando instantiation¶
Zalando applies PRR at fleet scale as a Cyber Week gate: "we formed a team of 10 colleagues, who were passionate about SRE and who signed up to perform production readiness reviews of our applications ahead of Cyber Week." (sources/2020-10-07-zalando-how-zalando-prepares-for-cyber-week). The post exposes the following implementation details:
- Pre-PRR workshops — the reviewers first run a series of workshops "with teams to share knowledge about reliability patterns." PRR participants walk in with a baseline understanding of what good looks like.
- Clustering — "identified clusters of applications that required adjustments." PRR at fleet scale doesn't review 1,122 applications individually; it clusters by failure mode / dependency pattern / risk profile and reviews the cluster.
- Failure-type taxonomy — reviews explicitly check for "stability in case of various failure types (e.g. failures of dependencies, overload, timeouts)." A fixed small taxonomy is what makes a fleet-scale PRR tractable.
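The clustering and taxonomy ideas above can be sketched as a grouping step followed by a fixed checklist. This is a hedged illustration only: the source does not describe Zalando's tooling, so the `App` fields, the cluster key, and the function names are all assumptions, and the three failure types are simply the examples quoted in the post.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    # The three examples quoted in the source; a real taxonomy may differ.
    DEPENDENCY_FAILURE = "failures of dependencies"
    OVERLOAD = "overload"
    TIMEOUT = "timeouts"


@dataclass(frozen=True)
class App:
    name: str
    dependency_pattern: str  # hypothetical label, e.g. "sync-http-fanout"
    risk_profile: str        # hypothetical label, e.g. "checkout-path"


def cluster_apps(apps):
    """Group application names by (dependency pattern, risk profile)."""
    clusters = defaultdict(list)
    for app in apps:
        clusters[(app.dependency_pattern, app.risk_profile)].append(app.name)
    return dict(clusters)


def review_checklist(clusters):
    """One review item per (cluster, failure type) instead of per application."""
    return [(key, failure) for key in sorted(clusters) for failure in FailureType]
```

With this shape, the number of review items grows with the number of clusters times the size of the taxonomy, not with the number of applications, which is the property that keeps a fleet-scale review tractable.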
Why Cyber Week forces it¶
A PRR is expensive (reviewer time + service-team remediation time) and chronically deprioritized when there's no deadline. Cyber Week imposes a fixed date and a visible business risk, which makes the PRR calendar defensible. See patterns/annual-peak-event-as-capability-forcing-function for the general lever.
Distinguishing adjacent practices¶
- Pre-prod security review / threat model: focuses on adversarial behavior, not operational reliability.
- Code review: focuses on correctness and maintainability.
- Design review: up-front, pre-implementation.
- Game day: tests the PRR's hypotheses in production by inducing failures. The PRR is the paper audit; the game day is the live exercise.
Prerequisites¶
- A cohort of reviewers with enough SRE context to conduct the review, typically drawn from a Phase-1 / Phase-2 SRE org (concepts/sre-organizational-evolution).
- A reliability standards document / rubric that reviewers grade against; without one, reviews drift into personal taste.
- A remediation track. A PRR that surfaces 100 issues with no path to fix them before the deadline succeeds at finding problems but fails at resolving them.
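The remediation-track prerequisite can be illustrated with a rough capacity check: given the findings a review produced, which of them fit before the deadline? This is a minimal sketch under stated assumptions; the `Finding` fields, the severity scale, and the greedy ordering are hypothetical, and capacity is counted in calendar days per engineer for simplicity.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Finding:
    app: str
    criterion: str  # rubric item the finding was graded against
    severity: int   # hypothetical scale: 1 = must fix before the event
    fix_days: int   # estimated engineer-days to remediate


def plan_remediation(findings, today, deadline, engineers=1):
    """Greedy plan: most severe (then cheapest) findings first, scheduled
    while engineer-day capacity before the deadline remains."""
    capacity = max((deadline - today).days, 0) * engineers
    scheduled, deferred = [], []
    for finding in sorted(findings, key=lambda f: (f.severity, f.fix_days)):
        if finding.fix_days <= capacity:
            scheduled.append(finding)
            capacity -= finding.fix_days
        else:
            deferred.append(finding)
    return scheduled, deferred
```

Any severity-1 finding that lands in `deferred` is an early signal that the review calendar and the remediation capacity are out of balance, which is cheaper to learn at planning time than at the deadline.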
Seen in¶
- sources/2020-10-07-zalando-how-zalando-prepares-for-cyber-week — Zalando's fleet-scale PRR as a Cyber Week pre-flight gate; 10-person grassroots reviewer cohort running workshops then cluster-level audits.
Related¶
- concepts/sre-organizational-evolution — PRR is a Phase-1 / Phase-2 deliverable.
- concepts/observability
- patterns/annual-peak-event-as-capability-forcing-function
- companies/zalando