
Weekly Operational Review

Problem

Reliability compounds only if failures are learned from across the organisation, not just fixed in the service that produced them. The default pattern — each team writes an individual postmortem, files it, moves on — lets the same failure shape recur in an adjacent service six months later. On-call engineers burn out firefighting the same category of incident.

Meanwhile small wins — a gnarly alert that was quieted, a cross-team ticket that was closed — are invisible to anyone who didn't work on them, so the organisation doesn't learn from success either.

Pattern

Institute a weekly cadence where the people closest to operations — engineers, on-calls, product managers, and leaders — step out of that week's firefighting to review the system with rigour. The agenda has two equally weighted parts:

  1. Celebrate small wins that quietly made the system safer: automation that removed a category of toil, an alert that now classifies itself, a config knob turned into a self-service lever.
  2. Dig into failures — not just what happened but how we make sure it doesn't happen again anywhere, including services that didn't experience the incident this time but share the same failure shape.

MongoDB's articulation:

"Every week, the people closest to the work — engineers, on-calls, product managers, and leaders — step out of the day's firefight to review the system with rigor. We celebrate the small wins that quietly make the system safer. We dig into failures to understand not just what happened, but how to make sure it doesn't happen again anywhere."

"The goal isn't perfection. Instead, it's building a system where every lesson learned and every fix made raises the floor for everyone. A single automation can remove a whole category of incidents. A well-written postmortem can stop the same mistake from happening across dozens of systems. The return isn't linear — it compounds."

(Source: sources/2025-09-25-mongodb-carrying-complexity-delivering-agility)

Why weekly

  • Incidents aren't monthly — they're daily-to-weekly. A monthly cadence creates a 3-to-4-week gap in which learning goes stale. Weekly keeps the memory fresh and ties follow-up actions to the same week's on-call rotation.
  • Short enough that it doesn't eat the calendar. A 60-90 minute weekly meeting with a tight template is sustainable in a way a 4-hour quarterly deep-dive is not.
  • Long enough for cross-team transfer. Beyond a single team's ops sync: the weekly operational review is explicitly cross-service so a lesson from service A reaches service B while the lesson is still live.

What compounds

MongoDB frames the return as compounding because:

  • Automation eliminates whole classes of work — one automation removes an entire category of future incidents, not just the one that triggered it.
  • Postmortems generalise — a well-written postmortem of one incident prevents the same mistake across "dozens of systems." This is the structural cousin of patterns/upstream-the-fix (fix the shared cause, not the instance).
  • Small wins are visible — celebrating a quiet automation signals to engineers that this kind of work is valued and recognised, which changes what gets prioritised.

Implementation guidance

From other operations-culture practices on this wiki, the lightweight realisation looks like:

  • Single hour per week, same time, whole team + leadership present. Recurring standing meeting, not a retro.
  • Agenda template. 1) Key metrics (SLOs, error budgets, alert load). 2) Wins worth celebrating (one-line each). 3) Failures worth digging into (2-3 per week max; don't try to cover everything). 4) Cross-team action items with owners + deadlines.
  • Pre-read discipline. Postmortems linked in the agenda 24 hours in advance; the review is for discussion of the generalisable lesson, not the first read.
  • Follow-through tracked. Action items from last week's review are the first item next week — no action item ages silently.
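The template-and-carryover discipline above can be sketched minimally in code. This is a hypothetical illustration, not anything MongoDB describes; all names (`ActionItem`, `ReviewAgenda`, `build_agenda`) are invented for the sketch. The key move is that unfinished action items from last week's review are seeded into this week's agenda before anything else is added, so no item ages silently:

```python
from dataclasses import dataclass, field

@dataclass
class ActionItem:
    description: str
    owner: str            # every item needs a named owner
    due: str              # ISO date, e.g. "2025-10-03"
    done: bool = False

@dataclass
class ReviewAgenda:
    week: str                                           # e.g. "2025-W40"
    wins: list[str] = field(default_factory=list)       # one-liners
    failures: list[str] = field(default_factory=list)   # 2-3 postmortem links, max
    action_items: list[ActionItem] = field(default_factory=list)

def build_agenda(week: str, last_week: "ReviewAgenda | None" = None) -> ReviewAgenda:
    """Start this week's agenda with last week's unfinished action items,
    so the check-back is the first item on the agenda, not an afterthought."""
    agenda = ReviewAgenda(week=week)
    if last_week is not None:
        agenda.action_items = [i for i in last_week.action_items if not i.done]
    return agenda
```

The point of the sketch is the closure loop, not the data model: whatever tool holds the agenda (a doc, a tracker, a spreadsheet), the open items from last week must be impossible to skip.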

Relationship to sibling patterns

  • patterns/upstream-the-fix — the implementation discipline the review drives. The review identifies the shared cause; upstream-the-fix is what the resulting action item looks like.
  • patterns/platform-engineering-investment — platform teams whose mandate is "remove categories of toil from product teams" depend on the review as the input signal for what categories to attack.
  • concepts/alert-fatigue — weekly review is one answer to the organisational side of alert fatigue: if an alert fires repeatedly, this is the forum to discuss killing or aggregating it.
  • concepts/observability — the review feeds back into what metrics / logs / traces are missing; gaps surface when you try to diagnose a failure and can't.

Failure modes

  • Becomes a status meeting. If leadership uses the time to report to leadership instead of to listen, engineers disengage. Explicit rule: 80%+ of the talk time goes to the engineers and on-calls.
  • Too many incidents to cover. If the team is producing more incidents than one hour can discuss, the review isn't scaling — and that's itself a signal that something structural (load, architecture, staffing, on-call rotation) needs attention.
  • No action-item closure loop. The review produces action items; without a concrete owner + deadline + next-week check-back, the review is theatre.
  • Success-blindness. If only failures are reviewed and never wins, the organisation trains engineers to avoid visibility — visible work becomes the work that failed. Celebrate-small-wins is not optional.
