
PATTERN Cited by 1 source

Situation room for peak event

A situation room (also called a NOC, control center, or war room) is a time-bounded observation post, colocated physically or virtually, staffed during a peak event by representatives from key engineering teams, the SRE team, and dedicated Incident Commanders. Unlike routine on-call, which reacts to pages, the situation room is observation-biased: operators watch dashboards rather than wait for alerts.

The shape

Zalando's description:

"For the key period where we expect the highest load on our systems, we organize a Situation Room to ensure rapid incident response. In the room, we gather representatives from key engineering teams, SRE team, and dedicated Incident Commanders to closely watch the operational performance of our platform. It's basically a control center with dozens of screens and graphs." (sources/2020-10-07-zalando-how-zalando-prepares-for-cyber-week)

Three defining properties:

  1. Time-bounded. Active during the peak event window only (Cyber Week, Super Bowl, launch day). Not a standing function.
  2. Representation from key teams, not just on-call. Every critical service has its owner present in the room, not just reachable through a pager rotation. The latency from detection to an authoritative decision is minutes, not hours.
  3. Observation-biased. The dashboards are watched in real time. Trends get noticed before they become alerts; decisions get made against the full picture rather than against a single firing rule.
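The gap between a firing rule and a watched trend can be illustrated with a small sketch. It is a minimal illustration, not Zalando's tooling: the function names, the 500 ms limit, and the sample latencies are all hypothetical. A threshold rule looks only at the latest sample, while a simple least-squares slope estimates how long the metric has before it reaches the limit.

```python
from statistics import mean

def threshold_alert(samples, limit):
    """Classic alert rule: fires only once the latest sample exceeds the limit."""
    return samples[-1] > limit

def slope_per_step(samples):
    """Least-squares slope of evenly spaced samples (units per step)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def steps_until_limit(samples, limit):
    """Steps until the linear trend reaches the limit; None if flat or falling."""
    slope = slope_per_step(samples)
    if slope <= 0:
        return None
    return max(0.0, (limit - samples[-1]) / slope)

# Hypothetical p99 latency in ms, one sample per minute.
p99_ms = [200, 220, 240, 260, 280]
LIMIT_MS = 500  # assumed alert threshold

if __name__ == "__main__":
    print("alert firing:", threshold_alert(p99_ms, LIMIT_MS))
    print("minutes until limit at current trend:", steps_until_limit(p99_ms, LIMIT_MS))
```

With this sample data the rule stays silent while the trend already points at the limit, which is the kind of signal a person watching the graph would pick up early. A linear extrapolation is a deliberate simplification; real traffic curves rarely stay linear for long.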

What it is not

  • Not a replacement for on-call. The normal on-call rotation still exists. The situation room is an overlay for the peak window.
  • Not primarily an incident-response room. Incident response does happen there when something goes wrong, but the room's primary function is proactive observation, not incident triage.
  • Not always physical. Zalando notes remote work forced a 2020 rethink: "This year has an added twist of remote working, which likely will require us to rethink how to organize the Situation Room efficiently." A virtual situation room (a shared video call plus shared dashboard views) can work, but the coordination has to be designed explicitly.

The roles

Typical staffing in the room:

  • Incident Commander(s) — dedicated, not doubled up on service work during the event. Their job is to coordinate response if something goes wrong: who's investigating what, who's communicating to the business, when to abort which mitigation.
  • Service representatives — one per critical service / domain. Can make authoritative decisions about their system in real time.
  • SRE team — owns the observability stack and can route between service reps when the dashboards suggest a cross-service problem.
  • Communications liaison — usually PM or exec sponsor; funnels status updates to the business without pulling engineers away from the dashboards.

The observable setup

  • Dozens of screens / dashboard panes. Zalando's 2019 physical setup explicitly used a multi-screen grid. The virtual equivalent is a structured multi-pane dashboard view with each pane owned by a different team's signals.
  • Pre-defined golden-signal dashboards. The dashboards aren't authored during the event; they're rehearsed. Typical cuts: request rate / error rate / latency p99 per service tier, orders/min as the commercial metric, saturation (CPU, memory, queue depth) for known bottlenecks, dependency health for third-party calls (payment providers, CDNs).
  • Traffic-source breakdown — see concepts/traffic-source-tagging-in-traces. During the event, App vs Web ratio is watched explicitly because its shift can indicate a client-side issue (an App release regression) that wouldn't show as a backend symptom.
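As a rough sketch of what one dashboard pane might compute, the snippet below summarizes a window of requests into the cuts listed above (request count, error rate, latency p99) plus the App share of traffic, and flags a shift in that share against a baseline window. The `Request` record, the `"app"` / `"web"` tag values, and the 10-point tolerance are assumptions for illustration, not a description of Zalando's pipeline.

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status: int   # HTTP status code
    source: str   # hypothetical traffic-source tag: "app" or "web"

def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]

def summarize(requests):
    """Golden-signal style summary for one time window."""
    total = len(requests)
    if total == 0:
        raise ValueError("empty window")
    errors = sum(1 for r in requests if r.status >= 500)
    app = sum(1 for r in requests if r.source == "app")
    return {
        "requests": total,
        "error_rate": errors / total,
        "latency_p99_ms": p99([r.latency_ms for r in requests]),
        "app_share": app / total,
    }

def app_share_shifted(baseline, current, tolerance=0.10):
    """True if the App share moved more than `tolerance` from the baseline."""
    return abs(current["app_share"] - baseline["app_share"]) > tolerance

# Hypothetical windows: the App share drops from 60% to 40%.
baseline_window = (
    [Request(100.0, 200, "app")] * 60 + [Request(120.0, 200, "web")] * 40
)
current_window = (
    [Request(100.0, 200, "app")] * 38
    + [Request(900.0, 503, "app")] * 2
    + [Request(120.0, 200, "web")] * 60
)

baseline = summarize(baseline_window)
current = summarize(current_window)

if __name__ == "__main__":
    print(current)
    print("app share shifted:", app_share_shifted(baseline, current))
```

In this example the backend error rate is only 2%, which many alert rules would tolerate, yet the App share has moved by 20 points. That is the kind of shift that suggests looking at the client side.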

Preconditions for the pattern to pay off

  • A genuine peak window. The pattern is expensive in engineer-hours (roughly 10–30 people watching screens for hours), so it is not worth running for routine peaks or for peaks that are unlikely to materialize. Cyber Week qualifies; a typical weekday does not.
  • Rehearsed dashboards. Building the observation surface during the event is too late. Zalando's production readiness review (PRR) gate (concepts/production-readiness-review) and live load tests both produce the dashboards that get watched in the room.
  • Clear abort / mitigation levers. If nothing can be done from the situation room except filing tickets, the room is theatre. Documented levers: feature-flag rollbacks, traffic shedding, cache warming, capacity re-allocation.
  • Rules of engagement. Who can decide what in real time without escalation — especially around customer-visible mitigations like feature disable or traffic shed.
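The levers and rules of engagement can be written down as data before the event, so that nobody has to debate authority mid-incident. The sketch below is one hypothetical encoding: the lever names follow the list above, but the role names and the assignment of who may decide what are illustrative assumptions, not Zalando's actual rules.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lever:
    name: str
    customer_visible: bool
    may_decide: frozenset  # roles allowed to pull the lever without escalation

# Hypothetical rules of engagement; every assignment here is an assumption.
LEVERS = {
    lever.name: lever
    for lever in [
        Lever("feature_flag_rollback", True, frozenset({"incident_commander"})),
        Lever("traffic_shedding", True, frozenset({"incident_commander"})),
        Lever("cache_warming", False,
              frozenset({"incident_commander", "service_rep", "sre"})),
        Lever("capacity_reallocation", False,
              frozenset({"incident_commander", "sre"})),
    ]
}

def decision(role, lever_name):
    """Return 'proceed', 'escalate', or 'unknown lever' for a role and lever."""
    lever = LEVERS.get(lever_name)
    if lever is None:
        return "unknown lever"
    return "proceed" if role in lever.may_decide else "escalate"

if __name__ == "__main__":
    print(decision("service_rep", "cache_warming"))
    print(decision("service_rep", "traffic_shedding"))
```

Keeping customer-visible levers behind a narrower set of roles is one possible policy; the point is only that the mapping exists and has been agreed before the peak window opens.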

Distinguishing from game days

A game day is a rehearsal exercise pre-event; the situation room runs during the real event. They're paired: the game day is where the situation room's coordination muscles are built.

Seen in
