
PATTERN

Preemptive low-severity incident for potential impact

The pattern

Declare a low-severity incident (SEV4 / SEV5) before any customer impact is observed, on the basis of elevated risk from an external event — "in preparation for the worst." The declaration creates a shared coordination channel, documentation surface, and timeline in advance of potential customer harm, so if the harm materialises, incident response is already bootstrapped.

Canonical verbatim (Source: sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage):

"At this point, it was clear that multiple GCP services were experiencing a global outage, despite not having received support tickets from our customers or being paged by Redpanda Cloud alerts. So, in preparation for the worst, we preemptively created a low-severity incident to coordinate the response to multiple potential incidents."

The decision at 19:08 UTC — 27 minutes after being notified of the GCP outage by the GCP TAM, with no customer tickets and no internal alerts — is the load-bearing instance.

Why declare preemptively

  1. Coordination surface from t=0. The incident doc, Slack channel, and commander role are ready before the first real signal arrives — no scramble.
  2. Multi-customer / multi-signal coordination. Where a single potential incident would be handled case-by-case, an N-potential-incident scenario (one per customer or region) needs a shared context to avoid duplicate investigation.
  3. Observability preserved as timeline. Post-incident timeline reconstruction is easier when incident data (chat log, actions taken, decisions made) is captured in real time rather than pieced together afterwards.
  4. Psychological primer. On-call staff shift from routine-ops mode to incident-response mode earlier, which speeds responses if customer impact does materialise.
  5. SEV4 is cheap. Low-severity incidents don't page executives or trigger external communication; the cost of opening one is near-zero.
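Taken together, these benefits amount to bootstrapping the coordination surface up front. A minimal sketch in Python, where the `Incident` record, `declare_preemptive_incident` function, and the channel/doc naming scheme are all illustrative assumptions, not any real tooling:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    severity: str
    commander: str
    declared_at: datetime
    channel: str = ""
    doc: str = ""
    timeline: list = field(default_factory=list)

def declare_preemptive_incident(commander: str, reason: str) -> Incident:
    """Open a SEV4 before any customer impact is observed."""
    now = datetime.now(timezone.utc)
    inc = Incident(severity="SEV4", commander=commander, declared_at=now)
    # The coordination surface exists from t=0, before the first real signal:
    # a shared channel, a documentation surface, and a timeline.
    inc.channel = f"#inc-{now:%Y%m%d-%H%M}"
    inc.doc = f"incident-doc-{now:%Y%m%d-%H%M}"
    inc.timeline.append((now, f"Preemptive SEV4 declared: {reason}"))
    return inc

inc = declare_preemptive_incident("on-call-primary", "GCP global outage announced")
```

If impact never materialises, the same object is simply closed at SEV4; nothing was paged and no customer comms were triggered.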

When to declare preemptively

The trigger is elevated probability of customer impact, not confirmed impact. Examples:

  • Cloud-provider global outage announcement. Your dependency tier is affected; downstream customer impact is plausible but not yet observed.
  • Third-party vendor outage of a critical-path dependency (payment gateway, identity provider, DNS).
  • Known-bad deployment in progress. A rollback is underway after smoke-test failure; wider impact is possible.
  • Observable anomaly without confirmed user harm. Error rates are up on internal metrics, but no customer tickets or alerts yet.
  • Regional infrastructure event (power outage, natural disaster, network partition) that might degrade service.
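The decision rule behind these examples can be stated compactly. A hedged sketch, where the event names and the function are assumptions for illustration, not a real API:

```python
# Risk events that justify a preemptive declaration (illustrative taxonomy).
ELEVATED_RISK_EVENTS = {
    "cloud_provider_global_outage",
    "critical_vendor_outage",         # payment gateway, identity provider, DNS
    "known_bad_deployment",
    "internal_anomaly_no_user_harm",  # error rates up, no tickets or alerts yet
    "regional_infrastructure_event",
}

def should_declare_preemptive_sev4(event: str, impact_confirmed: bool) -> bool:
    # Confirmed impact skips this pattern and goes straight to SEV3+;
    # the preemptive SEV4 covers elevated risk without observed harm.
    return event in ELEVATED_RISK_EVENTS and not impact_confirmed

should_declare_preemptive_sev4("cloud_provider_global_outage", impact_confirmed=False)  # → True
```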

Severity-level discipline

The pattern is coupled to a severity ladder where:

  • SEV4 / SEV5 = low-priority, no executive escalation, no customer comms, no paging beyond on-call engineer. Cheap to open, cheap to keep open, cheap to close.
  • SEV3 = confirmed customer impact, investigation underway.
  • SEV2 = confirmed widespread impact, customer comms started.
  • SEV1 = full outage, all-hands response.

A preemptive SEV4 can escalate to SEV3 / SEV2 if the risk materialises. Conversely, it can close at SEV4 with no action needed if the risk dissipates — as in the 2025-06-12 Redpanda instance where the incident closed at SEV4 with no customer impact.
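The ladder and its one-way escalation rule can be modelled as a small sketch (the enum and `next_severity` are illustrative; real taxonomies vary by organisation):

```python
from enum import IntEnum

class Sev(IntEnum):
    SEV1 = 1  # full outage, all-hands response
    SEV2 = 2  # confirmed widespread impact, customer comms started
    SEV3 = 3  # confirmed customer impact, investigation underway
    SEV4 = 4  # low priority: no exec escalation, no customer comms
    SEV5 = 5  # lowest priority

def next_severity(current: Sev, impact_confirmed: bool, widespread: bool) -> Sev:
    """Escalate a preemptive SEV4 only if the risk materialises."""
    if not impact_confirmed:
        return current  # risk has not materialised; stay at (or close as) SEV4
    return Sev.SEV2 if widespread else Sev.SEV3
```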

Variants

  • Watch mode — declare a preemptive SEV4 but take no action; observe, and coordinate only if escalation is needed.
  • Prepositioned mitigation — declare preemptive SEV4 and pre-stage mitigations (e.g., warm up secondary regions, prepare DNS failover) to reduce latency if escalation comes.
  • Customer-facing preemptive status — some orgs post status-page "Investigating" entries on public dashboards during preemptive SEV4, trading transparency for potential false-alarm cost.
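The three variants differ only in what happens at declaration time, which makes them easy to express as modes. A sketch under assumed names (the `declare_preemptive` function, mode strings, and `prestage` callables are all hypothetical):

```python
def declare_preemptive(mode: str, prestage=()) -> dict:
    inc = {"severity": "SEV4", "actions": [], "status_page": None}
    if mode == "watch":
        pass  # observe only; coordinate if escalation is needed
    elif mode == "prepositioned":
        # Pre-stage mitigations now to cut latency if escalation comes.
        inc["actions"] = [step() for step in prestage]
    elif mode == "customer_facing":
        # Trades transparency for potential false-alarm cost.
        inc["status_page"] = "Investigating"
    return inc

inc = declare_preemptive("prepositioned", prestage=[
    lambda: "warmed standby capacity in secondary region",
    lambda: "prepared DNS failover runbook",
])
```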

Anti-patterns

  • Declaring preemptive SEV3 or higher. Higher severities have escalation costs (pages, executive involvement) that are inappropriate for unconfirmed risk; false-alarm tolerance drops rapidly at SEV3+.
  • Failing to close preemptive SEVs. If the risk dissipates and the SEV stays open, alert fatigue sets in and the pattern loses operational discipline.
  • No incident command structure at SEV4. If SEV4 doesn't name a commander, the coordination value of the pattern is lost.
  • No post-incident review for closed preemptive SEVs. Even if impact never materialised, the data from the near-miss (which was the risk, which controls worked, which didn't) is load-bearing for future calibration.
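Three of these anti-patterns are mechanically checkable. An illustrative hygiene check (the record fields and the four-hour staleness threshold are assumptions to be tuned per org):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=4)  # assumed threshold

def preemptive_sev_issues(inc: dict, now: datetime) -> list[str]:
    """Flag anti-patterns on a preemptive SEV record."""
    issues = []
    if not inc.get("commander"):
        issues.append("no incident commander named")
    if (not inc.get("closed") and not inc.get("impact_confirmed")
            and now - inc["declared_at"] > STALE_AFTER):
        issues.append("stale preemptive SEV: close it or escalate")
    if inc.get("closed") and not inc.get("review_done"):
        issues.append("closed without a post-incident review")
    return issues

now = datetime(2025, 6, 12, 23, 30, tzinfo=timezone.utc)
stale = {"declared_at": datetime(2025, 6, 12, 19, 8, tzinfo=timezone.utc)}
issues = preemptive_sev_issues(stale, now)
```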

Redpanda timeline context

The 19:08 UTC preemptive SEV4 was the first element of a sequenced response:

| Time (UTC) | Event |
| --- | --- |
| 18:41 | GCP TAM notification |
| 18:42 | Impact assessment began |
| 18:43 | Observed degraded monitoring (third-party vendor partial outage) |
| 19:08 | Preemptive SEV4 declared |
| 19:23 | Cloud-marketplace vendor reported issues |
| 19:41 | Google identified root cause |
| 20:26 | Delayed alert notifications arrived |
| 20:56 | Proactive customer outreach began |
| 21:38 | Incident considered mitigated (severity unchanged at SEV4) |

The preemptive declaration bought 78 minutes of preparation time before the first observable impact (the 20:26 delayed alerts). During that window the team was organised rather than scrambling.
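The width of the preparation window follows directly from the timeline:

```python
from datetime import datetime

declared = datetime.fromisoformat("2025-06-12T19:08")      # preemptive SEV4 declared
first_signal = datetime.fromisoformat("2025-06-12T20:26")  # delayed alerts arrive
window = first_signal - declared
print(window)  # 1:18:00, i.e. 78 minutes
```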

Caveats

  • Pattern requires a culture that doesn't penalise false alarms. If SEV4 closures without incident are seen as "crying wolf," teams will stop declaring preemptively and lose the value.
  • Severity taxonomy must exist. SEV4 must be well-defined; some orgs conflate all SEVs into one escalation path.
  • Not a substitute for good monitoring. Preemptive SEVs work best when paired with observability that will confirm real impact — otherwise the team is flying blind.
  • Declaring a preemptive SEV can itself cost attention. If each declaration pages or distracts an engineer, overuse is wasteful.
  • Customer communication policy must be explicit. Preemptive SEVs that leak into customer comms create reputational cost for risk that never materialises.
