
PATTERN

Shadow-mode alert before paging

What this is

Shadow-mode alert before paging is the pattern of deploying any new alerting source — a new monitor, a new probe, a new symptom rule — in an email-only (or Slack-only) low-severity channel first, iterating on its false-positive rate against live production data, and only promoting it to paging after the false-positive rate observably drops to zero.

It is an alerting-system analogue of canary deployment: fully wired to production data, but observer-only in notification terms, validated empirically before promotion.
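Mechanically, shadow mode is just a routing decision: the rule and threshold are final, and only the notification sink changes on promotion. A minimal sketch — the `Alert` type and channel names are illustrative, not from the source:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    shadow: bool  # True while the rule is still being validated against production

def route(alert: Alert) -> str:
    """Identical rule, identical evaluation; only the notification sink differs."""
    if alert.shadow:
        # Noise-tolerant bucket: every fire is an iteration signal, not a page.
        return "email:team-alerts"
    # Promoted: the rule has earned the right to wake someone up.
    return "pager:on-call"

print(route(Alert("checkout-probe", shadow=True)))   # email:team-alerts
print(route(Alert("checkout-probe", shadow=False)))  # pager:on-call
```

Promotion is then a one-line flip of the `shadow` flag, which is what makes the gate cheap to hold.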

Why

New alerting sources — especially those with real-browser, real-network, or real-third-party dependencies — cannot be reasoned about in isolation. Their false-positive rate is an empirical property of the production environment they run against. Two failure modes happen without this gate:

  • Pager fatigue — early false positives train on-call to ignore the channel. See concepts/alert-fatigue.
  • Silent miss — the team silences a noisy alert, loses the signal, fails to catch a real incident later.

Shadow mode defuses both: the alert fires fully against production, but into an email bucket that can tolerate noise without burning out humans. Each trigger becomes an iteration signal rather than a 3 AM page.

Shape (Zalando instantiation)

Source: sources/2024-07-18-zalando-end-to-end-test-probes-with-playwright:

  1. Deploy the alert rule fully against live production data. Same metric, same threshold, same evaluation window as the final config.
  2. Route to low-severity channel only. Zalando used an email to the team. No pager.
  3. Capture debugging artifacts on every fire. Playwright HTML reports + traces + videos, per-trigger, preserved for post-hoc diagnosis.
  4. Iterate on each trigger. For every shadow-mode fire:
    • Diagnose: real problem vs selector flake vs timing flake.
    • Fix: tighten the selector, add a local expect.toPass retry, scope selectors with the :visible pseudo-class to skip hidden matches, adjust the threshold.
  5. Wait for zero-trigger windows. Zalando explicitly called out weekends as the trigger-heavy window: "it did trigger a couple of times, especially over the weekend."
  6. Promote only when the trigger stream goes silent. "After a few weeks, we stopped getting alerts in shadow mode and enabled paging when those tests failed." Not a time-in-shadow target; an observed zero-trigger interval.
  7. Keep the shadow channel live after promotion (not required, but recommended) so new regressions resurface in the team's feedback loop.
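The promotion gate in steps 5–6 can be sketched as a pure function over the trigger stream. The 14-day quiet window below is an illustrative default, not a number from the Zalando post — their criterion was simply an observed multi-week silence:

```python
from datetime import datetime, timedelta

def ready_to_promote(deployed_at: datetime,
                     trigger_times: list[datetime],
                     now: datetime,
                     quiet_window: timedelta = timedelta(days=14)) -> bool:
    """Promote on an observed zero-trigger interval, not on elapsed time in shadow."""
    last_event = max(trigger_times, default=deployed_at)
    return now - last_event >= quiet_window

deployed = datetime(2024, 5, 1)
fires = [datetime(2024, 5, 4), datetime(2024, 5, 11)]  # early flakes, since fixed
print(ready_to_promote(deployed, fires, now=datetime(2024, 5, 20)))  # False: 9 quiet days
print(ready_to_promote(deployed, fires, now=datetime(2024, 6, 1)))   # True: 21 quiet days
```

Note the contrast with a fixed-duration gate: a rule that fired yesterday fails this check no matter how long it has been deployed.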

Preconditions

  • Multi-severity alerting system with at least two notification channels (email/Slack and pager).
  • Per-trigger artifact capture — without this, each shadow-mode fire is unactionable.
  • Explicit ownership — someone reviews the low-severity bucket on a cadence; otherwise it's silence.
  • Discipline to hold the line — the alert stays in shadow mode until it's actually clean, even when that takes longer than the team expects.

When to use

  • New alerting sources with real-world dependencies (browsers, third-party APIs, CDN cache behaviour, hydration timing).
  • New e2e test probes — canonical shape in the Zalando 2024 post.
  • Refactored alert thresholds on existing rules where the new threshold's noise distribution is unknown.
  • Symptom alerts derived from new metrics — especially when the metric is a novel derived quantity (latency percentile on a new aggregation, burn-rate on a new SLO window).

When to skip

  • Alerts derived from already-well-calibrated alerts (same metric, same noise distribution, trivially different threshold).
  • Alerts with rigorously-defined math (MWMBR burn rates on a metric with measured noise) — though even here, a short shadow period usually catches edge cases.
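For context on the burn-rate exception: a multiwindow, multi-burn-rate (MWMBR) page fires only when both a long and a short window exceed the same burn threshold, which is the kind of rigorously-defined condition that can sometimes skip shadow mode. A sketch using the conventional numbers from the Google SRE Workbook — illustrative, not from the Zalando source:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Budget consumption speed: 1.0 means the budget lasts exactly the SLO period."""
    return error_ratio / (1.0 - slo_target)

def should_page(err_1h: float, err_5m: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    # Long window establishes significance; short window confirms it is still happening.
    return (burn_rate(err_1h, slo_target) >= threshold
            and burn_rate(err_5m, slo_target) >= threshold)

print(should_page(err_1h=0.015, err_5m=0.015))   # True: both windows burning hot
print(should_page(err_1h=0.015, err_5m=0.0002))  # False: incident already over
```

Because the noise behaviour of such a rule follows directly from the measured noise of the underlying metric, a long shadow period adds little — though, as noted above, a short one still catches edge cases.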

Contrast

  • Feature flag — controls user-visible behaviour, not notification routing.
  • Dark deploy — code live but unreachable; pattern here is code fully reachable, notification dark.
  • Controlled rollout with traffic ramp-up — percentage rollout; this pattern is binary on severity routing.

Anti-patterns

  • Fixed-duration shadow — "two weeks in shadow, then promote" without zero-trigger verification promotes alerts that are still flaky. Zalando's criterion is observed silence, not elapsed time.
  • Skipping artifact capture — a shadow fire without HTML report / trace / logs is a lost iteration opportunity; the alert will fire again with the same root cause still undiagnosed.
  • Unowned low-severity bucket — if nobody reads the email, shadow mode is silence, and the pattern degenerates into a deployment delay.

Seen in

  • sources/2024-07-18-zalando-end-to-end-test-probes-with-playwright — canonical wiki instance. Zalando promotes three Playwright e2e test probes to paging only after a multi-week email-only shadow-mode validation. Per-trigger HTML reports + traces drove iteration; fixes landed as tighter selectors, expect.toPass retries, and :visible pseudo-class augmentations. Final post-promotion false-positive rate: 0% (the only pager firing was on a real incident).