PATTERN
Shadow-mode alert before paging¶
What this is¶
Shadow-mode alert before paging is the pattern of deploying any new alerting source — a new monitor, a new probe, a new symptom rule — in an email-only (or Slack-only) low-severity channel first, iterating on its false-positive rate against live production data, and promoting it to paging only after the false-positive rate observably drops to zero.
It is an alerting-system analogue of canary deployment: fully wired to production data, but observer-only in notification terms, validated empirically before promotion.
Why¶
New alerting sources — especially those with real-browser, real-network, or real-third-party dependencies — cannot be reasoned about in isolation. Their false-positive rate is an empirical property of the production environment they run against. Two failure modes happen without this gate:
- Pager fatigue — early false positives train on-call to ignore the channel. See concepts/alert-fatigue.
- Silent miss — the team silences a noisy alert, loses the signal, and fails to catch a real incident later.
Shadow mode defuses both: the alert fires fully against production, but into an email bucket that can tolerate noise without burning out humans. Each trigger becomes an iteration signal rather than a 3 AM page.
Shape (Zalando instantiation)¶
Source: sources/2024-07-18-zalando-end-to-end-test-probes-with-playwright:
- Deploy the alert rule fully against live production data. Same metric, same threshold, same evaluation window as the final config.
- Route to low-severity channel only. Zalando used an email to the team. No pager.
- Capture debugging artifacts on every fire. Playwright HTML reports + traces + videos, per-trigger, preserved for post-hoc diagnosis.
- Iterate on each trigger. For every shadow-mode fire:
- Diagnose: real problem vs selector flake vs timing flake.
- Fix: tighten the selector, add a local `expect.toPass` retry, use the `:visible` pseudo-class to exclude non-visible content, adjust the threshold.
- Wait for zero-trigger windows. Zalando explicitly called out weekends as the trigger-heavy window: "it did trigger a couple of times, especially over the weekend."
- Promote only when the trigger stream goes silent. "After a few weeks, we stopped getting alerts in shadow mode and enabled paging when those tests failed." Not a time-in-shadow target; an observed zero-trigger interval.
- Keep the shadow channel live after promotion (not required, but recommended) so new regressions can resurface the team's feedback loop.
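The routing and promotion steps above amount to a small state machine: a shadow-mode fire routes to email and resets the silence clock; promotion happens only after an observed zero-trigger interval, never on elapsed time alone. A minimal TypeScript sketch under that reading — the `ShadowGate` name, the two-channel model, and the 14-day window are illustrative, not from the source:

```typescript
type Channel = "email" | "pager";

// Gate that holds a new alert in shadow mode until an observed
// zero-trigger interval has elapsed. A fire while still in shadow
// resets the silence clock.
class ShadowGate {
  private lastFire: number;
  private promoted = false;

  constructor(private readonly quietMs: number, deployedAtMs: number) {
    this.lastFire = deployedAtMs; // silence clock starts at deployment
  }

  // Called whenever the alert rule fires against live production data.
  onFire(nowMs: number): Channel {
    if (!this.promoted) this.lastFire = nowMs; // shadow fire resets the clock
    return this.promoted ? "pager" : "email";
  }

  // Called on a review cadence: promote only on observed silence.
  maybePromote(nowMs: number): boolean {
    if (!this.promoted && nowMs - this.lastFire >= this.quietMs) {
      this.promoted = true;
    }
    return this.promoted;
  }
}
```

The fire-resets-the-clock detail is what separates "two quiet weeks" from the fixed-duration anti-pattern of "two weeks in shadow, then promote".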
Preconditions¶
- Multi-severity alerting system with at least two notification channels (email/Slack and pager).
- Per-trigger artifact capture — without this, each shadow-mode fire is unactionable.
- Explicit ownership — someone reviews the low-severity bucket on a cadence; otherwise it's silence.
- Discipline to hold the line — the alert stays in shadow mode until it's actually clean, even when that takes longer than the team expects.
When to use¶
- New alerting sources with real-world dependencies (browsers, third-party APIs, CDN cache behaviour, hydration timing).
- New e2e test probes — canonical shape in the Zalando 2024 post.
- Refactored alert thresholds on existing rules where the new threshold's noise distribution is unknown.
- Symptom alerts derived from new metrics — especially when the metric is a novel derived quantity (latency percentile on a new aggregation, burn-rate on a new SLO window).
When to skip¶
- Alerts derived from already-well-calibrated alerts (same metric, same noise distribution, trivially different threshold).
- Alerts with rigorously defined math (multi-window, multi-burn-rate alerting on a metric with measured noise) — though even here, a short shadow period usually catches edge cases.
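For the burn-rate case above, a hedged sketch of the multi-window check, assuming the standard definition burn rate = observed error ratio / error budget, where error budget = 1 − SLO. The threshold value and window shapes are generic examples, not from the source:

```typescript
// Burn rate: how fast the error budget is being consumed.
// 1.0 means the budget is spent exactly at the SLO window's end.
function burnRate(errors: number, total: number, slo: number): number {
  const errorBudget = 1 - slo; // e.g. 0.001 for a 99.9% SLO
  return total === 0 ? 0 : errors / total / errorBudget;
}

// Fire only when both the long and the short window exceed the
// threshold: the long window proves significance, the short window
// proves the burn is still ongoing.
function shouldAlert(
  longWin: { errors: number; total: number },
  shortWin: { errors: number; total: number },
  slo: number,
  threshold: number, // e.g. 14.4 burns a 30-day budget in ~2 days
): boolean {
  return (
    burnRate(longWin.errors, longWin.total, slo) >= threshold &&
    burnRate(shortWin.errors, shortWin.total, slo) >= threshold
  );
}
```

Because both functions are pure arithmetic on measured quantities, their noise distribution is inherited from the metric itself — which is why a long shadow period adds little here.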
Contrast¶
- Feature flag — controls user-visible behaviour, not notification routing.
- Dark deploy — code is live but unreachable; here the code is fully reachable and only the notification is dark.
- Controlled rollout with traffic ramp-up — percentage rollout; this pattern is binary on severity routing.
Anti-patterns¶
- Fixed-duration shadow — "two weeks in shadow then promote" without zero-trigger verification promotes alerts that are still flaky. Zalando's criterion is observed silence, not elapsed time.
- Skipping artifact capture — a shadow fire without HTML report / trace / logs is a lost iteration opportunity; the alert will fire again with the same root cause still undiagnosed.
- Unowned low-severity bucket — if nobody reads the email, shadow mode is silence, and the pattern degenerates into a deployment delay.
Seen in¶
- sources/2024-07-18-zalando-end-to-end-test-probes-with-playwright — canonical wiki instance. Zalando promotes three Playwright e2e test probes to paging only after a multi-week, email-only shadow-mode validation. Per-trigger HTML reports + traces drove iteration; fixes landed as tighter selectors, `expect.toPass` retries, and `:visible` pseudo-class augmentations. Final post-promotion false-positive rate: 0% (the only pager firing was on a real incident).
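The `expect.toPass` fix named above is a local retry: re-run a flaky assertion until it passes or a deadline expires, so a timing flake resolves inside the probe instead of surfacing as a shadow-mode fire. A toy re-implementation of that retry shape — not Playwright's actual code, and the signature is simplified:

```typescript
// Retry an assertion until it passes or the deadline elapses.
// Mirrors the shape of Playwright's expect(...).toPass({ timeout }).
async function toPass(
  assertion: () => void | Promise<void>,
  opts: { timeoutMs?: number; intervalMs?: number } = {},
): Promise<void> {
  const { timeoutMs = 5_000, intervalMs = 100 } = opts;
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    try {
      await assertion();
      return; // passed: the flake never reaches the alert channel
    } catch (err) {
      if (Date.now() >= deadline) throw err; // genuine failure: let it fire
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
}
```

The deadline matters: a retry without one would convert a real regression into a hang, whereas this shape lets genuine failures surface and still fire the alert.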
Related¶
- concepts/shadow-mode-alert-validation — the concept form.
- concepts/alert-fatigue — the motivating failure mode.
- concepts/flaky-test — the noise source the shadow-mode iteration loop addresses.
- concepts/end-to-end-test-probe — canonical alert source that requires this pattern.
- concepts/symptom-based-alerting — the alerting strategy this pattern extends.
- patterns/e2e-test-as-synthetic-probe — the pattern whose deployment includes this gate.
- patterns/controlled-rollout-with-traffic-rampup — adjacent rollout pattern at a different dimension.