CONCEPT
Shadow-mode alert validation¶
Definition¶
Shadow-mode alert validation is the discipline of deploying a new alerting rule (or a new symptom source, such as an e2e test probe) to a lower-severity, non-paging channel — typically email or a team-only Slack channel — until its false-positive rate is empirically driven to zero, and only then promoting it to paging. It is a per-alert validation gate that runs in the production environment against real traffic and real production symptoms, using real-world flakiness as the iteration signal.
Three defining properties:
- Production-live signal, low-severity routing. The alert rule is fully wired against live production data, not a test environment; the only difference from full production deployment is the notification channel.
- Captured debugging artifacts on every fire. Every shadow-mode alert is treated as a failure case to investigate; HTML reports, traces, logs, videos are preserved to drive iteration.
- Explicit exit criterion. Promotion to paging is gated on "we stopped getting alerts in shadow mode" — i.e. a measurable period with zero false positives, not a time-in-shadow milestone.
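The channel split in the first property can be sketched as a severity router keyed on validation state. This is a minimal, hypothetical illustration (the `Alert` type, `route` function, and channel names are not from the source):

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    validated: bool  # has this rule passed shadow-mode validation?

def route(alert: Alert) -> str:
    """Route validated alerts to the pager; everything else stays in
    the low-severity shadow channel (email / team-only Slack)."""
    return "pager" if alert.validated else "email-shadow"

# A freshly deployed e2e-probe alert stays email-only...
print(route(Alert("checkout-probe", validated=False)))  # email-shadow
# ...and pages only after shadow mode drove false positives to zero.
print(route(Alert("checkout-probe", validated=True)))   # pager
```

The point of the sketch is that the rule itself is identical in both states; only the routing decision changes at promotion time.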
Why the gate is mandatory¶
For every new alerting source, the two worst outcomes are:
- Pager fatigue from false positives — on-call learns to ignore the channel, eventually misses a real incident. See concepts/alert-fatigue.
- Silenced alerts that hide real issues — the alert fires, nobody looks, the problem festers.
Shadow mode defuses both. Low-severity routing means the alert can fire N times a day without paging anyone at 3 AM; the email bucket can be reviewed during business hours, and each trigger informs the next iteration. The explicit exit criterion stops the team from promoting prematurely.
The iteration loop¶
Zalando's concrete shadow-mode iteration cycle (sources/2024-07-18-zalando-end-to-end-test-probes-with-playwright):
```
┌──────────────────────────────┐
│     Shadow-mode trigger      │
│  (email-only, low severity)  │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│  Capture debugging artifact  │
│  (HTML report, trace, logs)  │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│     Diagnose root cause      │
│   (real problem vs flake)    │
└──────────────┬───────────────┘
               │
     ┌─────────┼──────────────────┐
     ▼         ▼                  ▼
Real problem   Selector flake     Timing flake
     │         │                  │
     ▼         ▼                  ▼
Fix / notify   Tighten selector   Add expect.toPass
               (data-testid,      retry, :visible
               role-based)        pseudo-class
     │         │                  │
     └─────────┼──────────────────┘
               ▼
┌──────────────────────────────┐
│    Wait for next shadow      │
│   window (e.g. a weekend)    │
└──────────────┬───────────────┘
               │
               ▼
Zero triggers?    → Promote to paging
Still triggering? → Iterate
```
Zalando's observed time-to-zero for the Playwright probe suite: "a few weeks, we stopped getting alerts in shadow mode and enabled paging when those tests failed."
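The diagnose-and-fix branch of the loop amounts to a classification-to-action mapping. A hedged sketch (the category keys follow the diagram above; the `triage` function and its strings are illustrative, not from the source):

```python
# Map each root-cause class from the iteration loop to its remediation.
REMEDIATION = {
    "real-problem":   "fix the product issue / notify owners",
    "selector-flake": "tighten selector (data-testid, role-based)",
    "timing-flake":   "add expect.toPass retry or :visible pseudo-class",
}

def triage(root_cause: str) -> str:
    """Return the next action for a shadow-mode trigger; unknown
    causes get investigated rather than silently bucketed."""
    return REMEDIATION.get(root_cause, "investigate manually")

print(triage("timing-flake"))
```

The design point is that only the "real-problem" branch represents a true alert; the other two branches improve the probe itself, which is exactly the iteration shadow mode is designed to enable.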
Comparison to adjacent deployment gates¶
| Gate | What's dark | Use case |
|---|---|---|
| Feature flag | User-visible behaviour | Product launch |
| Dark deploy | Code is live but unreachable | Infra migration |
| Shadow traffic | Duplicate requests to new backend | Backend cutover |
| Canary | Small traffic percentage to new version | Gradual rollout |
| Shadow-mode alert | Only notification channel | New alerting source |
Shadow-mode alert validation is the alerting-system analogue of canary deployment: deploy fully, route differently, validate, promote. It shares DNA with shadow-traffic-based backend cutover (production-live signal, observer-only side effect).
When to skip shadow mode¶
The gate is cheap but not zero-cost; some alert sources don't need it:
- Alerts derived from existing well-calibrated alerts (same metric, new threshold) — the upstream signal is already validated.
- Alerts with well-understood mathematical definitions (e.g. multi-window, multi-burn-rate alerts on a metric with a measured noise distribution).
Every alerting source with a novel runtime path — especially one with real-browser / real-network dependencies — should go through shadow mode.
Exit-criterion nuance¶
"Zero alerts in shadow mode" is cleaner than "N weeks without an alert" because flakiness is often concentrated around low-traffic windows (weekend, off-peak). Zalando explicitly named weekend as the trigger-heavy window: "it did trigger a couple of times, especially over the weekend." A time-in-shadow criterion that didn't span a full weekend would miss the weekend-specific flake class.
Dependencies¶
- Multi-severity alerting system — the channel split (email-only vs page) is a hard prerequisite.
- Per-alert artifact capture — HTML report, trace, video, logs. Without this the shadow-mode fire is unactionable.
- Explicit ownership — someone reviews the email bucket daily; otherwise shadow mode is just silence.
- Discipline to hold the line — the scenario stays in shadow mode until truly clean, even if that takes weeks.
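The per-alert artifact-capture dependency can be sketched as a per-trigger archive step: one timestamped directory per fire, with an index of the captured artifacts for the daily review. This is a hypothetical illustration (paths, the helper name, and the index format are assumptions):

```python
import json
import time
from pathlib import Path

def archive_trigger(base_dir: Path, alert_name: str, artifacts: dict) -> Path:
    """Persist a shadow-mode fire as an investigable failure case:
    a timestamped directory holding an index of artifact files
    (HTML report, trace, video, logs)."""
    trigger_dir = base_dir / alert_name / time.strftime("%Y%m%dT%H%M%S")
    trigger_dir.mkdir(parents=True, exist_ok=True)
    (trigger_dir / "index.json").write_text(json.dumps(artifacts, indent=2))
    return trigger_dir

d = archive_trigger(Path("/tmp/shadow-alerts"), "checkout-probe",
                    {"trace": "trace.zip", "report": "report.html"})
print(d / "index.json")
```

Without some equivalent of this step, a shadow-mode fire leaves nothing to diagnose, and the iteration loop stalls at its second box.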
Seen in¶
- sources/2024-07-18-zalando-end-to-end-test-probes-with-playwright — canonical wiki instance. Zalando promotes three Playwright e2e test probes (home→product, catalog→filter→product, product→checkout) to paging only after a multi-week email-only shadow-mode validation that drove the false-positive rate to zero. Iteration signal: per-trigger HTML reports + traces; fixes were selector improvements + local `expect.toPass` retries + `:visible` pseudo-class augmentations. Post-promotion: 0% false-positive rate (the only pager firing was on a real incident).
Related¶
- concepts/alert-fatigue — the primary motivating failure mode.
- concepts/end-to-end-test-probe — canonical alert source that needs this gate.
- concepts/flaky-test — the noise source that shadow mode surfaces and lets the team iterate against.
- concepts/symptom-based-alerting — Zalando's broader alerting strategy; shadow-mode validation is how new symptom sources are onboarded.
- patterns/shadow-mode-alert-before-paging — the pattern form.
- patterns/e2e-test-as-synthetic-probe — the pattern whose canonical deployment shape includes this gate.