
PATTERN

Shadow-mode alert before paging

What this is

Shadow-mode alert before paging is the pattern of deploying any new alerting source — a new monitor, a new probe, a new symptom rule — in an email-only (or Slack-only) low-severity channel first, iterating on its false-positive rate against live production data, and only promoting it to paging after the false-positive rate observably drops to zero.

It is an alerting-system analogue of canary deployment: fully wired to production data, but observer-only in notification terms, validated empirically before promotion.
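Mechanically, shadow mode is just a routing decision: the rule and threshold are final, and only the notification sink changes on promotion. A minimal sketch — the `Alert` type and channel names are illustrative, not from the source:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    shadow: bool  # True while the rule is still being validated against production

def route(alert: Alert) -> str:
    """Identical rule, identical evaluation; only the notification sink differs."""
    if alert.shadow:
        # Noise-tolerant bucket: every fire is an iteration signal, not a page.
        return "email:team-alerts"
    # Promoted: the rule has earned the right to wake someone up.
    return "pager:on-call"

print(route(Alert("checkout-probe", shadow=True)))   # email:team-alerts
print(route(Alert("checkout-probe", shadow=False)))  # pager:on-call
```

Promotion is then a one-line flip of the `shadow` flag, which is what makes the gate cheap to hold.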

Why

New alerting sources — especially those with real-browser, real-network, or real-third-party dependencies — cannot be reasoned about in isolation. Their false-positive rate is an empirical property of the production environment they run against. Two failure modes happen without this gate:

  • Pager fatigue — early false positives train on-call to ignore the channel. See concepts/alert-fatigue.
  • Silent miss — the team silences a noisy alert, loses the signal, fails to catch a real incident later.

Shadow mode defuses both: the alert fires fully against production, but into an email bucket that can tolerate noise without burning out humans. Each trigger becomes an iteration signal rather than a 3 AM page.

Shape (Zalando instantiation)

Source: sources/2024-07-18-zalando-end-to-end-test-probes-with-playwright:

  1. Deploy the alert rule fully against live production data. Same metric, same threshold, same evaluation window as the final config.
  2. Route to low-severity channel only. Zalando used an email to the team. No pager.
  3. Capture debugging artifacts on every fire. Playwright HTML reports + traces + videos, per-trigger, preserved for post-hoc diagnosis.
  4. Iterate on each trigger. For every shadow-mode fire:
    • Diagnose: real problem vs selector flake vs timing flake.
    • Fix: tighten the selector, add a local expect.toPass retry, scope selectors with the :visible pseudo-class to skip hidden matches, adjust the threshold.
  5. Wait for zero-trigger windows. Zalando explicitly called out weekends as the trigger-heavy window: "it did trigger a couple of times, especially over the weekend."
  6. Promote only when the trigger stream goes silent. "After a few weeks, we stopped getting alerts in shadow mode and enabled paging when those tests failed." Not a time-in-shadow target; an observed zero-trigger interval.
  7. Keep the shadow channel live after promotion (not required, but recommended) so new regressions resurface in the team's feedback loop.
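The promotion gate in steps 5–6 can be sketched as a pure function over the trigger stream. The 14-day quiet window below is an illustrative default, not a number from the Zalando post — their criterion was simply an observed multi-week silence:

```python
from datetime import datetime, timedelta

def ready_to_promote(deployed_at: datetime,
                     trigger_times: list[datetime],
                     now: datetime,
                     quiet_window: timedelta = timedelta(days=14)) -> bool:
    """Promote on an observed zero-trigger interval, not on elapsed time in shadow."""
    last_event = max(trigger_times, default=deployed_at)
    return now - last_event >= quiet_window

deployed = datetime(2024, 5, 1)
fires = [datetime(2024, 5, 4), datetime(2024, 5, 11)]  # early flakes, since fixed
print(ready_to_promote(deployed, fires, now=datetime(2024, 5, 20)))  # False: 9 quiet days
print(ready_to_promote(deployed, fires, now=datetime(2024, 6, 1)))   # True: 21 quiet days
```

Note the contrast with a fixed-duration gate: a rule that fired yesterday fails this check no matter how long it has been deployed.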

Preconditions

  • Multi-severity alerting system with at least two notification channels (email/Slack and pager).
  • Per-trigger artifact capture — without this, each shadow-mode fire is unactionable.
  • Explicit ownership — someone reviews the low-severity bucket on a cadence; otherwise it's silence.
  • Discipline to hold the line — the alert stays in shadow mode until it's actually clean, even when that takes longer than the team expects.

When to use

  • New alerting sources with real-world dependencies (browsers, third-party APIs, CDN cache behaviour, hydration timing).
  • New e2e test probes — canonical shape in the Zalando 2024 post.
  • Refactored alert thresholds on existing rules where the new threshold's noise distribution is unknown.
  • Symptom alerts derived from new metrics — especially when the metric is a novel derived quantity (latency percentile on a new aggregation, burn-rate on a new SLO window).

When to skip

  • Alerts derived from already-well-calibrated alerts (same metric, same noise distribution, trivially different threshold).
  • Alerts with rigorously-defined math (MWMBR burn rates on a metric with measured noise) — though even here, a short shadow period usually catches edge cases.
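For context on the burn-rate exception: a multiwindow, multi-burn-rate (MWMBR) page fires only when both a long and a short window exceed the same burn threshold, which is the kind of rigorously-defined condition that can sometimes skip shadow mode. A sketch using the conventional numbers from the Google SRE Workbook — illustrative, not from the Zalando source:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Budget consumption speed: 1.0 means the budget lasts exactly the SLO period."""
    return error_ratio / (1.0 - slo_target)

def should_page(err_1h: float, err_5m: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    # Long window establishes significance; short window confirms it is still happening.
    return (burn_rate(err_1h, slo_target) >= threshold
            and burn_rate(err_5m, slo_target) >= threshold)

print(should_page(err_1h=0.015, err_5m=0.015))   # True: both windows burning hot
print(should_page(err_1h=0.015, err_5m=0.0002))  # False: incident already over
```

Because the noise behaviour of such a rule follows directly from the measured noise of the underlying metric, a long shadow period adds little — though, as noted above, a short one still catches edge cases.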

Contrast

  • Feature flag — controls user-visible behaviour, not notification routing.
  • Dark deploy — code live but unreachable; pattern here is code fully reachable, notification dark.
  • Controlled rollout with traffic ramp-up — percentage rollout; this pattern is binary on severity routing.

Anti-patterns

  • Fixed-duration shadow — "two weeks in shadow, then promote" without zero-trigger verification promotes alerts that are still flaky. Zalando's criterion is observed silence, not elapsed time.
  • Skipping artifact capture — a shadow fire without HTML report / trace / logs is a lost iteration opportunity; the alert will fire again with the same root cause still undiagnosed.
  • Unowned low-severity bucket — if nobody reads the email, shadow mode is silence, and the pattern degenerates into a deployment delay.

Seen in

  • sources/2024-07-18-zalando-end-to-end-test-probes-with-playwright — canonical wiki instance. Zalando promotes three Playwright e2e test probes to paging only after a multi-week email-only shadow-mode validation. Per-trigger HTML reports + traces drove iteration; fixes landed as tighter selectors, expect.toPass retries, and :visible pseudo-class augmentations. Final post-promotion false-positive rate: 0% (the only pager firing was on a real incident).