CONCEPT Cited by 2 sources
Alert fatigue
The operator-side failure mode in which a notification channel emits so many alerts — especially low-signal, duplicate, or flapping ones — that operators stop treating each alert as meaningful. Once fatigue sets in, real incidents are missed in the noise, and trust in the alerting system collapses.
Causes
- Duplicate notifications per condition. Same underlying problem fires N times per minute as the condition persists, instead of firing once + being updated.
- Flapping / transient conditions alerting on every transition. Conditions that resolve on their own before a human could act still page.
- No SLA / severity gating. Everything pages the same way; the channel that should trigger action has nothing distinguishing it from the background hum.
- No auto-resolution. Alerts stay open until manually cleared, teaching operators that open alerts are normal.
- Coverage without curation. Alerts added eagerly as the system grows, never pruned; the per-alert cost is hidden because it is paid by the on-call.
Structural fixes
- Aggregate per entity — patterns/alarm-aggregation-per-entity; append to an existing open record rather than opening a new one per detection.
- Multi-stage validation before alarming — patterns/multilayered-alarm-validation; require persistence + confidence + consistency checks to cross distinct thresholds before emitting.
- Backtest new alerts against historical data — patterns/alert-backtesting; compute noisiness metrics + firing-count timelines before shipping.
- Auto-close on absence — scheduled job checks whether the condition still exists; if not, close without operator action.
- SLA + escalation — severity determines channel + cadence.
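The append-to-open-record behavior behind per-entity aggregation can be sketched in a few lines of Python. The class, entity, and condition names here are illustrative, not the source's implementation:

```python
from dataclasses import dataclass, field


@dataclass
class OpenAlert:
    """One open record per (entity, condition), accumulating occurrences."""
    entity: str
    condition: str
    occurrences: list = field(default_factory=list)


class Aggregator:
    def __init__(self):
        self._open = {}  # (entity, condition) -> OpenAlert
        self.notifications_sent = 0

    def record_detection(self, entity, condition, ts):
        key = (entity, condition)
        if key in self._open:
            # Condition already alarmed: append, do not re-notify.
            self._open[key].occurrences.append(ts)
        else:
            self._open[key] = OpenAlert(entity, condition, [ts])
            self.notifications_sent += 1  # notify once per open record


# Ten detections of the same persisting condition -> one notification.
agg = Aggregator()
for i in range(10):
    agg.record_detection("camera-7", "no-helmet", ts=i)
print(agg.notifications_sent)  # 1
```

The key design point is that the dedup key is the (entity, condition) pair, not the detection event, so a persisting condition updates one record instead of opening N.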
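A minimal sketch of the multi-stage validation idea, using a simple streak counter for persistence plus a confidence gate. The thresholds and names are assumptions for illustration, not the pattern's actual parameters:

```python
class AlarmValidator:
    """Gate alarms behind confidence and persistence checks."""

    def __init__(self, min_confidence=0.8, persistence=3):
        self.min_confidence = min_confidence
        self.persistence = persistence
        self._streak = 0

    def observe(self, detected: bool, confidence: float) -> bool:
        # Gate 1: confidence -- low-confidence detections don't count.
        if detected and confidence >= self.min_confidence:
            self._streak += 1
        else:
            self._streak = 0  # any gap resets the persistence clock
        # Gate 2: persistence -- must hold for N consecutive checks.
        return self._streak >= self.persistence


v = AlarmValidator()
readings = [(True, 0.9), (True, 0.95), (True, 0.5),
            (True, 0.9), (True, 0.9), (True, 0.9)]
fired = [v.observe(d, c) for d, c in readings]
# The low-confidence blip resets the streak; only the last reading alarms.
```

A transient condition that cannot sustain the streak never pages, which is exactly the flapping cause above being filtered before it reaches a human.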
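A toy version of the backtesting idea: replay a candidate threshold rule over a historical series and count distinct firing episodes. The episode-counting logic is an illustrative simplification of the noisiness metrics Airbnb's tooling reports:

```python
def backtest_rule(history, predicate):
    """Replay a candidate alert rule over historical (ts, value) samples
    and return the timestamps where a new firing episode begins."""
    episodes, was_firing = [], False
    for ts, value in history:
        firing = predicate(value)
        if firing and not was_firing:
            episodes.append(ts)  # a new firing episode starts here
        was_firing = firing
    return episodes


# A flapping metric: three separate episodes in six samples -- noisy rule.
history = [(0, 10), (1, 95), (2, 10), (3, 96), (4, 10), (5, 97)]
episodes = backtest_rule(history, lambda v: v > 90)
```

Surfacing this firing-count timeline at review time is what makes the per-alert cost visible before the alert ships.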
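The auto-close sweep can be sketched as a periodic job over the open-alert store. The dict layout, grace window, and presence check are assumptions, not the source's schema:

```python
def sweep_auto_close(open_alerts, condition_still_present, now, grace=600):
    """Scheduled sweep: close any open alert whose condition is gone.

    `open_alerts` maps key -> {"last_seen": ts}; an alert is closed when
    the condition is absent and unseen for longer than the grace window.
    """
    closed = []
    for key in list(open_alerts):
        alert = open_alerts[key]
        if (not condition_still_present(key)
                and now - alert["last_seen"] > grace):
            closed.append((key, open_alerts.pop(key)))
    return closed


alerts = {
    ("camera-7", "no-helmet"): {"last_seen": 100},
    ("camera-9", "blocked-exit"): {"last_seen": 950},
}
# Pretend only camera-9's condition is still observable right now.
closed = sweep_auto_close(alerts, lambda k: k[0] == "camera-9", now=1000)
```

Here camera-7's stale alert is closed without operator action, while camera-9's live condition stays open, so an open alert keeps meaning "something is still wrong".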
Named sources
- sources/2026-04-01-aws-automate-safety-monitoring-with-computer-vision-and-generative-ai — explicit framing: "This function intelligently aggregates risks per camera per use case to avoid alert fatigue. Instead of bombarding safety teams with duplicate notifications, the system appends new occurrences to existing open risks." The full remediation stack — multilayered validation pre-alarm + per-entity aggregation post-alarm + auto-close on absence + SLA escalation — is built specifically around the fatigue concern.
- sources/2026-03-04-airbnb-alert-backtesting-change-reports — Airbnb's retrospective frames what looked like an alert-hygiene culture problem as a workflow problem: once engineers could backtest + see noisiness numbers at PR time, "engineers stopped tolerating noisy alerts and started competing to improve them." 90% reduction in company-wide alert noise after tooling landed.
Related
- concepts/observability — the broader surface this concept sits inside.
- patterns/alarm-aggregation-per-entity + patterns/multilayered-alarm-validation + patterns/alert-backtesting — the named structural fixes.