Anomaly vs incident separation¶
Anomaly vs incident separation is an incident-management process pattern in which alerts that reach on-call engineers are triaged into two distinct tracks:
- Anomaly — a real signal but not (yet) user-impacting: investigate, possibly tune the alert or file a ticket, no responder engagement, no postmortem.
- Incident — user impact confirmed: engage responders, mitigate, follow incident process, postmortem.
The separation cuts the ceremony cost of false positives without dropping the signal, and is complementary to concepts/symptom-based-alerting and concepts/multi-window-multi-burn-rate.
The problem¶
Traditional incident processes treat every page as an incident. This conflates two very different things:
- "Error rate on /checkout spiked for 30 seconds then recovered" — no user impact, likely a transient network event or a cold start. Historically, this still triggered incident ceremony (war room, status update cadence, postmortem obligation).
- "Error rate on /checkout has been elevated for 10 minutes, users are seeing failed orders" — real incident requiring coordinated response.
Running both through the same process means real incidents compete for attention with anomalies, and on-call engineers pay postmortem-writing overhead on noise.
The separation¶
Zalando names the split:
"We took it a step further and devised a new incident process that separated Anomalies and Incidents." — sources/2021-10-14-zalando-tracing-sres-journey-part-iii
The post doesn't spell out the full process mechanics, but the implied distinction is:
| Aspect | Anomaly | Incident |
|---|---|---|
| User impact | None / unconfirmed | Confirmed |
| Response | Investigate (by responder or async) | Engage incident responders |
| Coordination | None | Incident Commander, status updates |
| Post-event | Ticket, tune alert, maybe root-cause note | Postmortem obligation |
| Drives KPIs | False positive rate | Incident count, MTTR, customer impact |
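The post doesn't describe Zalando's tooling, but a minimal sketch of how the two tracks could be modelled in an incident-tracking system (all names here are hypothetical assumptions, not from the source):

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Track(Enum):
    ANOMALY = "anomaly"    # real signal, no confirmed user impact
    INCIDENT = "incident"  # confirmed user impact, full incident process


@dataclass
class TriagedEvent:
    alert_name: str
    fired_at: datetime
    track: Track
    user_impact_confirmed: bool = False
    notes: list[str] = field(default_factory=list)

    @property
    def needs_postmortem(self) -> bool:
        # Only the incident track carries the postmortem obligation.
        return self.track is Track.INCIDENT


def triage(alert_name: str, fired_at: datetime, user_impact_confirmed: bool) -> TriagedEvent:
    """Initial triage by the on-call engineer: confirm or rule out user impact."""
    track = Track.INCIDENT if user_impact_confirmed else Track.ANOMALY
    return TriagedEvent(alert_name, fired_at, track, user_impact_confirmed)
```

Anomalies then feed the false-positive / alert-tuning loop, while incidents drive the coordination and postmortem machinery in the table above.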
How it composes with other alerting primitives¶
Anomaly/incident separation sits downstream of alerting primitives, not instead of them:
- concepts/symptom-based-alerting — alert on user-visible symptoms rather than internal causes, which reduces false positives by construction.
- concepts/multi-window-multi-burn-rate — derive alert thresholds from error-budget burn rate rather than raw error rate, which reduces false positives from transient spikes.
- Anomaly vs incident separation (this concept) — for the alerts that still fire and turn out not to be user-impacting, reduce the ceremony cost and keep the metrics separate.
The three layers progressively reduce noise and the cost of noise.
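A toy sketch of how the three layers might stack, assuming a single request-based SLI; the function names, windows, and the 14.4 fast-burn threshold are illustrative choices (the standard multi-window recipe for a 30-day SLO window), not details from the source:

```python
# Layer 1: symptom-based alerting -- the SLI counts a user-visible symptom
#          (e.g. failed checkout requests), not an internal cause.
# Layer 2: multi-window multi-burn-rate -- page only when the error budget is
#          burning fast over both a long and a short window.
# Layer 3: anomaly vs incident separation -- triage whatever still pages.

def burn_rate(errors: int, requests: int, slo_error_budget: float) -> float:
    """Observed error ratio divided by the budgeted ratio (e.g. 0.001 for a 99.9% SLO)."""
    if requests == 0:
        return 0.0
    return (errors / requests) / slo_error_budget


def should_page(long_window: tuple[int, int],
                short_window: tuple[int, int],
                slo_error_budget: float,
                threshold: float = 14.4) -> bool:
    """Layer 2 gate: both windows must burn faster than the threshold, so a
    30-second spike that has already recovered does not page."""
    return (burn_rate(*long_window, slo_error_budget) >= threshold
            and burn_rate(*short_window, slo_error_budget) >= threshold)


# Layer 3 happens after the page: if the on-call engineer cannot confirm user
# impact, the event is closed as an anomaly (tune the alert, file a ticket)
# rather than run through the incident process.
```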
KPI implications¶
Separating anomalies from incidents lets the KPIs in concepts/sre-kpi-portfolio be measured distinctly:
- False positive rate — anomalies-that-were-paged / total-pages. Directly actionable for alert tuning.
- Incident count — only counts confirmed user-impact events. Doesn't balloon when alert noise increases.
- MTTR — average time to resolve real incidents. Without the separation, MTTR is polluted by 30-second anomalies with zero mitigation time.
- Customer impact — measurable only on incidents, by construction.
Without the separation, these KPIs all drift together when alert noise changes, making it impossible to tell whether the reliability program is improving or degrading.
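A back-of-the-envelope sketch of computing this KPI split from a log of triaged pages; the record shape and field names are assumptions for illustration:

```python
from datetime import timedelta


def page_kpis(pages: list[dict]) -> dict:
    """Each page record: {"track": "anomaly" | "incident",
    "fired_at": datetime, "resolved_at": datetime | None}."""
    anomalies = [p for p in pages if p["track"] == "anomaly"]
    incidents = [p for p in pages if p["track"] == "incident"]

    false_positive_rate = len(anomalies) / len(pages) if pages else 0.0

    resolution_times = [p["resolved_at"] - p["fired_at"]
                        for p in incidents if p["resolved_at"] is not None]
    mttr = (sum(resolution_times, timedelta()) / len(resolution_times)
            if resolution_times else None)

    return {
        "false_positive_rate": false_positive_rate,  # drives alert tuning
        "incident_count": len(incidents),            # only confirmed user impact
        "mttr": mttr,                                # not diluted by zero-mitigation anomalies
    }
```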
Interaction with on-call health¶
Running incident ceremony on anomalies is a direct on-call health tax — postmortem writing, status updates, and follow-up action items on noise events consume responder time without improving reliability. Anomaly/incident separation is therefore both an alerting-quality lever and an on-call-health lever.
Caveats¶
- Classification is not always fast. Some events start as anomalies and escalate to incidents as impact becomes visible. The process must support promotion from anomaly → incident without losing context (one possible shape is sketched after this list).
- Under-classification risk. If the process makes it too easy to close an event as "anomaly", real low-level incidents get buried. Needs audit — some fraction of anomalies should be re-reviewed.
- Doesn't replace alert-quality work. If the anomaly bucket is growing, the right answer is upstream: fix the alert, not normalise the anomaly ceremony.
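One possible shape for the anomaly → incident promotion mentioned in the first caveat, keeping the investigation context attached; names are hypothetical:

```python
from dataclasses import dataclass, replace
from datetime import datetime


@dataclass(frozen=True)
class Event:
    alert_name: str
    fired_at: datetime
    track: str = "anomaly"          # every event starts in the anomaly track
    timeline: tuple[str, ...] = ()  # investigation notes accumulated so far


def promote(event: Event, impact_note: str, now: datetime) -> Event:
    """Promote an anomaly to an incident without losing context: responders
    inherit the timeline built up during the anomaly investigation."""
    entry = f"{now.isoformat()}: promoted to incident: {impact_note}"
    return replace(event, track="incident", timeline=event.timeline + (entry,))
```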
Seen in¶
- sources/2021-10-14-zalando-tracing-sres-journey-part-iii — Zalando's SRE department defined a new incident process in 2020 that separates Anomalies and Incidents and complements Symptom Based Alerting; the post cites a reduced false-positive rate and improved on-call engineer health as the outcomes.