Symptom-based alerting¶
Symptom-based alerting is the alerting strategy in which the primary alerting rules fire on user-visible symptoms (a business operation's error rate or latency SLO breach, checkout failure rate, search availability) rather than on internal cause metrics (CPU, disk, individual-service error rate, queue depth). Cause metrics still exist for diagnosis but are not what pages the on-call.
Definition¶
"This alert handler was also a game changer in our push for a different alerting strategy: Symptom Based Alerting." — sources/2021-09-20-zalando-tracing-sres-journey-part-ii
The strategy has three pillars:
- Alert on the user's experience. If the user doesn't see a degradation, the machine is allowed to be unhealthy. If the user does see a degradation, someone gets paged — regardless of whether any individual subsystem looks broken.
- Keep cause metrics as diagnostic panes. CPU, disk, lock contention, queue depth — all still collected and graphed, but they're for debugging after the page, not for firing the page.
- Per-user-journey / per-business-operation granularity. The symptom is measured at the Critical Business Operation altitude, not per service.
Why symptom-based alerting outperforms cause-based at scale¶
Classical cause-based alerting (alert on CPU, on service error rate, on disk utilisation, on individual RPCs) fails in three ways at fleet scale:
- Per-service alert sprawl. 4,000 services × 10 rules each = 40,000 alerting rules. Every rule is a chance for noise, a maintenance burden, and a stale-threshold risk. See concepts/alert-fatigue.
- Local health ≠ global health. Every component can look green while the compound user journey is breaking. A user journey spanning three services, each with an independent 0.5% error rate, sees a ~1.5% compound error rate — and no individual service's cause-alert fires.
- Cause space is open-ended. Services have bugs you've never seen; a cause-metric alert set will always have coverage gaps. Symptom alerts catch unknown combinations for free: the symptom fires whenever the user is degraded, whatever the root cause.
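The compound-error arithmetic above can be checked directly. A minimal sketch (assuming independent failures across the services in a journey, which is itself an idealisation):

```python
# Why per-service cause alerts miss compound journey failures: each service
# sits below a typical per-service threshold, but the journey does not.

def compound_error_rate(per_service_rates):
    """Probability that at least one service in the journey fails,
    assuming independent failures."""
    success = 1.0
    for rate in per_service_rates:
        success *= (1.0 - rate)
    return 1.0 - success

# Three services at 0.5% error rate each -- none would trip a 1% cause alert.
journey = [0.005, 0.005, 0.005]
print(f"compound journey error rate: {compound_error_rate(journey):.2%}")  # 1.49%
```

A symptom alert measured at the journey's root would see the full ~1.5% and fire.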
The core tradeoff: symptom-based alerting is more robust but provides less immediate diagnostic signal. You page on a CBO error rate of 2%, and the on-call has to figure out why — so the strategy depends on fast root-cause tooling being in place.
The "who do we page" gap¶
Symptom-based alerting surfaces a routing problem: if the symptom is "checkout error rate is up" and 15 services touch checkout, who gets paged? Naive resolutions — page the CBO owner, or page a central on-call SRE rotation — either load one team with every incident or add an indirection hop. Neither scales.
Zalando's answer is Adaptive Paging, which pairs with symptom-based alerting to close the loop: use the live distributed-tracing graph at alert time to identify the service whose span is the likely root cause, and page that team directly. The two primitives are explicitly co-designed in the Zalando framing:
"This alert handler [adaptive paging] was also a game changer in our push for a different alerting strategy: Symptom Based Alerting."
Google SRE book lineage¶
Symptom-based alerting is Google SRE-book orthodoxy:
"The most effective alerts are directly tied to symptoms of a problem." — Google SRE Book, Ch. 4 Service Level Objectives; Ch. 10 Practical Alerting from Time-Series Data.
Zalando's novelty is not the philosophy — it's the platform substrate that makes it feasible: fleet-wide distributed tracing + a CBO taxonomy + adaptive paging are the load-bearing prerequisites that let a 7-person SRE team roll symptom-based alerting out company-wide instead of as a per-team opt-in.
Prerequisites¶
- A defined CBO taxonomy — what counts as a user-facing operation worth alerting on?
- Fleet-wide distributed tracing — for both the CBO-level error-rate measurement (root-span aggregation) and for adaptive paging's routing decision.
- Fast root-cause tooling — because the page no longer tells you what's broken, the investigation stack (traces, logs, dashboards keyed by trace ID) must let the on-call orient in under five minutes.
- An ownership map — "service X is owned by team Y" so adaptive paging can translate the identified span into a team.
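The root-span aggregation mentioned above can be sketched minimally. This is a hedged illustration of the mechanics, assuming each CBO's error rate is measured from the root spans of its traces (the span records and CBO names are hypothetical):

```python
# CBO-level error rate from root spans: one counter pair per business
# operation, regardless of how many services each journey touches.
from collections import defaultdict

# Hypothetical root spans, each tagged with the CBO it belongs to.
root_spans = [
    {"cbo": "checkout", "error": False},
    {"cbo": "checkout", "error": True},
    {"cbo": "checkout", "error": False},
    {"cbo": "search",   "error": False},
]

totals, errors = defaultdict(int), defaultdict(int)
for span in root_spans:
    totals[span["cbo"]] += 1
    errors[span["cbo"]] += span["error"]

for cbo in totals:
    print(f"{cbo}: {errors[cbo] / totals[cbo]:.1%} error rate")
```

One aggregation per CBO replaces thousands of per-service rules, which is what makes the strategy tractable for a small central team.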
Anti-patterns¶
- Keeping cause alerts "just in case." The pull back toward cause-based alerting is strong because cause alerts feel more specific. Guard against it by explicitly retiring service-level alerts once the CBO-level alert covers the same failure mode.
- No error budget / SLO backing the alert threshold. A symptom-based alert with arbitrary thresholds ("page if error rate > 0.1%") still produces fatigue. The threshold should come from the CBO's SLO + error-budget policy, not from intuition. See concepts/operation-based-slo.
- Symptom alerts without fast diagnostic tooling. If your on-call gets paged with "checkout is broken" and has to grep logs for 30 minutes to figure out which subsystem, the strategy fails on MTTR.
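Deriving the threshold from the SLO rather than intuition can be made concrete. A minimal sketch in the multiwindow burn-rate style popularised by the Google SRE workbook; the SLO target and budget fractions below are illustrative, not from the source:

```python
# Alert threshold derived from the CBO's error budget, not from intuition.

SLO = 0.999                 # illustrative 99.9% availability target
ERROR_BUDGET = 1 - SLO      # 0.1% of requests may fail per 30-day period

def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = 30 * 24) -> float:
    """Error rate at which `budget_fraction` of the period's budget
    would burn within `window_hours`."""
    burn_rate = budget_fraction * period_hours / window_hours
    return burn_rate * ERROR_BUDGET

# Page when 2% of the monthly budget would burn within one hour
# (burn rate 14.4x), i.e. at a 1.44% observed error rate.
page_at = burn_rate_threshold(0.02, window_hours=1)
print(f"page if CBO error rate > {page_at:.2%}")
```

An arbitrary "page if error rate > 0.1%" rule would fire at every transient blip; the budget-derived threshold only fires when the SLO is genuinely at risk.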
Seen in¶
- sources/2021-09-20-zalando-tracing-sres-journey-part-ii — Zalando's 2019 alerting-strategy shift, explicitly paired with Adaptive Paging as the routing solution. Public slides at github.com/zalando/public-presentations.
Related¶
- concepts/symptom-vs-cause-metric — the underlying metric-design distinction this strategy is built on.
- concepts/adaptive-paging — the routing primitive that solves the "who do we page" gap.
- concepts/critical-business-operation — the primitive that defines what to alert on.
- concepts/alert-fatigue — the failure mode cause-based alerting accelerates and symptom-based alerting resists.
- concepts/operation-based-slo — the SLO granularity that provides the threshold for symptom alerts.
- concepts/observability
- systems/zalando-adaptive-paging