Symptom-based alerting¶
Symptom-based alerting is the alerting strategy in which the primary alerting rules fire on user-visible symptoms (a business operation's error rate or latency SLO breach, checkout failure rate, search availability) rather than on internal cause metrics (CPU, disk, individual-service error rate, queue depth). Cause metrics still exist for diagnosis but are not what pages the on-call.
Definition¶
"This alert handler was also a game changer in our push for a different alerting strategy: Symptom Based Alerting." — sources/2021-09-20-zalando-tracing-sres-journey-part-ii
The strategy has three pillars:
- Alert on the user's experience. If the user doesn't see a degradation, the machine is allowed to be unhealthy. If the user does see a degradation, someone gets paged — regardless of whether any individual subsystem looks broken.
- Keep cause metrics as diagnostic panes. CPU, disk, lock contention, queue depth — all still collected and graphed, but they're for debugging after the page, not for firing the page.
- Per-user-journey / per-business-operation granularity. The symptom is measured at the Critical Business Operation altitude, not per service.
Why symptom-based alerting outperforms cause-based at scale¶
Classical cause-based alerting (alert on CPU, on service error rate, on disk utilisation, on individual RPCs) fails in three ways at fleet scale:
- Per-service alert sprawl. 4,000 services × 10 rules each = 40,000 alerting rules. Every rule is a chance for noise, a maintenance burden, and a stale-threshold risk. See concepts/alert-fatigue.
- Local health ≠ global health. Every component can look green while the compound user journey is breaking. A user journey spanning three services, each with an independent 0.5% error rate, sees a ~1.5% compound error rate — and no individual service's cause-alert fires.
- Cause space is open-ended. Services have bugs you've never seen; a cause-metric alert set will always have coverage gaps. Symptom alerts catch unknown combinations for free: the symptom fires whenever the user is degraded, whatever the root cause.
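The compound-error arithmetic above can be checked directly. A minimal sketch (assuming independent failures across the services in a journey, which is itself an idealisation):

```python
# Why per-service cause alerts miss compound journey failures: each service
# sits below a typical per-service threshold, but the journey does not.

def compound_error_rate(per_service_rates):
    """Probability that at least one service in the journey fails,
    assuming independent failures."""
    success = 1.0
    for rate in per_service_rates:
        success *= (1.0 - rate)
    return 1.0 - success

# Three services at 0.5% error rate each -- none would trip a 1% cause alert.
journey = [0.005, 0.005, 0.005]
print(f"compound journey error rate: {compound_error_rate(journey):.2%}")  # 1.49%
```

A symptom alert measured at the journey's root would see the full ~1.5% and fire.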
The core tradeoff: symptom-based alerting is more robust but provides less immediate diagnostic signal. You page on a CBO error rate of 2%, and the on-call has to figure out why — so the strategy depends on fast root-cause tooling being in place.
The "who do we page" gap¶
Symptom-based alerting surfaces a routing problem: if the symptom is "checkout error rate is up" and 15 services touch checkout, who gets paged? Naive resolutions — page the CBO owner, or page a central on-call SRE rotation — either load one team with every incident or add an indirection hop. Neither scales.
Zalando's answer is Adaptive Paging, which pairs with symptom-based alerting to close the loop: use the live distributed-tracing graph at alert time to identify the service whose span is the likely root cause, and page that team directly. The two primitives are explicitly co-designed in the Zalando framing:
"This alert handler [adaptive paging] was also a game changer in our push for a different alerting strategy: Symptom Based Alerting."
Google SRE book lineage¶
Symptom-based alerting is Google SRE-book orthodoxy:
"The most effective alerts are directly tied to symptoms of a problem." — Google SRE Book, Ch. 4 Service Level Objectives; Ch. 10 Practical Alerting from Time-Series Data.
Zalando's novelty is not the philosophy — it's the platform substrate that makes it feasible: fleet-wide distributed tracing + a CBO taxonomy + adaptive paging are the load-bearing prerequisites that let a 7-person SRE team roll symptom-based alerting out company-wide instead of as a per-team opt-in.
Prerequisites¶
- A defined CBO taxonomy — what counts as a user-facing operation worth alerting on?
- Fleet-wide distributed tracing — for both the CBO-level error-rate measurement (root-span aggregation) and for adaptive paging's routing decision.
- Fast root-cause tooling — because the page no longer tells you what's broken, the investigation stack (traces, logs, dashboards keyed by trace ID) must let the on-call orient in under five minutes.
- An ownership map — "service X is owned by team Y" so adaptive paging can translate the identified span into a team.
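The root-span aggregation mentioned above can be sketched minimally. This is a hedged illustration of the mechanics, assuming each CBO's error rate is measured from the root spans of its traces (the span records and CBO names are hypothetical):

```python
# CBO-level error rate from root spans: one counter pair per business
# operation, regardless of how many services each journey touches.
from collections import defaultdict

# Hypothetical root spans, each tagged with the CBO it belongs to.
root_spans = [
    {"cbo": "checkout", "error": False},
    {"cbo": "checkout", "error": True},
    {"cbo": "checkout", "error": False},
    {"cbo": "search",   "error": False},
]

totals, errors = defaultdict(int), defaultdict(int)
for span in root_spans:
    totals[span["cbo"]] += 1
    errors[span["cbo"]] += span["error"]

for cbo in totals:
    print(f"{cbo}: {errors[cbo] / totals[cbo]:.1%} error rate")
```

One aggregation per CBO replaces thousands of per-service rules, which is what makes the strategy tractable for a small central team.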
Anti-patterns¶
- Keeping cause alerts "just in case." The pull back toward cause-based alerting is strong because cause alerts feel more specific. Guard against it by explicitly retiring service-level alerts once the CBO-level alert covers the same failure mode.
- No error budget / SLO backing the alert threshold. A symptom-based alert with arbitrary thresholds ("page if error rate > 0.1%") still produces fatigue. The threshold should come from the CBO's SLO + error-budget policy, not from intuition. See concepts/operation-based-slo.
- Symptom alerts without fast diagnostic tooling. If your on-call gets paged with "checkout is broken" and has to grep logs for 30 minutes to figure out which subsystem, the strategy fails on MTTR.
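Deriving the threshold from the SLO rather than intuition can be made concrete. A minimal sketch in the multiwindow burn-rate style popularised by the Google SRE workbook; the SLO target and budget fractions below are illustrative, not from the source:

```python
# Alert threshold derived from the CBO's error budget, not from intuition.

SLO = 0.999                 # illustrative 99.9% availability target
ERROR_BUDGET = 1 - SLO      # 0.1% of requests may fail per 30-day period

def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = 30 * 24) -> float:
    """Error rate at which `budget_fraction` of the period's budget
    would burn within `window_hours`."""
    burn_rate = budget_fraction * period_hours / window_hours
    return burn_rate * ERROR_BUDGET

# Page when 2% of the monthly budget would burn within one hour
# (burn rate 14.4x), i.e. at a 1.44% observed error rate.
page_at = burn_rate_threshold(0.02, window_hours=1)
print(f"page if CBO error rate > {page_at:.2%}")
```

An arbitrary "page if error rate > 0.1%" rule would fire at every transient blip; the budget-derived threshold only fires when the SLO is genuinely at risk.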
Seen in¶
- sources/2021-09-20-zalando-tracing-sres-journey-part-ii — Zalando's 2019 alerting-strategy shift, explicitly paired with Adaptive Paging as the routing solution. Public slides at github.com/zalando/public-presentations.
Related¶
- concepts/symptom-vs-cause-metric — the underlying metric-design distinction this strategy is built on.
- concepts/adaptive-paging — the routing primitive that solves the "who do we page" gap.
- concepts/critical-business-operation — the primitive that defines what to alert on.
- concepts/alert-fatigue — the failure mode cause-based alerting accelerates and symptom-based alerting resists.
- concepts/operation-based-slo — the SLO granularity that provides the threshold for symptom alerts.
- concepts/observability
- systems/zalando-adaptive-paging