SYSTEM

ZMON¶

ZMON is Zalando's in-house monitoring system, open-sourced at opensource.zalando.com/zmon. Canonical wiki mentions are scoped to its role as the monitoring pipeline whose TSDB ingestion layer hit capacity during Cyber-Week pre-scaling, motivating the metrics-tiering playbook that Zalando still runs today.

Role¶

ZMON is a check-and-alert + metrics-collection platform; it ingests metrics from deployed applications and writes them to a time-series store for dashboards + alert evaluation. At the time of the incident described in the 2023 retrospective post, ZMON's metrics backend was KairosDB on top of Cassandra. The article implies the metrics store has since been replaced ("this playbook is still in place today, even though we changed our metrics storage") without naming the successor.

The Cyber-Week overload¶

Zalando's pre-scaling of the fleet for Cyber-Week peak traffic produced a multi-factor increase in metrics pushed by individual application instances — each pre-scaled instance emits the same metrics at the same rate, so N× fleet = N× emit volume. The KairosDB + Cassandra ingestion layer couldn't absorb the spike:

"Pre-scaling of the Zalando platform for Cyber Week peak workload resulted in a multi-factor increase in metrics pushed by the individual application instances, resulting in ingestion delays due to Cassandra cluster overload."
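
The linear relationship between fleet size and emit volume can be illustrated with a small back-of-the-envelope calculation. This is only a sketch: the function name `ingest_rate` and every number below (instance counts, series per instance, push interval) are hypothetical and are not taken from the source post.

```python
def ingest_rate(instances: int, series_per_instance: int, interval_s: float) -> float:
    """Datapoints per second arriving at the TSDB ingestion layer.

    Assumes every instance pushes the same series at the same interval,
    which is the simplification the paragraph above relies on.
    """
    return instances * series_per_instance / interval_s


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
baseline = ingest_rate(instances=1_000, series_per_instance=300, interval_s=60)
prescaled = ingest_rate(instances=3_000, series_per_instance=300, interval_s=60)

# Tripling the fleet triples the write load on the store.
ratio = prescaled / baseline
print(f"baseline={baseline:.0f}/s prescaled={prescaled:.0f}/s ratio={ratio:.1f}x")
```

Under these assumptions the ingestion load grows in lockstep with the fleet, so pre-scaling for peak traffic moves the metrics store toward its capacity limit before any customer traffic arrives.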

The mitigation was a three-tier metrics criticality scheme (concepts/metrics-tiering-by-criticality) with a pre-approved playbook to drop tier-3 and tier-2 metrics under TSDB stress — yielding 40% TSDB load reduction with 2-minute MTTR and no business impact (tier-1 dashboards stayed functional).
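
A minimal sketch of how such a tier-based drop could work at the ingestion boundary is shown below. It is not ZMON's implementation: the names `Tier`, `Metric`, `IngestGate`, and `apply_playbook`, as well as the sample metric names, are hypothetical and exist only to illustrate the idea of filtering by criticality tier.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Criticality tiers; lower number means more critical (assumed ordering)."""
    TIER_1 = 1
    TIER_2 = 2
    TIER_3 = 3


@dataclass(frozen=True)
class Metric:
    name: str
    tier: Tier
    value: float


class IngestGate:
    """Hypothetical filter placed in front of the TSDB write path."""

    def __init__(self) -> None:
        # Normal operation: every tier is admitted.
        self.max_tier = Tier.TIER_3

    def apply_playbook(self, keep_up_to: Tier) -> None:
        """Playbook step: admit only metrics at or above the given criticality."""
        self.max_tier = keep_up_to

    def admit(self, metrics: list[Metric]) -> list[Metric]:
        """Return the metrics that may be written under the current setting."""
        return [m for m in metrics if m.tier <= self.max_tier]


batch = [
    Metric("checkout.orders", Tier.TIER_1, 42.0),
    Metric("api.latency.p99", Tier.TIER_2, 0.31),
    Metric("jvm.gc.pause", Tier.TIER_3, 0.02),
    Metric("cache.debug.hits", Tier.TIER_3, 1200.0),
]

gate = IngestGate()
normal = gate.admit(batch)            # all four metrics pass

gate.apply_playbook(Tier.TIER_2)      # first step: drop tier-3
without_tier3 = gate.admit(batch)

gate.apply_playbook(Tier.TIER_1)      # second step: drop tier-2 as well
tier1_only = gate.admit(batch)
```

Because the decision is a single threshold, the playbook step is one configuration change rather than a per-metric judgment made during the incident, which is consistent with the short recovery time the post reports.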

ZMON as an example, not a deep dive¶

The source post treats ZMON as a worked example of incident-playbook management, not as a subject in its own right. The details the wiki captures are:

Ingests metrics from the Zalando fleet.
Backs onto KairosDB + Cassandra (2019–~2023).
Has a three-tier metrics criticality classification.
Has a playbook-driven drop mechanism for overload.
Was migrated off KairosDB some time between the incident and the 2023 post (successor not named).

Seen in¶

Zalando incident playbooks (2023): canonical wiki instance. Example 2 of the two worked playbook examples in the post. Anchors concepts/metrics-tiering-by-criticality + patterns/drop-non-critical-metrics-under-tsdb-overload.

Related¶

concepts/observability — the containing discipline.
concepts/metrics-tiering-by-criticality — the classification ZMON canonicalises on the wiki.
concepts/incident-playbook — the mechanism that mitigates ZMON TSDB overload.
patterns/drop-non-critical-metrics-under-tsdb-overload — the canonical mitigation pattern.
systems/apache-cassandra — the TSDB substrate (via KairosDB).
systems/elasticsearch — Zalando's other heavy observability-store neighbour.
companies/zalando — axis 28.