SYSTEM Cited by 1 source

ZMON

ZMON is Zalando's in-house monitoring system, open-sourced at opensource.zalando.com/zmon. This wiki covers it only in its role as the monitoring pipeline whose TSDB ingestion layer hit capacity during Cyber-Week pre-scaling, which motivated the metrics-tiering playbook that Zalando still runs today.

Role

ZMON is a check-and-alert and metrics-collection platform: it ingests metrics from deployed applications and writes them to a time-series store used for dashboards and alert evaluation. At the time of the incident described in the 2023 retrospective post, ZMON's metrics backend was KairosDB on top of Cassandra. The article implies the metrics store has since been replaced ("this playbook is still in place today, even though we changed our metrics storage") without naming the successor.

The Cyber-Week overload

Zalando's pre-scaling of the fleet for Cyber-Week peak traffic produced a multi-factor increase in metrics pushed by individual application instances. Each pre-scaled instance emits the same metrics at the same rate, so an N× larger fleet means N× the emit volume. The KairosDB + Cassandra ingestion layer couldn't absorb the spike:

"Pre-scaling of the Zalando platform for Cyber Week peak workload resulted in a multi-factor increase in metrics pushed by the individual application instances, resulting in ingestion delays due to Cassandra cluster overload." (sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks)
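
The linear relationship between fleet size and emit volume can be illustrated with a short calculation. This is only a sketch: the instance counts, series per instance, and push interval are hypothetical values chosen for the arithmetic, not figures from the source.

```python
def datapoints_per_second(instances: int, series_per_instance: int,
                          push_interval_s: float) -> float:
    """Data points per second arriving at the ingestion layer.

    Assumes every instance pushes the same number of series at the same interval.
    """
    return instances * series_per_instance / push_interval_s


# Hypothetical numbers: a 3x pre-scaled fleet with unchanged per-instance metrics.
baseline = datapoints_per_second(instances=1_000, series_per_instance=200, push_interval_s=60)
prescaled = datapoints_per_second(instances=3_000, series_per_instance=200, push_interval_s=60)

# The ingest rate grows by the same factor as the fleet.
scale_factor = prescaled / baseline
print(f"baseline={baseline:.0f}/s prescaled={prescaled:.0f}/s factor={scale_factor:.1f}x")
```

Because nothing about the per-instance emit rate changes, the only levers for reducing ingest load during pre-scaling are emitting fewer series or dropping some before they reach the store.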

The mitigation was a three-tier metrics criticality scheme (concepts/metrics-tiering-by-criticality) with a pre-approved playbook for dropping tier-3 and tier-2 metrics under TSDB stress. The reported result was a 40% TSDB load reduction with a 2-minute MTTR and no business impact, since tier-1 dashboards stayed functional.
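
A minimal sketch of how tier-based shedding might look in code, assuming each metric carries a tier label and a filter sits in front of the time-series store. The `Tier`, `Metric`, and `shed` names and the example metric names are hypothetical; the source does not describe how ZMON implements the drop.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical tier labels; a lower number means more critical."""
    TIER_1 = 1  # business-critical dashboards and alerts
    TIER_2 = 2
    TIER_3 = 3


@dataclass(frozen=True)
class Metric:
    name: str
    tier: Tier
    value: float


def shed(metrics: list[Metric], keep_up_to: Tier) -> list[Metric]:
    """Drop every metric less critical than `keep_up_to` before it is written."""
    return [m for m in metrics if m.tier <= keep_up_to]


# Hypothetical batch of incoming metrics, one per tier.
batch = [
    Metric("checkout.orders_per_minute", Tier.TIER_1, 1200.0),
    Metric("catalog.cache_hit_ratio", Tier.TIER_2, 0.93),
    Metric("jvm.gc_pause_ms", Tier.TIER_3, 41.0),
]

# Playbook activated: tier-3 and tier-2 are dropped, tier-1 is still written.
survivors = shed(batch, keep_up_to=Tier.TIER_1)
print([m.name for m in survivors])
```

Keeping the decision to a single threshold is what makes a pre-approved playbook practical: the operator changes one setting during the incident instead of deciding metric by metric.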

ZMON as an example, not a deep dive

The source post treats ZMON as a worked example of incident-playbook management, not as a subject in its own right. The details the wiki captures are:

  • Ingests metrics from the Zalando fleet.
  • Backs onto KairosDB + Cassandra (2019–~2023).
  • Has a three-tier metrics criticality classification.
  • Has a playbook-driven drop mechanism for overload.
  • Was migrated off KairosDB some time between the incident and the 2023 post (successor not named).

Seen in

  • sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks
