
Metrics tiering by criticality

Metrics tiering by criticality is the observability-layer discipline of classifying every emitted metric into a priority tier (typically three tiers) so that an overloaded metrics pipeline — time-series database, ingestion path, query layer — can drop lower-tier metrics on demand to keep the highest-tier metrics flowing. It is the observability analogue of service-tier classification, applied to telemetry instead of services.

Zalando's canonical instance

From the ZMON playbook example:

"Pre-scaling of the Zalando platform for Cyber Week peak workload resulted in a multi-factor increase in metrics pushed by the individual application instances, resulting in ingestion delays due to Cassandra cluster overload. To mitigate similar incidents, we developed a tiering system with three criticality tiers for the metrics, so that in case of overload of the TSDB, we could still ingest the most important metrics necessary to plot essential dashboards required to monitor the Cyber Week event."sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks

Three tiers:

  • Tier 1 — the metrics Cyber-Week dashboards depend on. Load-bearing for incident response itself; dropping them blinds the Situation Room.
  • Tier 2 — useful operational metrics; dropped second.
  • Tier 3 — nice-to-have / long-tail / detailed breakdown metrics; dropped first.

Under the playbook trigger ("Metrics Ingestion SLO is at risk of being breached"), tier-3 and tier-2 ingestion is disabled via a configuration update, with an MTTR of 2 minutes and a 40% load reduction on the metrics TSDB. Business impact: none; dashboards continue working on tier-1 metrics.
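
A minimal sketch of the drop rule, with hypothetical names (the post doesn't show ZMON's implementation). Lower tier number means more critical; the mitigation lowers a drop threshold so that everything at or above it is shed:

    # Hypothetical sketch of the tier drop rule; not ZMON's actual code.
    # Tier 1 = most critical, tier 3 = least critical.
    # Normal operation: threshold sits above the highest tier, nothing drops.
    DROP_TIER_THRESHOLD = 4

    def should_ingest(metric_tier: int, drop_threshold: int = DROP_TIER_THRESHOLD) -> bool:
        # Metrics at or above the threshold are dropped. The playbook
        # mitigation sets drop_threshold = 2: tier-2 and tier-3 ingestion
        # stops, and only tier-1 metrics are processed.
        return metric_tier < drop_threshold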

The architectural structure

The tier is classification metadata attached to the metric at emit-time or registration-time. At ingestion, the metrics pipeline reads the tier attribute and, under a configuration flag, drops metrics at or above the configured drop-tier threshold. Three common realisations:

  1. Metric-name-prefix convention — tier-3 metrics live under a namespace prefix (e.g. tier3.<service>.<metric>); the pipeline drops by prefix. Cheap, requires discipline from emitters.
  2. Metric-registry metadata — each registered metric has an explicit tier field in the registry; pipeline looks it up. Needs a central registry but supports dynamic re-tiering.
  3. Sampled-at-source — the emitter itself checks a tier flag and stops emitting under pressure. Moves cost to the emitter, avoids the pipeline reading every metric to drop most of them.

Zalando's post doesn't disclose which realisation it uses; for the concept, it's sufficient that the tier is attached to the metric.
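
As a hedged illustration, here is a minimal sketch of the first two realisations; all names are hypothetical, since the post doesn't show ZMON's mechanism:

    # Hypothetical sketches; names and structure are illustrative, not ZMON's.

    # Realisation 1: tier encoded in a name prefix, e.g. "tier3.checkout.cache_hits".
    # The pipeline can drop by string prefix without any lookup.
    def tier_from_prefix(metric_name: str, default: int = 2) -> int:
        prefix = metric_name.split(".", 1)[0]  # e.g. "tier3"
        if prefix.startswith("tier") and prefix[4:].isdigit():
            return int(prefix[4:])
        return default  # unprefixed metrics fall into a default tier

    # Realisation 2: explicit tier field in a central registry, which
    # supports re-tiering a metric without renaming it.
    REGISTRY = {
        "checkout.orders_per_second": 1,  # tier 1: dashboard-critical
        "checkout.latency_p99": 2,        # tier 2: useful operational
        "checkout.cache_hits_by_sku": 3,  # tier 3: long-tail breakdown
    }

    def tier_from_registry(metric_name: str, default: int = 2) -> int:
        return REGISTRY.get(metric_name, default)

The default tier for unclassified metrics is itself a design choice: defaulting to tier 2, as sketched, keeps unclassified metrics alive under a tier-3-only drop but sheds them before the essentials.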

Why three tiers, not more

Three is a common default:

  • Two tiers collapse the dropping decision to "essential vs everything else", which loses the intermediate option — you can't drop just the long-tail without also dropping the useful operational set.
  • Four or more tiers increase classification cost per new metric without much practical gain; the fourth tier rarely differs operationally from tier-3.
  • Three tiers give the intermediate "drop tier-3 only" option, which is often enough to regain headroom without impairing operator visibility.

Other three-tier schemes on the wiki (service-tier classification, request-priority classes) converge on the same number.

Versus request-level load shedding

Metrics tiering is at the telemetry layer; request-level load shedding is at the serving layer. They compose:

  • Load shedding protects the service under request-load overload.
  • Metrics tiering protects the metrics pipeline under metrics-emission overload.

Both apply the same least-impact-first principle (drop tier-3 before tier-1), operating on different substrates.

The playbook connection

The mitigation is executed via a pre-approved playbook:

"Example playbook for ZMON — Title: Drop non-critical metrics due to TSDB overload — Trigger: Metrics Ingestion SLO is at risk of being breached — MTTR: 2 minutes after updating configuration — Operational Health Impact: Loss of tier-3 and tier-2 metrics. Only tier-1 metrics are processed, leading to 40% load reduction on the metrics TSDB — Business Impact: None."

Canonicalised as patterns/drop-non-critical-metrics-under-tsdb-overload. The combination of tiering the metrics at emit time and a pre-approved playbook that flips the drop configuration is the operational shape of metrics tiering.
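
A hedged sketch of what the configuration flip could look like; the actual ZMON change isn't shown in the post, so the file format and key name are assumptions:

    # Hypothetical configuration flip executed from the pre-approved playbook.
    import json

    def apply_drop_playbook(config_path: str, drop_threshold: int = 2) -> None:
        """Drop all metrics at or above drop_threshold at ingestion.

        drop_threshold=2 reproduces the playbook's effect: tier-2 and
        tier-3 ingestion stops, while tier-1 dashboards keep working.
        """
        with open(config_path) as f:
            config = json.load(f)
        config["drop_tier_threshold"] = drop_threshold
        with open(config_path, "w") as f:
            json.dump(config, f, indent=2)

The point of the pre-approval is that the flip needs no incident-time deliberation: the 2-minute MTTR is the time to apply the configuration, not to decide on it.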

Longevity: tiering outlives its original TSDB

Zalando notes: "This playbook is still in place today, even though we changed our metrics storage." The tiering metadata is independent of the backing store — a new TSDB backend can consume the same tier labels and drop the same low-tier metrics. The mitigation generalises across storage migrations.
