PATTERN

Drop non-critical metrics under TSDB overload

Intent

Mitigate time-series-database (TSDB) overload, whether driven by traffic spikes, cluster degradation, or the multiplicative fan-out of pre-scaling, by dropping metrics in tier order (tier-3 first, tier-2 next, tier-1 never). This reduces ingest load while preserving the observability signal required for incident response.

Context

Metric emit rates scale with deployed fleet size. During Cyber Week pre-scaling in Zalando's 2019 incident:

"Pre-scaling of the Zalando platform for Cyber Week peak workload resulted in a multi-factor increase in metrics pushed by the individual application instances, resulting in ingestion delays due to Cassandra cluster overload."sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks

The failure mode is structural: the monitoring system you need to watch the event is being degraded by the event you're watching. If all metrics are equally important, the only mitigations are expensive (scale the TSDB fleet horizontally mid-event, which takes time), or you accept the outage.

Solution

  1. Classify every emitted metric into a criticality tier at registration / emit time (concepts/metrics-tiering-by-criticality). Typically three tiers:
     • Tier 1 — load-bearing for incident response (CBO dashboards, Cyber-Week monitoring).
     • Tier 2 — useful operational metrics.
     • Tier 3 — long-tail / detailed-breakdown / debugging metrics.
  2. Expose a configuration flag that tells the metrics pipeline the current drop threshold (e.g. drop_threshold: 3 means drop tier-3 metrics; drop_threshold: 2 drops tiers 2 and 3). A sketch of such a filter follows this list.
  3. Write a pre-approved playbook that flips the flag:
     • Title: "Drop non-critical metrics due to TSDB overload".
     • Trigger: metrics-ingestion SLO at risk of breach.
     • Operational impact: loss of tier-3 (and tier-2 if needed) metrics; tier-1 continues.
     • Business impact: none (downstream business users don't consume tier-3 metrics directly).
     • MTTR: 2 minutes (config update + pipeline picks up the change).
  4. Apply least-impact-first ordering: drop tier-3 only, re-evaluate; drop tier-2 only if needed; never drop tier-1.
  5. Revert after the incident. Once TSDB headroom returns, flip the flag back.
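
A minimal sketch of steps 1-2, assuming a Python pipeline stage and a local JSON flag file; the config path, field names, and helpers (Metric, filter_batch) are illustrative, not Zalando's or ZMON's actual implementation:

# Tier-aware drop filter in front of the TSDB write path (illustrative sketch).
import json
from dataclasses import dataclass
from pathlib import Path

CONFIG_PATH = Path("/etc/metrics-pipeline/drop_threshold.json")  # hypothetical flag location

@dataclass
class Metric:
    name: str
    value: float
    tier: int  # 1 = load-bearing for incident response ... 3 = long-tail / debugging

def current_drop_threshold(default: int = 4) -> int:
    # Re-read the flag on every batch so a config flip takes effect without a
    # restart (restart-to-apply would defeat the 2-minute MTTR).
    try:
        return int(json.loads(CONFIG_PATH.read_text())["drop_threshold"])
    except (OSError, KeyError, ValueError):
        return default  # 4 = "drop nothing" when the flag is absent or malformed

def filter_batch(batch: list[Metric]) -> list[Metric]:
    # Drop every metric whose tier number is >= the threshold: threshold 3 sheds
    # tier-3 only, threshold 2 sheds tiers 2 and 3; the playbook never sets a
    # threshold that would shed tier-1.
    threshold = current_drop_threshold()
    return [m for m in batch if m.tier < threshold]

if __name__ == "__main__":
    batch = [
        Metric("checkout.request_rate", 812.0, tier=1),
        Metric("checkout.cache_hit_ratio", 0.93, tier=2),
        Metric("checkout.per_sku_render_ms", 41.0, tier=3),
    ]
    # With drop_threshold: 3 in the config file, only the tier-3 metric is shed.
    print([m.name for m in filter_batch(batch)])

Re-reading the flag per batch (rather than caching it at startup) is what makes the config flip the whole mitigation; in practice the threshold would more likely come from a config service or feature flag than a local file.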

Benefits

  • Immediate operational relief without a TSDB scale-out. Zalando observed a 40% TSDB load reduction in the Cyber Week case.
  • Zero business impact. Tier-1 metrics (the ones powering incident dashboards) keep flowing; the people watching don't notice a change in their essential signal.
  • Self-contained mitigation. No customer-facing feature is affected; no cross-team coordination required.
  • Longevity across storage migrations. Zalando notes: "This playbook is still in place today, even though we changed our metrics storage." The tier tags are storage-agnostic.
  • Doubles as a capacity-planning signal. If the playbook fires often, the tier-1 metric rate is approaching the TSDB's capacity ceiling; schedule a capacity investment.

Costs and caveats

  • Needs tier classification up front. Adding tier metadata to every metric is either a manual discipline (new metric → author picks a tier) or a default-tier rule (unclassified = tier-3; explicit promotion required; see the sketch after this list). Neither is free; consistency matters.
  • Tier-2 loss is not zero impact for all operators. Individual engineers may use tier-2 metrics for their own service's dashboards. A playbook trigger of "drop tier-2" affects those operators — they lose signal during the incident. The business impact is "none for the business", not "none for engineers".
  • Drop at the right layer. If the bottleneck is the ingestion path rather than storage, dropping at the storage-side edge of the pipeline buys nothing; you need to drop at the emitter. If the bottleneck is storage, dropping after ingest is enough. Inspect the bottleneck before choosing where to drop.
  • Requires a config-flip pipeline. The pipeline must read the drop-threshold dynamically; restart-to-apply defeats the MTTR.
  • Cardinality explosion is separate. This pattern addresses volume, not cardinality. A sudden explosion in unique label combinations requires a different mitigation (label rewrites, high-cardinality detection).
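
A minimal sketch of the default-tier rule described above; the registry and function names are illustrative assumptions, not an existing library API:

# Unclassified metrics default to tier 3; promotion to tier 1 or 2 is an
# explicit act by the metric's author.
METRIC_TIERS: dict[str, int] = {}  # metric name -> criticality tier

def register_metric(name: str, tier: int | None = None) -> None:
    if tier is not None and tier not in (1, 2, 3):
        raise ValueError(f"unknown tier {tier!r} for {name}")
    # Forgetting the metadata can only make a metric *more* droppable.
    METRIC_TIERS[name] = 3 if tier is None else tier

register_metric("checkout.request_rate", tier=1)     # load-bearing dashboard metric
register_metric("checkout.cache_hit_ratio", tier=2)  # useful operational metric
register_metric("checkout.per_sku_render_ms")        # unclassified -> tier 3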

Known uses

  • Zalando ZMON, 2019-present — three-tier metrics scheme on ZMON's KairosDB-on-Cassandra TSDB; pre-approved playbook with 2-minute MTTR and 40% TSDB load reduction. Introduced during Cyber-Week pre-scaling; still in place after the storage layer was replaced. Canonical wiki instance. (sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks)

Anti-patterns

  • No tier classification — if every metric is equally important, the only mitigation is scaling, and emergency scaling is slow.
  • Tier-1 gets dropped — defeats the point; tier-1 is defined as "we need this to run the incident". If tier-1 gets dropped, tier-1 was miscategorised.
  • Permanent drop — the playbook is a mitigation, not a steady state. Metrics should be restored post-incident to preserve long-term observability.