

Heartbeat absence as alert trigger

Pattern

Emit a continuous heartbeat through the entire alerting pipeline to an external channel, and page when the rate of heartbeats drops to zero — not when any specific condition is breached. The absence of signal is the signal.

This is the dead-man's-switch mechanism expressed as a reusable architectural pattern.

Why this is structurally stronger than condition-triggered alerts

Condition-triggered alerts ("error rate > X", "p99 > Y") require the alert pipeline to be functioning when the alert needs to fire. If the pipeline is broken, those alerts cannot fire — and they cannot tell you they didn't fire.

Heartbeat-absence alerts invert this. The pipeline fires continuously during healthy operation. A break anywhere in the chain — scrape stalled, rule evaluation stopped, delivery blocked, egress lost — causes the heartbeat to stop, and the external watchdog pages on silence.

The watchdog doesn't need to know where the break is. It fires on any failure of the chain it doesn't control.

Three-component shape

┌──────────────────┐
│ monitored stack  │
│ (observability)  │
│                  │
│ always-firing    │
│ alert rule       │
│       │          │
│       ▼          │
│ Alertmanager     │
└────────┬─────────┘
         │ continuous heartbeat
         │ (via network)
┌──────────────────┐
│ external channel │   ← different control plane
│ (e.g., AWS SNS)  │
└────────┬─────────┘
         │
┌──────────────────┐
│ rate watchdog    │
│ (e.g., CloudWatch│
│ alarm on msg/sec)│
└────────┬─────────┘
         │ pages when rate → 0
      on-call

Three components:

  1. Always-firing rule inside the monitored stack. It is engineered to fire whenever the stack is minimally healthy ("Prometheus is scraping correctly").
  2. Delivery to an external channel that is expected to be more stable — typically a cloud-managed service on a different control plane.
  3. Rate-based watchdog at the destination that alarms on absence.
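
A minimal sketch of the third component, assuming boto3, heartbeats arriving as publishes to an SNS topic, and CloudWatch's built-in AWS/SNS NumberOfMessagesPublished metric. The topic name, alarm name, paging ARN, and window values are illustrative placeholders. The one non-obvious setting is TreatMissingData: total silence produces no datapoints at all, so missing data has to count as breaching or the alarm never fires on a complete outage.

# Sketch of a rate watchdog: alarm when SNS publishes drop to zero.
# All names, ARNs, and window values below are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="observability-heartbeat-absent",
    Namespace="AWS/SNS",
    MetricName="NumberOfMessagesPublished",
    Dimensions=[{"Name": "TopicName", "Value": "observability-heartbeat"}],
    Statistic="Sum",
    Period=60,                 # one-minute buckets
    EvaluationPeriods=5,       # five consecutive silent minutes -> alarm
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    # Silence means *no* datapoints, not zero-valued ones, so missing data
    # must be treated as breaching for the watchdog to fire on total failure.
    TreatMissingData="breaching",
    AlarmDescription="Heartbeat from the monitored observability stack stopped",
    AlarmActions=["arn:aws:sns:us-east-1:111111111111:page-oncall"],  # placeholder ARN
)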

Canonical instance (Airbnb)

From the 2026-05-05 post:

"We maintain an alerting rule that always fires as long as Prometheus is scraping correctly. Alertmanager continuously sends these alerts to an external AWS SNS topic, and a CloudWatch alarm monitors the rate of incoming messages. If they stop — because Prometheus is down, scraping has stalled, Alertmanager can't send, or something else has degraded — the CloudWatch alarm triggers and on-call is paged."

The three Airbnb components:

  1. Prometheus always-firing alert rule.
  2. SNS topic (an AWS-managed service on a different control plane from the K8s-hosted observability stack).
  3. CloudWatch alarm on SNS message rate.

Substrate requirements

  • External channel on a different control plane. Sending the heartbeat to another internal service defeats the purpose if both depend on the same infrastructure.
  • Rate-based alarm at the destination. The destination must be capable of evaluating "did I receive N messages in the last T seconds?". CloudWatch, Datadog, PagerDuty integrations, and most cloud-native monitoring services support this shape.
  • Reliable delivery to the external channel. Typically HTTPS with retry, or managed SDK integrations.
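
As a concrete illustration of the last requirement, the sketch below publishes a heartbeat through the AWS SDK with explicit retry configuration. It is a stand-in only: in the Airbnb chain Alertmanager performs the delivery, and the topic ARN, interval, and retry settings here are assumptions rather than anything from the post. A permanent synthetic publisher would also defeat the pattern, since it keeps the watchdog fed even when the real chain is broken; it is only useful for exercising the SNS → CloudWatch half while setting it up.

# Stand-in heartbeat publisher illustrating SDK delivery with retries.
# Not Airbnb's mechanism (Alertmanager does this there); placeholders throughout.
import json
import time

import boto3
from botocore.config import Config

sns = boto3.client(
    "sns",
    region_name="us-east-1",
    # botocore's standard retry mode retries throttles and transient errors
    config=Config(retries={"mode": "standard", "max_attempts": 5}),
)

TOPIC_ARN = "arn:aws:sns:us-east-1:111111111111:observability-heartbeat"  # placeholder

while True:
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject="heartbeat",
        Message=json.dumps({"source": "synthetic-exerciser", "sent_at": time.time()}),
    )
    time.sleep(30)  # must be comfortably faster than the watchdog's bucket size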

Tuning surface

  • Heartbeat cadence. Faster beat → lower detection latency, more external-channel writes. Airbnb's specific cadence is not disclosed.
  • Rate-alarm evaluation window. Too short → transient network blips page spuriously; too long → detection latency grows. Balance is workload-specific.
  • Alarm threshold. Expressed as "fewer than N messages in window W". Typically set to catch any break but tolerate a single dropped message.
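
A back-of-the-envelope way to see how these knobs trade off, with hypothetical numbers (the post discloses none of them): worst-case detection latency is roughly one bucket finishing out plus the full evaluation window.

# Rough worst-case detection latency for a rate watchdog built from
# `evaluation_periods` buckets of `period_s` seconds each.
# Hypothetical illustration; none of these numbers come from the post.

def worst_case_detection_s(period_s: int, evaluation_periods: int) -> int:
    # The bucket containing the final heartbeat still counts as healthy, so the
    # alarm has to wait for that bucket to close plus `evaluation_periods`
    # fully silent buckets (the watchdog's own evaluation delay is ignored).
    return period_s * (evaluation_periods + 1)

# Five one-minute buckets -> up to ~6 minutes from break to page.
print(worst_case_detection_s(period_s=60, evaluation_periods=5))

# The heartbeat interval must stay well under period_s, or healthy operation
# produces empty buckets and the watchdog pages spuriously.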

Composes with meta-monitoring

The dead-man's-switch (DMS) pattern is specifically what terminates the meta-monitoring regress. Without it, you'd need infinitely many layers of monitors watching monitors.

With it:

  • Primary observability stack watches the product.
  • Meta-monitoring layer watches the primary observability stack.
  • DMS watches the whole chain, using heartbeat absence at the external channel as the trigger.

Three layers is enough because the third layer lives on infrastructure distinct from the first two.

Failure modes

  • External channel failure. If the external channel is itself down, the watchdog can't receive the heartbeat and fires a false alarm. Rare, but if the external channel is flaky, the DMS becomes noisy.
  • Threshold too tight. Transient delivery blips trigger spurious pages.
  • Threshold too loose. Real silence takes too long to detect.
  • Heartbeat rule becomes stale. If someone changes the alert rule and breaks the "always-firing" invariant, the DMS fires immediately. That is the correct (noisy) direction to fail in, but guarding the invariant is a reliability-testing story at the PR-review tier, not something to discover from the page; see the sketch after this list.
  • Over-reliance on DMS. The DMS tells you "something broke". You still need the monitored stack's own metrics for diagnosis once it's restored.
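
One hedged way to guard the always-firing invariant is to assert, against a staging or freshly deployed Prometheus, that the heartbeat alert is actually active. The URL, the alert name, and the existence of such a check are all assumptions; the post does not describe how the rule is tested.

# Check that the heartbeat alert is active via the Prometheus HTTP API.
# PROM_URL and HEARTBEAT_ALERT are placeholders, not Airbnb's values.
import sys

import requests

PROM_URL = "http://prometheus.example.internal:9090"  # placeholder
HEARTBEAT_ALERT = "ObservabilityHeartbeat"            # placeholder alert name

resp = requests.get(f"{PROM_URL}/api/v1/alerts", timeout=10)
resp.raise_for_status()
alerts = resp.json()["data"]["alerts"]

firing = [
    a for a in alerts
    if a["labels"].get("alertname") == HEARTBEAT_ALERT and a["state"] == "firing"
]

if not firing:
    print(f"{HEARTBEAT_ALERT} is not firing; the heartbeat chain starts out broken")
    sys.exit(1)

print(f"{HEARTBEAT_ALERT} is firing as expected")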

Caveats

  • The external channel becomes a trust anchor. If the external-channel provider has an incident, the DMS is silent. Airbnb's choice of AWS SNS + CloudWatch makes AWS the trust anchor; this risk is not addressed in the post.
  • Latency of detection is bounded below by the rate-alarm window. Very fast detection needs very tight windows and tight heartbeat cadences.
  • Cost scales with heartbeat rate. Each heartbeat is an SNS publish, so spend grows with cadence. Airbnb does not disclose its cadence or the resulting cost.

Seen in

  • sources/2026-05-05-airbnb-monitoring-reliably-at-scale — canonical wiki instance. Airbnb's meta-monitoring chain terminates in an always-firing Prometheus rule → Alertmanager → AWS SNS → CloudWatch rate-alarm → on-call page. External AWS control plane distinct from the K8s-hosted observability stack it watches.