Dead man's switch¶
Definition¶
A dead man's switch (DMS) is a monitoring primitive where a continuously-firing heartbeat is delivered to an external channel, and the absence of that heartbeat — not its presence — is what triggers the alert. The term is borrowed from locomotive safety: if the operator lets go of the handle (becomes incapacitated), the train stops.
In observability:
"A mechanism that sends a steady signal. The recipient of the signal can assume something is wrong when the signal disappears." (Source: sources/2026-05-05-airbnb-monitoring-reliably-at-scale)
Why absence-of-signal is structurally stronger than presence-of-signal¶
A normal alert fires on a specific condition (error rate > 5%, p99 > 1s, etc.). For it to page, three things must work: the scrape must succeed, the rule must evaluate, and the delivery path must be up. If the monitored system's metrics stop arriving, the alert cannot fire — but it also cannot tell you that.
A DMS inverts this. The alert fires continuously as long as the full pipeline is healthy. The external watchdog's condition becomes: "the heartbeat stopped — something broke somewhere in the pipeline." The watchdog doesn't need to know where it broke — scrape failure, rule evaluation failure, delivery failure, network partition, node crash — any silence is alarming.
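A minimal sketch of the inversion in Python (illustrative only, not Airbnb's implementation; the names and the 120-second timeout are hypothetical): the watchdog's only state is the time of the last heartbeat it received, and its only condition is "too long since the last one".

```python
import time

HEARTBEAT_TIMEOUT_S = 120  # hypothetical: heartbeat interval plus some slack


class Watchdog:
    """External watchdog: pages when heartbeats stop, not when a condition is met."""

    def __init__(self, timeout_s: float = HEARTBEAT_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self) -> None:
        # Called each time the always-firing alert reaches the external channel.
        self.last_heartbeat = time.monotonic()

    def should_page(self) -> bool:
        # Silence anywhere in the pipeline shows up here as an overdue heartbeat.
        return time.monotonic() - self.last_heartbeat > self.timeout_s
```

Contrast with a normal alert: a condition like error_rate > 0.05 goes silent along with its inputs, whereas should_page() needs no input at all in order to fire.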
Terminates the meta-monitoring regress¶
concepts/meta-monitoring runs into a tower-of-turtles problem: if the meta-monitoring layer fails, who watches the watcher? Airbnb's post names the tension explicitly:
"How do we know if the meta-monitoring layer is down? Spinning up yet another monitoring stack would just lead to an infinite regress."
The DMS terminates the regress by moving the final watchdog onto infrastructure the monitored system does not run on — typically a cloud-provider managed service. The assumption is not that the provider is infallible, but that its failure modes are uncorrelated with the ones being guarded against.
Canonical implementation (Airbnb)¶
- An always-firing alerting rule in Prometheus: "we maintain an alerting rule that always fires as long as Prometheus is scraping correctly."
- Delivered continuously via Alertmanager to an external sink: "Alertmanager continuously sends these alerts to an external AWS SNS topic."
- A rate alarm on the sink: "a CloudWatch alarm monitors the rate of incoming messages. If they stop — because Prometheus is down, scraping has stalled, Alertmanager can't send, or something else has degraded — the CloudWatch alarm triggers and on-call is paged."
The key architectural property: SNS + CloudWatch are on AWS's managed control plane. Airbnb's observability stack runs on (dedicated) Kubernetes. A failure of the observability stack does not affect the watchdog's ability to fire.
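The post quotes the behaviour but not the configuration. On the Prometheus side, an always-firing rule is conventionally a constant expression such as vector(1); the post does not show Airbnb's exact rule. On the sink side, a plausible sketch of the rate alarm via boto3, assuming it watches the heartbeat topic's NumberOfMessagesPublished metric (the actual metric, names, threshold, and period are not disclosed):

```python
# Hypothetical sketch of the watchdog alarm on the external sink.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="observability-dead-mans-switch",  # hypothetical
    Namespace="AWS/SNS",
    MetricName="NumberOfMessagesPublished",      # heartbeats arriving on the SNS topic
    Dimensions=[{"Name": "TopicName", "Value": "observability-heartbeat"}],  # hypothetical topic
    Statistic="Sum",
    Period=300,                                  # count heartbeats in 5-minute buckets
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",      # fewer than one heartbeat in the window
    TreatMissingData="breaching",                # no datapoints at all also trips the alarm
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:page-oncall"],  # hypothetical pager topic
)
```

The low threshold plus TreatMissingData="breaching" is what turns a rate alarm into an absence detector; the real values trade detection latency against sensitivity to transient delivery blips (see Caveats).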
What the DMS actually tests¶
Because the heartbeat traverses the entire alerting pipeline, a healthy DMS implicitly validates:
- Prometheus scraping (the rule can only fire if Prometheus is ingesting data)
- Rule evaluation (the rule engine is running)
- Alertmanager delivery (the alert must be routed externally)
- Network egress to the external channel
- The external channel's ingestion
Any break in that chain silences the heartbeat. The watchdog doesn't need per-component visibility — it just pages when the chain breaks anywhere.
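The chain behaves like a logical AND: the heartbeat reaches the sink only if every stage forwards it. A toy sketch of that property (the stage names mirror the list above; none of this is a real API):

```python
# Toy model: breaking any single stage silences the heartbeat at the sink.
PIPELINE = ["scrape", "rule_evaluation", "alertmanager_delivery", "network_egress", "sink_ingestion"]

def heartbeat_reaches_sink(broken: set[str]) -> bool:
    # One failure anywhere in the chain means silence at the external watchdog.
    return all(stage not in broken for stage in PIPELINE)

assert heartbeat_reaches_sink(set())              # healthy pipeline: heartbeat flows
for stage in PIPELINE:
    assert not heartbeat_reaches_sink({stage})    # any single break: silence, page
```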
Adjacent concept on the wiki¶
The wiki already has concepts/heartbeat-based-replication-lag-measurement — which uses the presence of ordered heartbeats to infer lag. A DMS is the dual: absence of any heartbeat triggers the signal. Both rely on the same underlying primitive (a regularly-emitted beat), but one watches for position drift and the other for silence.
When a DMS alone is not enough¶
A DMS tells you "something in the observability pipeline stopped". It does not tell you what. You need:
- Per-component metrics inside the stack (Prometheus' up{}, Alertmanager cluster status, node health) for root-cause triage once the DMS has fired.
- A runbook that tells on-call: "DMS fired — work backwards along the chain: CloudWatch → SNS → Alertmanager → Prometheus."
The DMS is the "is anything wrong?" detector, not the "what is wrong?" diagnostic.
Caveats¶
- The external channel becomes the new trust anchor. Airbnb externalises to AWS. If SNS + CloudWatch themselves fail, the DMS is silent. The Airbnb post does not discuss this residual risk.
- Threshold sensitivity. Too-aggressive rate alarms page on transient SNS delivery blips; too-loose thresholds extend detection latency. The Airbnb post does not disclose the threshold.
- DMS does not replace specific alerts. It catches the case "alerts would have fired but couldn't" — not "a specific thing is wrong with the production system being watched." Both kinds of alerts are needed.
Seen in¶
- sources/2026-05-05-airbnb-monitoring-reliably-at-scale — canonical wiki instance. Airbnb terminates its meta-monitoring chain with an always-firing Prometheus rule → Alertmanager → AWS SNS topic → CloudWatch rate alarm → on-call page. The external AWS control plane is the independent fault domain that makes the watchdog robust to observability-stack outages.
Related¶
- concepts/meta-monitoring — the parent discipline the DMS terminates.
- concepts/observability
- concepts/circular-dependency
- concepts/heartbeat-based-replication-lag-measurement — adjacent heartbeat primitive, inverse semantics.
- patterns/heartbeat-absence-as-alert-trigger — the reusable pattern this concept names.
- systems/alertmanager
- systems/aws-sns
- systems/aws-cloudwatch