CONCEPT Cited by 1 source
Intermittent failure signal confusion¶
An incident pattern where the shape of the failure signal (intermittent, oscillating, spiky) mimics an external attack more than an internal bug โ leading responders to investigate the wrong hypothesis for the early minutes of the incident.
Why it happens¶
Most internal bugs are monotone: either always-failing or always-working. Monotonic failure + monotonic recovery is the debugging profile an on-call engineer has seen a thousand times.
Oscillating failure is rare for internal bugs and common for external attacks: rate-controlled probing, traffic fluctuation under DDoS mitigation, bot armies cycling IP ranges. The responder's prior on "intermittent = attack" is well-calibrated against the usual distribution of incidents.
The failure mode: a gradual internal rollout that produces an oscillation indistinguishable from an external probe.
Canonical instance¶
sources/2025-11-18-cloudflare-outage-on-november-18-2025 โ Cloudflare's ClickHouse permission migration was rolling out gradually across cluster nodes. Bot Management's feature-file generator ran every 5 minutes; some runs hit a migrated node (bad file), others hit a non-migrated node (good file). The global network oscillated good/bad/good/bad on a ~5-minute period. Cloudflare's status page went down coincidentally at the same time (hosted off-Cloudflare, independent failure), deepening the attack suspicion. Teams spent roughly the first 40 minutes on the DDoS hypothesis before identifying Bot Management as the cause.
From the post: "This fluctuation made it unclear what was happening as the entire system would recover and then fail again ... Initially, this led us to believe this might be caused by an attack."
Mitigations¶
- Internal-change awareness in on-call dashboards. Surface in-flight rollouts alongside incident signals.
- Correlation on change-event timestamps. If the failure onset is within seconds of a config-system push or database migration, weight the internal-bug hypothesis above the attack hypothesis.
- Kill switches independent of root-cause identification. A global feature killswitch that can take a suspect module out of the hot path without knowing which hypothesis is correct.
Seen in¶
- sources/2025-11-18-cloudflare-outage-on-november-18-2025 โ canonical wiki instance.