CONCEPT Cited by 2 sources
Fail-open vs fail-closed¶
A design choice for what a module does when its input is corrupt, out-of-range, or fails an invariant:
- Fail-closed — refuse to serve. Return 5xx / drop the request / panic the worker. Safer in security contexts (default-deny); dangerous in availability contexts (a single bad input takes out every request that reaches the module).
- Fail-open — log the error, fall back to a known-good prior state or pass traffic without scoring, continue serving. Safer in availability contexts; dangerous in security contexts if the module was the only thing enforcing a policy.
The choice is not universal; different modules on the same hot path can and should differ. The discipline is to make the choice explicitly per module, not by accident.
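One way to make the choice explicit per module is to have every module declare its failure policy as part of its interface. The following is a minimal sketch under that assumption: `FailurePolicy`, `Decision`, `Module`, `BotScorer`, `AuthCheck`, and `run_module` are hypothetical names for illustration, not anything from the cited posts.

```rust
/// What the pipeline does when a module's own evaluation fails.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailurePolicy {
    /// Refuse to serve (default-deny).
    Closed,
    /// Log the error and keep serving without this module's verdict.
    Open,
}

/// Outcome handed back to the request pipeline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Decision {
    Allow,
    Block,
    /// The module could not decide and its policy says to refuse.
    ServerError,
}

/// Hypothetical module interface: a module cannot exist without
/// stating which way it fails.
trait Module {
    fn name(&self) -> &'static str;
    fn policy(&self) -> FailurePolicy;
    fn evaluate(&self, input: &str) -> Result<Decision, String>;
}

/// Scoring module: availability matters more, so it fails open.
struct BotScorer;

impl Module for BotScorer {
    fn name(&self) -> &'static str { "bot_scorer" }
    fn policy(&self) -> FailurePolicy { FailurePolicy::Open }
    fn evaluate(&self, input: &str) -> Result<Decision, String> {
        if input.is_empty() {
            return Err("empty input".to_string());
        }
        Ok(if input.contains("bot") { Decision::Block } else { Decision::Allow })
    }
}

/// Authentication module: the only policy enforcer, so it fails closed.
struct AuthCheck;

impl Module for AuthCheck {
    fn name(&self) -> &'static str { "auth_check" }
    fn policy(&self) -> FailurePolicy { FailurePolicy::Closed }
    fn evaluate(&self, input: &str) -> Result<Decision, String> {
        if input.is_empty() {
            return Err("missing token".to_string());
        }
        Ok(if input == "valid-token" { Decision::Allow } else { Decision::Block })
    }
}

/// The single place where a module error turns into a serving decision.
fn run_module(module: &dyn Module, input: &str) -> Decision {
    match module.evaluate(input) {
        Ok(decision) => decision,
        Err(err) => {
            eprintln!("module {} failed: {}", module.name(), err);
            match module.policy() {
                FailurePolicy::Closed => Decision::ServerError,
                FailurePolicy::Open => Decision::Allow,
            }
        }
    }
}
```

With this shape, `run_module(&BotScorer, "")` keeps serving while `run_module(&AuthCheck, "")` refuses, and both outcomes are visible in the module definitions rather than buried at a call site.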
The implicit-fail-closed trap¶
Many crashes are fail-closed by default, not by design:
- `.unwrap()` on a Rust `Result` that holds an `Err` panics the worker.
- A nil-index in Lua throws an exception the request handler doesn't catch.
- An assertion in C++ aborts the process.
The programmer chose a terse syntax; the runtime chose fail-closed. The architecture never explicitly chose.
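The difference between the implicit and the explicit version can be small at the call site. The sketch below contrasts the two for a size-capped feature file; `MAX_FEATURES`, `parse_features`, `load_implicit`, `load_explicit`, and `LoadOutcome` are illustrative names, and the cap value is an assumption rather than a figure taken from the posts.

```rust
/// Hypothetical cap on the number of features in one file.
const MAX_FEATURES: usize = 200;

/// Parse one feature per line and enforce the size cap.
fn parse_features(raw: &str) -> Result<Vec<String>, String> {
    let features: Vec<String> = raw
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();
    if features.len() > MAX_FEATURES {
        return Err(format!(
            "{} features exceeds the cap of {}",
            features.len(),
            MAX_FEATURES
        ));
    }
    Ok(features)
}

/// Implicit fail-closed: the terse call panics on `Err`, so the runtime
/// decides what an over-limit file means for every request.
fn load_implicit(raw: &str) -> Vec<String> {
    parse_features(raw).unwrap()
}

/// The failure mode is part of the return type, so the caller has to
/// handle it on purpose.
#[derive(Debug, PartialEq, Eq)]
enum LoadOutcome {
    Loaded(Vec<String>),
    SkippedScoring { reason: String },
}

/// Explicit choice: log the error and report that scoring is skipped.
fn load_explicit(raw: &str) -> LoadOutcome {
    match parse_features(raw) {
        Ok(features) => LoadOutcome::Loaded(features),
        Err(reason) => {
            eprintln!("feature file rejected: {}", reason);
            LoadOutcome::SkippedScoring { reason }
        }
    }
}
```

Both functions see the same `Err`; only `load_explicit` records a decision about it. Whether that decision should be fail-open or fail-closed is a separate question, but it can no longer be made by accident.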
Canonical Cloudflare instances¶
- 2025-11-18: FL2's Bot Management module `.unwrap()`ed on a feature-file size-cap check. Over-limit → panic → 5xx for every request. ~3 hours core-traffic outage. See sources/2025-11-18-cloudflare-outage-on-november-18-2025. Implicit fail-closed.
- 2025-12-05: FL1's rulesets engine nil-indexed on a `rule_result.execute` post-processing path that had never run before. Exception → 5xx → ~25 min outage. See sources/2025-12-05-cloudflare-outage-on-december-5-2025. Implicit fail-closed.
The 12-05 post names its remediation "Fail-Open" Error Handling: "if a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests. Some services will likely give the customer the option to fail open or closed in certain scenarios."
The 11-18 post names the same project earlier: "Reviewing failure modes for error conditions across all core proxy modules."
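The "default to a known-good state" half of that remediation can be sketched as a holder that swaps in a new configuration only after it validates. This is an illustration under stated assumptions, not Cloudflare's implementation: `FeatureConfig`, `ConfigHolder`, `validate`, `config_with`, and the `MAX_FEATURES` value are all hypothetical.

```rust
/// Hypothetical cap on the number of features in one config.
const MAX_FEATURES: usize = 200;

#[derive(Debug, Clone, PartialEq, Eq)]
struct FeatureConfig {
    version: u64,
    features: Vec<String>,
}

/// Reject configs that are empty or exceed the cap.
fn validate(config: &FeatureConfig) -> Result<(), String> {
    if config.features.is_empty() {
        return Err("config has no features".to_string());
    }
    if config.features.len() > MAX_FEATURES {
        return Err(format!(
            "{} features exceeds the cap of {}",
            config.features.len(),
            MAX_FEATURES
        ));
    }
    Ok(())
}

/// Holds the last config that passed validation.
struct ConfigHolder {
    active: FeatureConfig,
    rejected_updates: u64,
}

impl ConfigHolder {
    /// The initial config must be valid; there is nothing to fall back to yet.
    fn new(initial: FeatureConfig) -> Result<Self, String> {
        validate(&initial)?;
        Ok(ConfigHolder { active: initial, rejected_updates: 0 })
    }

    /// Fail-open update: a bad candidate is logged and counted, and the
    /// previously active config keeps serving. Returns true if swapped.
    fn apply_update(&mut self, candidate: FeatureConfig) -> bool {
        match validate(&candidate) {
            Ok(()) => {
                self.active = candidate;
                true
            }
            Err(err) => {
                eprintln!(
                    "rejecting config v{}: {}; keeping v{}",
                    candidate.version, err, self.active.version
                );
                self.rejected_updates += 1;
                false
            }
        }
    }

    fn active(&self) -> &FeatureConfig {
        &self.active
    }
}

/// Helper for examples: build a config with `count` placeholder features.
fn config_with(version: u64, count: usize) -> FeatureConfig {
    FeatureConfig {
        version,
        features: (0..count).map(|i| format!("feature_{}", i)).collect(),
    }
}
```

The `rejected_updates` counter matters as much as the fallback: a module that silently keeps serving stale configuration needs some signal that operators can alert on.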
Trade-off articulation¶
- WAF / Bot Management / security modules — fail-open is controversial: you may serve traffic the module would have blocked. But at scale, serving-without-scoring is usually better than 5xx for everyone — the 5xx case denies service to the entire customer base, including the legitimate users the security module exists to protect.
- Some customers may prefer fail-closed, which makes the choice an explicit per-customer option (from the 12-05 post: "give the customer the option to fail open or closed in certain scenarios").
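A per-customer option could be modelled as a default policy plus overrides, consulted only when the module has already failed. The sketch below assumes a simple in-memory table; `PolicyTable`, `FailureResponse`, `on_module_failure`, and the customer identifiers are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailurePolicy {
    Open,
    Closed,
}

/// What the proxy returns when the security module itself has failed.
#[derive(Debug, PartialEq, Eq)]
enum FailureResponse {
    /// Serve the request without a score from the failed module.
    PassedUnscored,
    /// Refuse the request with a 5xx.
    ServerError,
}

/// Platform-wide default with optional per-customer overrides.
struct PolicyTable {
    default: FailurePolicy,
    overrides: HashMap<String, FailurePolicy>,
}

impl PolicyTable {
    fn new(default: FailurePolicy) -> Self {
        PolicyTable { default, overrides: HashMap::new() }
    }

    fn set_override(&mut self, customer_id: &str, policy: FailurePolicy) {
        self.overrides.insert(customer_id.to_string(), policy);
    }

    /// A customer's explicit choice wins; everyone else gets the default.
    fn policy_for(&self, customer_id: &str) -> FailurePolicy {
        self.overrides
            .get(customer_id)
            .copied()
            .unwrap_or(self.default)
    }
}

/// Called only after the module has failed for this request.
fn on_module_failure(table: &PolicyTable, customer_id: &str) -> FailureResponse {
    match table.policy_for(customer_id) {
        FailurePolicy::Open => FailureResponse::PassedUnscored,
        FailurePolicy::Closed => FailureResponse::ServerError,
    }
}
```

Under this sketch, a customer that opts into fail-closed receives 5xx responses during a module failure while the rest of the customer base keeps being served unscored, so the blast radius of the strict choice is limited to whoever asked for it.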
Seen in¶
- sources/2025-11-18-cloudflare-outage-on-november-18-2025
- sources/2025-12-05-cloudflare-outage-on-december-5-2025