CONCEPT Cited by 3 sources

Fail-open vs fail-closed

A design choice for what a module does when its input is corrupt, out-of-range, or fails an invariant:

  • Fail-closed — refuse to serve. Return 5xx / drop the request / panic the worker. Safer in security contexts (default-deny); dangerous in availability contexts (a single bad input takes out every request that reaches the module).
  • Fail-open — log the error, fall back to a known-good prior state or pass traffic without scoring, continue serving. Safer in availability contexts; dangerous in security contexts if the module was the only thing enforcing a policy.

The choice is not universal; different modules on the same hot path can and should differ. The discipline is to make the choice explicitly per module, not by accident.
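
One way to make the choice explicit is to require every module to declare its failure policy, so that an unhandled error can never pick one silently. The sketch below is illustrative only: `FailurePolicy`, `Module`, `handle`, the status codes, and the two example checks are hypothetical names, not taken from any of the cited posts.

```python
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List

log = logging.getLogger("hot_path")


class FailurePolicy(Enum):
    FAIL_OPEN = "fail-open"      # log, skip the module, keep serving
    FAIL_CLOSED = "fail-closed"  # refuse to serve


@dataclass(frozen=True)
class Module:
    name: str
    check: Callable[[dict], bool]  # True = allow; may raise on bad input
    on_error: FailurePolicy        # required field: there is no implicit default


def handle(request: dict, modules: List[Module]) -> int:
    """Return 200 (served), 403 (blocked by a module) or 503 (failed closed)."""
    for module in modules:
        try:
            allowed = module.check(request)
        except Exception as exc:
            log.error("module %s failed: %r", module.name, exc)
            if module.on_error is FailurePolicy.FAIL_CLOSED:
                return 503
            continue  # fail-open: this module is skipped for this request
        if not allowed:
            return 403
    return 200


# Hypothetical modules: a scorer whose input is corrupt, and a token check.
def broken_scorer(request: dict) -> bool:
    raise ValueError("corrupt feature file")


def token_check(request: dict) -> bool:
    return request.get("token") == "secret"


PIPELINE = [
    Module("bot_scoring", broken_scorer, FailurePolicy.FAIL_OPEN),
    Module("auth", token_check, FailurePolicy.FAIL_CLOSED),
]
```

With this pipeline, a request carrying the valid token is still served even though scoring raised, while the same scorer declared `FAIL_CLOSED` would turn every request into a 503. Because `on_error` has no default value, constructing a `Module` without choosing a policy is a `TypeError`.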

The implicit-fail-closed trap

Many crashes are fail-closed by default, not by design:

  • .unwrap() on a Rust Result that holds an Err panics the worker.
  • Indexing a nil value in Lua raises an error; if the request handler doesn't wrap the call in pcall, the request fails.
  • A failed assertion in C++ aborts the process.

The programmer chose a terse syntax; the runtime chose fail-closed. The architecture never explicitly chose.
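
The same trap exists in most languages. As a rough Python analogue (not one of the examples above), a plain dictionary lookup raises `KeyError` on a missing field, and if nothing catches it the request fails. The function names and the feature fields below are hypothetical.

```python
import logging
from typing import Optional

log = logging.getLogger("scoring")


def score_implicit(features: dict) -> float:
    # Terse syntax: a missing key raises KeyError. Nobody decided that the
    # request should fail; the runtime made that choice.
    return features["weight"] * features["signal"]


def score_explicit(features: dict) -> Optional[float]:
    """Return a score, or None to mean "unscored"."""
    try:
        return features["weight"] * features["signal"]
    except (KeyError, TypeError) as exc:
        # The failure mode is now visible in the code and in the logs.
        # Returning None is this sketch's fail-open choice; re-raising here
        # would be an equally explicit fail-closed choice.
        log.error("scoring input invalid: %r", exc)
        return None
```

The second version is not inherently safer; the difference is that the behaviour on bad input is a decision someone can review.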

Canonical Cloudflare instances

The 12-05 post lists "Fail-Open" Error Handling among its stated remediations: "if a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests. Some services will likely give the customer the option to fail open or closed in certain scenarios."

The 11-18 post names the same project earlier: "Reviewing failure modes for error conditions across all core proxy modules."

Trade-off articulation

  • WAF / Bot Management / security modules — fail-open is controversial: you may serve traffic the module would have blocked. But at scale, serving-without-scoring is usually better than 5xx for everyone — the 5xx case denies service to the entire customer base, including the legitimate users the security module exists to protect.
  • Some customers may prefer fail-closed — explicit per-customer choice (from the 12-05 post: "give the customer the option to fail open or closed in certain scenarios").
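
A per-customer choice could be modelled as an override on top of a platform default. This is a minimal sketch under assumed names: `PLATFORM_DEFAULT`, `CUSTOMER_OVERRIDES`, the customer ID, and the status codes are all hypothetical, and the posts do not describe how the option is actually exposed.

```python
from enum import Enum


class FailurePolicy(Enum):
    FAIL_OPEN = "fail-open"
    FAIL_CLOSED = "fail-closed"


# Assumed platform-wide default for a security module whose scoring fails.
PLATFORM_DEFAULT = FailurePolicy.FAIL_OPEN

# Hypothetical opt-in: customers who would rather refuse traffic than
# serve it unscored.
CUSTOMER_OVERRIDES = {
    "strict-customer": FailurePolicy.FAIL_CLOSED,
}


def policy_for(customer_id: str) -> FailurePolicy:
    return CUSTOMER_OVERRIDES.get(customer_id, PLATFORM_DEFAULT)


def status_when_scoring_fails(customer_id: str) -> int:
    """HTTP-style status to return when the scoring module cannot run."""
    if policy_for(customer_id) is FailurePolicy.FAIL_CLOSED:
        return 503  # customer opted for default-deny
    return 200      # serve without scoring
```

Keeping the override table separate from the default means one customer's stricter choice never changes what every other customer experiences during the same failure.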

Fail-stale as a stronger default

The 2026-05-01 post "Code Orange: Fail Small is complete" extends the binary fail-open-vs-fail-closed framing to a ternary ladder by introducing fail-stale as the preferred default:

We will now use the last known good configuration where possible ("fail stale"), and if that isn't possible we have reviewed each failure case and implemented "fail open" or "fail close" depending on whether serving traffic with reduced functionality is preferable to failing to serve traffic.

Ordering: fail stale (correct behaviour on outdated data) > fail open (degraded behaviour) > fail closed (unavailable). Fail stale dominates fail open whenever a last-known-good version exists. The "where possible" caveat is honest about the precondition: the module has to keep a reference to the prior valid version, not just a single buffer that each new load overwrites.
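
The ladder can be sketched as a loader that validates a candidate configuration before swapping it in, and only falls through to fail-open or fail-closed when no earlier valid version exists. Everything named here is an assumption for illustration: the JSON format, the `features` field, the `MAX_FEATURES` cap of 200, and the `ScoringModule` class.

```python
import json
import logging
from typing import Optional

log = logging.getLogger("config_loader")

MAX_FEATURES = 200  # hypothetical feature cap


class ConfigError(ValueError):
    """Raised when a candidate configuration is corrupt or out of range."""


def parse_config(raw: str) -> dict:
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ConfigError("corrupt configuration") from exc
    features = config.get("features") if isinstance(config, dict) else None
    if not isinstance(features, list):
        raise ConfigError("missing feature list")
    if len(features) > MAX_FEATURES:
        raise ConfigError("feature cap exceeded")
    return config


class ScoringModule:
    def __init__(self, fail_open: bool):
        self.fail_open = fail_open  # reviewed per module, used only as a last resort
        self.last_known_good: Optional[dict] = None

    def reload(self, raw: str) -> str:
        """Return the resulting mode: "fresh", "stale", "open" or "closed"."""
        try:
            candidate = parse_config(raw)  # validate before touching state
        except ConfigError as exc:
            log.error("rejected new configuration: %s", exc)
            if self.last_known_good is not None:
                return "stale"  # keep serving with the prior valid version
            return "open" if self.fail_open else "closed"
        self.last_known_good = candidate  # swap only after validation passed
        return "fresh"
```

The important detail is that `last_known_good` is only ever assigned after validation succeeds, so a bad push cannot destroy the version the module would fall back to.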
