CONCEPT Cited by 1 source
Fail stale¶
A failure-mode default where a module, on receiving an input (config file, rule set, feature file, topology snapshot) that fails validation or cannot be read, continues using the last known good version of that input rather than either crashing (fail-closed) or serving with the input removed (fail-open).
"Fail stale" extends the binary fail-open-vs-fail-closed question to a three-way ladder:
- Fail stale (preferred) — use the last-known-good input; the module serves with correct behaviour on outdated data.
- Fail open — serve without the input; the module serves with degraded behaviour (e.g., bot scoring returning 0, WAF passing traffic without rules). Degraded is preferable to unavailable.
- Fail closed — refuse to serve; the module returns 5xx. Preserves correctness at the cost of availability. Usually the wrong default except when the module is the only thing enforcing a policy that must never be bypassed.
Cloudflare's framing¶
The term is named explicitly in the 2026-05-01 Code Orange: Fail Small is complete post:
We will now use the last known good configuration where possible ("fail stale"), and if that isn't possible we have reviewed each failure case and implemented "fail open" or "fail close" depending on whether serving traffic with reduced functionality is preferable to failing to serve traffic.
The post's worked example generalises from the 2025-11-18 feature-file incident:
If data were generated again that our system could not read, the system would refuse to use the updated configuration and instead use the old configuration. If the old configuration was not available for some reason, it would fail open to ensure customer production traffic continues to be served, which is a much better outcome than downtime.
The ordering matters: fail stale first, fail open as second-best fallback, fail closed as explicit exception.
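The ordering can be written down as a small decision function. This is a minimal sketch rather than Cloudflare's implementation: the `Mode` enum, `choose_mode`, and the `must_never_bypass` flag are hypothetical names introduced here for illustration.

```python
from enum import Enum


class Mode(Enum):
    FRESH = "fresh"              # new input validated; use it
    FAIL_STALE = "fail_stale"    # keep serving with the last-known-good input
    FAIL_OPEN = "fail_open"      # serve without the input (degraded)
    FAIL_CLOSED = "fail_closed"  # refuse to serve (5xx)


def choose_mode(new_input_valid: bool,
                has_last_known_good: bool,
                must_never_bypass: bool = False) -> Mode:
    """Walk the ladder: stale first, open second, closed as explicit exception."""
    if new_input_valid:
        return Mode.FRESH
    if has_last_known_good:
        return Mode.FAIL_STALE
    # No last-known-good copy. Fail closed only when a reviewed policy
    # says the module must never be bypassed (hypothetical flag).
    if must_never_bypass:
        return Mode.FAIL_CLOSED
    return Mode.FAIL_OPEN
```

Note that in this sketch `must_never_bypass` only matters once no last-known-good copy exists, because serving with a stale input still enforces the policy.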
Why stale is preferred over open¶
Consider a Bot Management module receiving a malformed feature file:
- Fail closed — every request returns 5xx. Traffic is dropped.
- Fail open — every request is scored as "not a bot" (degraded). Real bots get through; legitimate traffic is served.
- Fail stale — every request is scored with the previously-loaded feature file. Real bots are still caught; legitimate traffic is still served; the only cost is that the feature file is minutes or hours out of date.
In this scenario fail stale is strictly better than fail open whenever a last-known-good version exists. The "where possible" caveat in the Cloudflare framing is honest: sometimes there is no last-known-good version (first-time boot, state loss, or a schema change that invalidates the old form). Fail open is then the next-best available option.
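The comparison can be made concrete with a toy model. Everything here is an assumption for illustration: the "feature file" is reduced to a set of known-bad user agents, and `classify` and `outcomes` are hypothetical helpers, not part of any real Bot Management API.

```python
from typing import Dict, Optional, Set


def classify(user_agent: str, bad_agents: Optional[Set[str]]) -> str:
    """Toy classifier: with no feature data, every request looks benign."""
    if bad_agents is None:
        return "not_a_bot"  # fail open: the module serves without its input
    return "bot" if user_agent in bad_agents else "not_a_bot"


# The previously loaded (last-known-good) feature data.
last_known_good = {"scraper/1.0"}


def outcomes(user_agent: str) -> Dict[str, str]:
    """What one request sees under each mode after a malformed update."""
    return {
        "fail_closed": "5xx",
        "fail_open": classify(user_agent, None),
        "fail_stale": classify(user_agent, last_known_good),
    }
```

In this model a known bot (`scraper/1.0`) is still caught under fail stale but slips through under fail open. A bot that appears only in the rejected update (say `scraper/2.0`) is missed by both, which is the staleness cost the paragraph above describes.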
Preconditions¶
Fail stale requires the module to maintain a reference to the previous valid version separate from the currently-loading version. The load pipeline becomes:
- Validate the new input on a staging buffer.
- If validation fails, log + alert + keep serving with the previous buffer; don't swap.
- If validation passes, atomically swap the active pointer to the new buffer.
This interacts with preallocated-memory optimisations: a fixed-size buffer that can hold only one version cannot support fail stale, because loading the new input overwrites the only good copy. A double-buffered module that always keeps the last-good copy is the natural substrate.
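The load pipeline can be sketched as a small loader that parses and validates into a staging object and only then swaps the active reference. This is an illustrative sketch under stated assumptions: the input is JSON, and `FailStaleLoader`, `validate_features`, and the `MAX_FEATURES` limit are hypothetical names and values, not taken from the source.

```python
import json
import logging
import threading
from typing import Any, Callable, Optional

log = logging.getLogger("fail_stale")

MAX_FEATURES = 3  # hypothetical size limit, for illustration only


def validate_features(doc: Any) -> None:
    """Raise ValueError if the staged document breaks an invariant."""
    if not isinstance(doc, dict) or not isinstance(doc.get("features"), list):
        raise ValueError("expected an object with a 'features' list")
    if len(doc["features"]) > MAX_FEATURES:
        raise ValueError("too many features")


class FailStaleLoader:
    def __init__(self, validate: Callable[[Any], None]) -> None:
        self._validate = validate
        self._active: Optional[Any] = None  # last-known-good; None until first success
        self._lock = threading.Lock()

    def load(self, raw: str) -> bool:
        """Stage and validate the new input; swap only on success."""
        try:
            staged = json.loads(raw)  # staging copy, active copy untouched
            self._validate(staged)
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            log.error("rejected new input, keeping last-known-good: %s", exc)
            return False
        with self._lock:
            self._active = staged  # swap the active reference
        return True

    def active(self) -> Optional[Any]:
        with self._lock:
            return self._active
```

If `active()` returns `None`, no valid input has ever been loaded, and the caller has to fall through to its fail-open or fail-closed behaviour.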
When fail-stale is insufficient¶
- The input is the thing that needs to be fresh. For example, with a live attack-signature feed, serving a stale signature means serving with 5-minute-old attack knowledge. Here, failing open (removing the module from the serving path) and raising an alert may be the right call, so that the threat-response team knows to push a new signature urgently.
- Schema-incompatible update. If the new code no longer supports the old schema, rejecting the new input leaves nothing valid to fall back to, so the new input can't be rejected in isolation. This is avoided by never deploying the schema change atomically with the code change (see patterns/expand-migrate-contract).
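One way to encode these two limits is a guard that checks whether the last-known-good copy is still eligible before falling back to it. The source does not describe such a check; the function name, the staleness threshold, and the integer schema versions are all assumptions made for this sketch.

```python
from typing import FrozenSet


def can_fail_stale(age_seconds: float,
                   max_age_seconds: float,
                   lkg_schema: int,
                   supported_schemas: FrozenSet[int]) -> bool:
    """Return True only if the last-known-good copy is usable.

    age_seconds       -- time since the last-known-good copy was loaded
    max_age_seconds   -- hypothetical freshness bound for this input
    lkg_schema        -- schema version of the last-known-good copy
    supported_schemas -- schema versions the running code can still read
    """
    fresh_enough = age_seconds <= max_age_seconds
    readable = lkg_schema in supported_schemas
    return fresh_enough and readable
```

When the guard returns `False`, the module would move down the ladder (fail open plus an alert, or fail closed where policy demands it) instead of serving stale data.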
Canonical wiki instance¶
sources/2026-05-01-cloudflare-code-orange-fail-small-complete — Cloudflare names "fail stale" as the preferred failure-mode default explicitly. The worked example shows the November 2025 Bot Management failure: "if the same Bot Management change that caused the failure in November were to roll out now, the system would detect the failure in an early stage of the deployment, before it had affected anything more than a small percentage of traffic" — and the fallback chain continues:
If the old configuration was not available for some reason, it would fail open to ensure customer production traffic continues to be served, which is a much better outcome than downtime.
Seen in¶
- sources/2026-05-01-cloudflare-code-orange-fail-small-complete — canonical wiki instance of the term and the three-way ladder; explicit preference over binary fail-open / fail-closed; worked example against the 2025-11-18 feature-file incident.
- sources/2025-11-18-cloudflare-outage-on-november-18-2025 — absence-of-fail-stale instance. The FL2 bots module had no last-known-good-retention mechanism; the bad feature file overwrote the active buffer; .unwrap() panicked on the bounds check; fail-closed-by-implicit-default.
- sources/2025-12-05-cloudflare-outage-on-december-5-2025 — absence-of-fail-stale instance at a different surface (rulesets engine post-processing). Stated remediation: "Fail-Open Error Handling"; Code Orange's completion brings the stronger "Fail-Stale Error Handling" as the preferred default.
Related¶
- concepts/fail-open-vs-fail-closed — the binary framing this concept extends.
- concepts/internally-generated-untrusted-input — the trust-boundary concept that makes ingest validation necessary; fail stale is the failure-mode behaviour when validation rejects a new input.
- patterns/harden-ingestion-of-internal-config — the construction principle; validation + fail-stale + the double-buffered load pipeline are the implementation.
- concepts/feature-file-size-limit — the bounds-check invariant; preallocation alone doesn't give fail-stale unless paired with double-buffering.
- systems/cloudflare-bot-management — the canonical module where the 2025-11-18 absence-of-fail-stale detonated; post-Code-Orange instance of fail-stale applied.
- systems/snapstone — the config-deployment system whose health-gated rollout catches the bad input before it reaches most of the fleet; fail-stale is the module-tier companion to Snapstone's rollout-tier guard.