CONCEPT Cited by 2 sources
Fail stale¶
A failure-mode default where a module, on receiving an input (config file, rule set, feature file, topology snapshot) that fails validation or cannot be read, continues using the last known good version of that input rather than either crashing (fail-closed) or serving with the input removed (fail-open).
"Fail stale" extends the binary fail-open-vs-fail-closed question to a three-way ladder:
- Fail stale (preferred) — use the last-known-good input; the module serves with correct behaviour on outdated data.
- Fail open — serve without the input; the module serves with degraded behaviour (e.g., bot scoring returning 0, WAF passing traffic without rules). Degraded is preferable to unavailable.
- Fail closed — refuse to serve; the module returns 5xx. Preserves correctness at the cost of availability. Usually the wrong default except when the module is the only thing enforcing a policy that must never be bypassed.
Cloudflare's framing¶
The term is named explicitly in the 2026-05-01 post "Code Orange: Fail Small is complete":
We will now use the last known good configuration where possible ("fail stale"), and if that isn't possible we have reviewed each failure case and implemented "fail open" or "fail close" depending on whether serving traffic with reduced functionality is preferable to failing to serve traffic.
The post's worked example generalises from the 2025-11-18 feature-file incident:
If data were generated again that our system could not read, the system would refuse to use the updated configuration and instead use the old configuration. If the old configuration was not available for some reason, it would fail open to ensure customer production traffic continues to be served, which is a much better outcome than downtime.
The ordering matters: fail stale first, fail open as second-best fallback, fail closed as explicit exception.
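The ordering can be sketched as a small decision function. This is a minimal illustration, not Cloudflare's code: `choose_input`, `Mode`, and `Fallback` are hypothetical names, and a `new_input` of `None` is assumed to mean "validation failed or the input could not be read".

```python
from enum import Enum
from typing import Optional, Tuple


class Fallback(Enum):
    """Per-module choice for the case where no last-known-good exists."""
    FAIL_OPEN = "fail_open"      # serve without the input (degraded)
    FAIL_CLOSED = "fail_closed"  # refuse to serve


class Mode(Enum):
    FRESH = "fresh"
    STALE = "stale"
    OPEN = "open"
    CLOSED = "closed"


def choose_input(
    new_input: Optional[dict],
    last_known_good: Optional[dict],
    fallback: Fallback = Fallback.FAIL_OPEN,
) -> Tuple[Mode, Optional[dict]]:
    """Return (mode, input to serve with) following the three-way ladder."""
    if new_input is not None:
        # The new input validated: use it.
        return Mode.FRESH, new_input
    if last_known_good is not None:
        # Fail stale: preferred whenever an older valid copy exists.
        return Mode.STALE, last_known_good
    if fallback is Fallback.FAIL_OPEN:
        # No stale copy: serve degraded, without the input.
        return Mode.OPEN, None
    # Explicit exception: this module must never be bypassed.
    return Mode.CLOSED, None
```

The `fallback` argument only matters on the last rung, which mirrors the post's description of reviewing each failure case individually once fail stale is not possible.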
Why stale is preferred over open¶
Consider a Bot Management module receiving a malformed feature file:
- Fail closed — every request returns 5xx. Traffic is dropped.
- Fail open — every request is scored as "not a bot" (degraded). Real bots get through; legitimate traffic is served.
- Fail stale — every request is scored with the previously-loaded feature file. Real bots are still caught; legitimate traffic is still served; the only cost is that the feature file is minutes or hours out of date.
In this example, fail stale is strictly better than fail open whenever a last-known-good version exists. The "where possible" caveat in the Cloudflare framing is honest: sometimes there's no last-known-good (first-time boot, state loss, schema change that invalidates the old form). Fail open is then the next-best option.
Preconditions¶
Fail stale requires the module to maintain a reference to the previous valid version separate from the currently-loading version. The load pipeline becomes:
- Validate the new input on a staging buffer.
- If validation fails, log + alert + keep serving with the previous buffer; don't swap.
- If validation passes, atomically swap the active pointer to the new buffer.
This interacts with preallocated-memory optimisations: a fixed-size buffer that can hold only one version cannot retain the last-known-good copy while a new one loads, which rules out fail-stale. A double-buffered module that always keeps the last-good copy is the natural substrate.
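The load pipeline above might look like the following sketch. It assumes a JSON input with a `features` list; `StaleSafeConfig`, `validate_features`, and the `MAX_FEATURES` limit are hypothetical names and values chosen for illustration, not taken from Cloudflare's implementation.

```python
import json
import logging
import threading
from typing import Callable, Optional

log = logging.getLogger("fail_stale")

MAX_FEATURES = 200  # illustrative bound, not a documented limit


def validate_features(doc: object) -> None:
    """Raise ValueError if the staged document is not usable."""
    features = doc.get("features") if isinstance(doc, dict) else None
    if not isinstance(features, list):
        raise ValueError("missing 'features' list")
    if len(features) > MAX_FEATURES:
        raise ValueError(f"{len(features)} features exceeds {MAX_FEATURES}")


class StaleSafeConfig:
    """Keeps the last-known-good version separate from the one being loaded."""

    def __init__(self, validate: Callable[[object], None]):
        self._validate = validate
        self._active: Optional[dict] = None  # last-known-good buffer
        self._lock = threading.Lock()

    def load(self, raw: bytes) -> bool:
        """Stage and validate `raw`; swap only on success."""
        try:
            staged = json.loads(raw)  # staging buffer, not yet visible
            self._validate(staged)
        except (ValueError, TypeError) as exc:
            # Fail stale: log, leave the active buffer untouched, do not swap.
            log.error("rejected new input, keeping last-known-good: %s", exc)
            return False
        with self._lock:
            self._active = staged  # swap the active reference
        return True

    def active(self) -> Optional[dict]:
        """Return the version currently being served, or None if never loaded."""
        with self._lock:
            return self._active
```

A caller that gets `None` from `active()` has no last-known-good to fall back on and has to take the module's fail-open or fail-closed path instead.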
When fail-stale is insufficient¶
- The input is the thing that needs to be fresh. For example, with a live attack-signature feed, serving a stale signature means serving with 5-minute-old attack knowledge. Here, failing open (removing the module from the request path) and raising an alert may be the right call, so the threat-response team knows to push a new signature urgently.
- Schema-incompatible update. If the new code no longer supports the old schema, the module cannot reject the new input and fall back to the old one. Avoid this by never deploying the schema change atomically with the code change (see patterns/expand-migrate-contract).
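One way to handle the freshness case is to bound how old a stale copy may be before the module stops using it. The sketch below is an assumption about how such a bound could be expressed, not something the Cloudflare posts describe; `Snapshot`, `usable_snapshot`, and `max_age_s` are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    data: dict
    loaded_at: float  # seconds on a monotonic clock


def usable_snapshot(
    snap: Optional[Snapshot],
    max_age_s: float,
    now: Optional[float] = None,
) -> Optional[Snapshot]:
    """Return the snapshot if it is fresh enough to serve stale, else None.

    A None result means the caller should move to the next rung
    (fail open or fail closed) and raise an alert.
    """
    if snap is None:
        return None
    current = time.monotonic() if now is None else now
    if current - snap.loaded_at > max_age_s:
        return None  # too old: staleness itself has become the risk
    return snap
```

For a signature feed the limit would be short; for inputs where freshness is not the safety property it could be effectively unbounded.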
Canonical wiki instance¶
sources/2026-05-01-cloudflare-code-orange-fail-small-complete — Cloudflare names "fail stale" as the preferred failure-mode default explicitly. The worked example shows the November 2025 Bot Management failure: "if the same Bot Management change that caused the failure in November were to roll out now, the system would detect the failure in an early stage of the deployment, before it had affected anything more than a small percentage of traffic" — and the fallback chain continues:
If the old configuration was not available for some reason, it would fail open to ensure customer production traffic continues to be served, which is a much better outcome than downtime.
Seen in¶
- sources/2026-05-06-cloudflare-when-dnssec-goes-wrong-de-tld-outage — DNS-resolver-altitude instance of fail-stale. Codified for recursive DNS resolvers as serve-stale (RFC 8767): when an upstream fetch fails (timeout, SERVFAIL, or unverifiable DNSSEC signatures), serve the last-known-good cached record past its TTL rather than returning an error. On 2026-05-05, DENIC (the .de TLD registry) began publishing non-validatable DNSSEC signatures during a routine key rollover. 1.1.1.1 (systems/cloudflare-1-1-1-1-resolver) serving stale via Big Pineapple is the reason NOERROR rates stayed stable for ~3 hours despite the upstream break — Cloudflare's verbatim framing: "That's 'serve stale' at work." See the dedicated pattern page patterns/serve-stale-over-servfail. This is the DNS realisation of the same three-way ladder the 2026-05-01 Code Orange post named at configuration-deployment altitude: fail-stale (serve old record) first, fail-open (e.g., declare a Negative Trust Anchor bypassing validation) as second-best, fail-closed (SERVFAIL) only when both are unavailable or inappropriate.
- sources/2026-05-01-cloudflare-code-orange-fail-small-complete — canonical wiki instance of the term and the three-way ladder; explicit preference over binary fail-open / fail-closed; worked example against the 2025-11-18 feature-file incident.
- sources/2025-11-18-cloudflare-outage-on-november-18-2025 — absence-of-fail-stale instance. The FL2 bots module had no last-known-good-retention mechanism; the bad feature file overwrote the active buffer; .unwrap() panicked on the bounds check; fail-closed-by-implicit-default.
- sources/2025-12-05-cloudflare-outage-on-december-5-2025 — absence-of-fail-stale instance at a different surface (rulesets engine post-processing). Stated remediation: "Fail-Open Error Handling"; Code Orange's completion brings the stronger "Fail-Stale Error Handling" as the preferred default.
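The serve-stale behaviour in the first entry above can be sketched as a cache lookup that falls back to an expired record when the upstream fetch fails. This is a simplified illustration, not the 1.1.1.1 implementation: `resolve`, `CacheEntry`, and `UpstreamError` are hypothetical names, and the one-day `max_stale_s` and 300-second `ttl_s` are illustrative values.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class CacheEntry:
    value: str
    expires_at: float  # time at which the record's TTL runs out


class UpstreamError(Exception):
    """Timeout, SERVFAIL, or a signature that fails validation."""


def resolve(
    name: str,
    cache: Dict[str, CacheEntry],
    fetch: Callable[[str], str],
    now: float,
    max_stale_s: float = 86400.0,
    ttl_s: float = 300.0,
) -> Tuple[str, str]:
    """Return (answer, source), preferring fresh data over stale over error."""
    entry = cache.get(name)
    if entry is not None and now < entry.expires_at:
        return entry.value, "fresh-cache"
    try:
        value = fetch(name)
    except UpstreamError:
        # Fail stale: reuse the expired record if it is not too old.
        if entry is not None and now - entry.expires_at <= max_stale_s:
            return entry.value, "stale"
        raise  # nothing usable: the caller answers with an error
    cache[name] = CacheEntry(value, now + ttl_s)
    return value, "fetched"
```

Only a successful fetch refreshes the cache entry, so a broken upstream can never overwrite the last-known-good record.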
Related¶
- concepts/fail-open-vs-fail-closed — the binary framing this concept extends.
- concepts/internally-generated-untrusted-input — the trust-boundary concept that makes ingest validation necessary; fail stale is the failure-mode behaviour when validation rejects a new input.
- patterns/harden-ingestion-of-internal-config — the construction principle; validation + fail-stale + the double-buffered load pipeline are the implementation.
- concepts/feature-file-size-limit — the bounds-check invariant; preallocation alone doesn't give fail-stale unless paired with double-buffering.
- systems/cloudflare-bot-management — the canonical module where the 2025-11-18 absence-of-fail-stale detonated; post-Code-Orange instance of fail-stale applied.
- systems/snapstone — the config-deployment system whose health-gated rollout catches the bad input before it reaches most of the fleet; fail-stale is the module-tier companion to Snapstone's rollout-tier guard.
- concepts/dns-resolver-caching · patterns/serve-stale-over-servfail — the DNS-resolver altitude realisation of the same failure-mode-default principle.
- systems/cloudflare-1-1-1-1-resolver — Big Pineapple's serve-stale implementation cushioned the 2026-05-05 .de DNSSEC break per the 2026-05-06 writeup.