# Cloudflare outage on December 5, 2025

## Summary
On 2025-12-05 at 08:47 UTC, a portion of Cloudflare's network began serving HTTP 500 errors for a subset of customers. The incident was resolved at 09:12 UTC — ~25 minutes of total impact. Approximately 28% of Cloudflare's total HTTP traffic was affected.
Root cause was not an attack. A WAF body-parsing change was being rolled out to protect customers against CVE-2025-55182 in React Server Components: the Cloudflare proxy's in-memory HTTP request-body buffer was being raised from 128 KB to 1 MB (the default Next.js limit). This first change was going out via gradual deployment — safely.
During the rollout, an internal WAF testing tool turned out not to support the larger buffer. Because the tool was not needed at that time and had no customer-traffic effect, a second change was made to turn it off — this one through Cloudflare's global configuration system, which does not perform gradual rollouts: changes propagate to the entire fleet within seconds.
In the legacy FL1 proxy (Lua-on-nginx / OpenResty), this second change hit a seven-year-old unnoticed bug in the rulesets engine's killswitch path. The killswitch correctly skipped evaluating the execute-action rule — but the post-processing step then unconditionally dereferenced `rule_result.execute.results`, which no longer existed because the rule had been skipped. Lua threw a nil-index exception:

    [lua] Failed to run module rulesets callback late_routing:
    /usr/local/nginx-fl/lua/modules/init.lua:314:
    attempt to index field 'execute' (a nil value)

Every affected request returned HTTP 500. Customers on the newer Rust-based FL2 proxy were unaffected — "In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur." Customers on the China network were also unaffected.
The change was reverted at 09:11 UTC; all traffic was restored by 09:12 UTC. This is the second major self-inflicted outage in three weeks — the [[sources/2025-11-18-cloudflare-outage-on-november-18-2025|18 November 2025 incident]] was acknowledged as structurally similar (a single change propagating to the entire network), and the same remediation projects (staged rollouts for config, streamlined break-glass, fail-open data-plane error handling) are stated to be still incomplete.
## Key takeaways
- A seven-year-old dormant Lua bug detonated because a never-before-used code path finally ran. The bug was in the rulesets engine's killswitch subsystem — specifically the path that handles applying a killswitch to a rule with `action=execute`, which had never been exercised in production before. When the killswitch fired, the evaluation code correctly skipped the `execute` (so no sub-ruleset evaluation happened), but the post-processing code unconditionally did `rule_result.execute.results = ruleset_results[...]`. Because the rule had been skipped, `rule_result.execute` was nil — nil-index exception, HTTP 500. "This is a straightforward error in the code, which had existed undetected for many years. This type of code error is prevented by languages with strong type systems." Sibling wiki instance of concepts/latent-misconfiguration applied not to config but to code — dormant wrong code, gated by a precondition that happens not to hold, activated the first time that precondition is released. (Canonical wiki instance of concepts/nil-index-lua-bug.)
- Strong-type-system languages structurally prevent this bug class; FL2 didn't break. The post states explicitly that the Rust rewrite in FL2 does not have the bug — not because it was re-found and fixed during the rewrite, but because "This type of code error is prevented by languages with strong type systems". In Rust, the equivalent of `rule_result.execute.results = ...` requires unwrapping an `Option<_>` or pattern-matching a variant; the compiler does not let the code compile without handling the absent case. Pairs with concepts/memory-safety / Aurora DSQL / Dropbox Nucleus as another data point for new code → safe language at the margin (the Android-team framing); the 2025-10 Workers / V8 post describes the same reflex at a different layer. (Canonical wiki instance of patterns/rust-replacement-of-dynamic-language-hot-path.)
- Cloudflare runs two proxy generations side-by-side mid-migration, and the fault surfaced on the legacy one. Only customers on FL1 + Cloudflare Managed Ruleset were impacted; FL2 customers and China customers were fine. Same shape as the 2025-07-14 legacy-vs-strategic addressing-system incident: a long-running dual-system migration where the legacy surface carries the latent hazards and the strategic surface has moved past them. The stated remediation again is to accelerate migration rather than hardening the legacy surface in place.
- The global configuration system is a second anycast-class single-action global-change surface. Distinct from the addressing / topology system that caused 2025-07-14 but with the same structural property: one edit → entire fleet in seconds, no canary, no health-mediated stages. The 2025-11-18 incident post-mortem flagged exactly this system as "under review" — and two-and-a-half weeks later it was the delivery mechanism for this change. Cloudflare's stated remediation is staged rollouts with health validation and quick rollback for all data used for rapid threat response and general configuration, not just BGP/topology. (Second canonical wiki instance of patterns/global-configuration-push as an antipattern surface; absence-of-pattern instance of patterns/progressive-configuration-rollout.)
- Turning a thing OFF through the fast path is dangerous even when the thing is "not needed" and "has no customer-traffic effect". The first change (128 KB → 1 MB buffer) was being rolled out gradually — the right way. The second change (turning off the internal WAF testing tool so it wouldn't choke on the new buffer size) went through the global configuration system because it was internal and believed harmless. It was not harmless. Internal-tool changes that interact with customer-traffic code paths must live under the same safety discipline as customer-facing code: "critical operations can still be achieved in the face of additional types of failures. This applies to internal services as well as all standard methods of interaction with the Cloudflare control plane used by all Cloudflare customers."
- Fail-open vs fail-closed is explicit in the stated remediation. Cloudflare names "Fail-Open" Error Handling as the general remediation: "we are replacing the incorrectly applied hard-fail logic across all critical Cloudflare data-plane components. If a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests. Some services will likely give the customer the option to fail open or closed in certain scenarios." The 12-05 bug was a hard-fail Lua exception; a fail-open stance on unexpected rules-engine errors would have degraded to serve-without-WAF-scoring rather than serve 500. (Canonical wiki instance of concepts/fail-open-vs-fail-closed.)
- Public post-mortems continue to name the missing discipline, not just the specific bug. Same pattern as the 1.1.1.1 and 11-18 posts: the RCA identifies the class of fix (progressive rollout, break-glass, fail-open) and names the legacy surfaces that lack it. The blog closes by locking down all changes to the network until better mitigation and rollback systems are in place — an unusually strong operational stance.
## Timeline
| Time (UTC) | Status | Description |
|---|---|---|
| 08:47 | INCIDENT start | Configuration change deployed and propagated to the network |
| 08:48 | Full impact | Change fully propagated |
| 08:50 | INCIDENT declared | Automated alerts |
| 09:11 | Change reverted | Configuration change reverted and propagation start |
| 09:12 | INCIDENT end | Revert fully propagated, all traffic restored |
Elapsed: 25 minutes customer-visible impact. 3 minutes start-to-alert, 21 minutes alert-to-revert-initiated, 1 minute revert-to-full-restoration.
## Mechanism

### 1. The buffer-size change (benign)
Cloudflare's WAF buffers HTTP request bodies in memory for analysis. Buffer size had been 128 KB. To protect customers against CVE-2025-55182 in React Server Components, Cloudflare was rolling out a new buffer cap of 1 MB (the default Next.js limit). This rollout went through the gradual deployment system — canary, staged, health-monitored — and did not cause the incident.
### 2. The WAF-testing-tool disable (the trigger)
The internal WAF testing tool did not support the new buffer size. Because the tool was not needed at that moment and had no customer-traffic effect, it was disabled. This change went via the global configuration system, which propagates within seconds to the entire fleet — no canary, no staged rollout, no per-POP health gating. The same system is "under review following the outage we experienced on November 18".
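For contrast, the discipline the global configuration system lacks — staged deployment gated on a health signal, with rollback on regression — can be sketched roughly as follows. This is illustrative only (not Cloudflare's deployment system); the stage percentages and the health callback are assumptions of this note.

```rust
// Illustrative sketch of a staged rollout with health gating. The global
// configuration system described above has none of these gates: one write
// reaches the whole fleet in seconds.
fn staged_rollout(
    stages: &[usize],                      // e.g. percent of fleet per stage
    healthy_after: impl Fn(usize) -> bool, // health signal observed per stage
) -> Result<(), usize> {
    for &pct in stages {
        // (deploy the change to `pct` percent of the fleet here)
        if !healthy_after(pct) {
            // health regression: stop widening and roll back what shipped
            return Err(pct); // report the stage where the regression appeared
        }
    }
    Ok(())
}
```

A change that breaks at the 25% stage is halted there instead of reaching 100% of the fleet, which is exactly the property the 12-05 trigger change did not have.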
### 3. The rulesets engine

Cloudflare's rulesets engine evaluates sets of (filter, action) rules against each request. Typical actions: `block`, `log`, `skip`. The `execute` action triggers evaluation of a sub-ruleset — an escape hatch used by Cloudflare's internal logging system to evaluate new (not-yet-public) test rules. The top-level ruleset runs `execute` → the sub-ruleset of test rules evaluates → results flow back up.
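The rule / action / `execute` / sub-ruleset shape described above can be sketched as follows. Rust is used as neutral notation; `Rule`, `Action`, and `evaluate` are illustrative names, not Cloudflare's actual engine code.

```rust
use std::collections::HashMap;

enum Action {
    Block,
    Log,
    Skip,
    // `execute` triggers evaluation of a named sub-ruleset.
    Execute(String),
}

struct Rule {
    // In the real engine the filter is an expression evaluated per request;
    // here it is a simple predicate on the request path.
    filter: fn(&str) -> bool,
    action: Action,
}

// Evaluate a ruleset against a request path, recursing into sub-rulesets
// referenced by Execute actions; matched actions accumulate in `hits`.
fn evaluate(
    rulesets: &HashMap<String, Vec<Rule>>,
    name: &str,
    path: &str,
    hits: &mut Vec<String>,
) {
    for rule in &rulesets[name] {
        if !(rule.filter)(path) {
            continue;
        }
        match &rule.action {
            Action::Block => hits.push(format!("{name}: block")),
            Action::Log => hits.push(format!("{name}: log")),
            Action::Skip => continue,
            Action::Execute(sub) => evaluate(rulesets, sub, path, hits),
        }
    }
}
```

A top-level rule with `Action::Execute("waf-testing")` would descend into the test ruleset and surface its results back up, matching the escape-hatch flow the post describes.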
### 4. The killswitch subsystem
The rulesets engine has a killswitch that can disable a misbehaving rule. The killswitch receives its input from the global configuration system. There is a well-defined SOP for using it, which was followed correctly here. The killswitch is the mechanism by which the internal test-rules ruleset was disabled.
### 5. The seven-year dormant bug

The killswitch had never before been applied to a rule with `action=execute`. When it was:

- Rule evaluation code correctly skipped the `execute` action (did not evaluate the sub-ruleset).
- Rule-result post-processing then unconditionally ran `rule_result.execute.results = ruleset_results[...]`.
- Because the rule had been skipped, `rule_result.execute` (the object holding the execute-action metadata) did not exist, and Lua threw the nil-index exception quoted in the summary.

The code had existed for years, never fired this path, and was not caught by code review / testing / production traffic — a classic untriggered-path bug. Cloudflare explicitly attributes the class to dynamic typing: "This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur."
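A minimal sketch of the type-system point, with hypothetical field names (`execute`, `results`) modeled on the post: the FL1 Lua path performed the assignment unconditionally, whereas a Rust equivalent must represent the skipped rule as an `Option<_>`, and the compiler refuses to compile post-processing that ignores the absent case.

```rust
struct ExecuteMeta {
    results: Vec<String>,
}

struct RuleResult {
    // None when the killswitch skipped the execute action entirely.
    // The Lua version had no such type-level distinction: the field was
    // simply nil, and indexing it threw at runtime.
    execute: Option<ExecuteMeta>,
}

// Post-processing that compiles only because the None case is handled.
fn attach_results(result: &mut RuleResult, ruleset_results: Vec<String>) {
    match result.execute.as_mut() {
        Some(meta) => meta.results = ruleset_results,
        // A skipped rule contributes no results instead of crashing
        // the request handler.
        None => {}
    }
}
```

Deleting the `None` arm is a compile error, which is the structural sense in which "this type of code error is prevented by languages with strong type systems".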
### 6. The affected customer set
Both conditions had to hold for a customer to be impacted:
- Traffic served by the legacy FL1 proxy (not the newer FL2 Rust rewrite).
- Had the Cloudflare Managed Ruleset deployed.
Requests for websites in this configuration returned HTTP 500 for everything except a small set of test endpoints like `/cdn-cgi/trace`. Customers not matching both conditions — including the entire China network — were unaffected. The impacted subset was ~28% of Cloudflare's HTTP traffic.
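The impact condition reduces to a simple conjunction. A trivial sketch (treating the separately-stated China-network exemption as a third flag is an assumption of this note, not a claim about how Cloudflare's routing actually expresses it):

```rust
// A request was impacted only if all three conditions held:
// served by legacy FL1, Managed Ruleset deployed, and not on the
// (unaffected) China network.
fn impacted(on_fl1: bool, managed_ruleset: bool, china_network: bool) -> bool {
    on_fl1 && managed_ruleset && !china_network
}
```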
## Stated remediation
Cloudflare names three resiliency project families (projects the company committed to after the 2025-11-18 incident that were not yet deployed on 12-05):
- Enhanced Rollouts & Versioning — apply code-style progressive-deployment + health validation + rollback to all data used for rapid threat response and general configuration. Covers the global configuration system.
- Streamlined break-glass capabilities — ensure critical operations remain available under additional types of failures, for internal services and all standard customer-facing control-plane entry points.
- "Fail-Open" Error Handling — replace hard-fail logic in critical data-plane components; corrupt / out-of-range config → log + fall back to known-good state, or pass traffic without scoring, rather than drop. Some services to offer per-customer fail-open vs fail-closed choice. Drift-prevention continuously enforces this.
A fuller breakdown is promised within the week. In the interim, all changes to the network are locked down until better mitigation and rollback systems are in place.
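The stated fail-open stance can be sketched as a wrapper around WAF scoring (all names hypothetical): an unexpected scoring error is logged and the request passes unscored, rather than turning an internal exception into an HTTP 500.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Scored(u32),
    // Fail-open: the request is served without WAF scoring.
    PassedUnscored,
}

// Wrap scoring so that a hard failure degrades instead of dropping traffic.
fn score_with_fail_open(score: impl Fn() -> Result<u32, String>) -> Verdict {
    match score() {
        Ok(s) => Verdict::Scored(s),
        Err(e) => {
            // Log the error and keep serving, rather than returning a 500.
            eprintln!("WAF scoring failed, failing open: {e}");
            Verdict::PassedUnscored
        }
    }
}
```

Under this stance, the 12-05 nil-index exception would have degraded affected requests to serve-without-scoring instead of a fleet-wide 500.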
## Caveats / context
- No customer data compromised. Availability incident, not confidentiality.
- No attack / malicious activity involved. Neither the React-CVE patch effort that motivated the first change nor the outage itself involved an attacker. The CVE is an industry-wide React Server Components bug, disclosed that week; Cloudflare's WAF rule to mitigate it was an underlying-ecosystem response, not an attacker response.
- Two global outages in three weeks. The 2025-11-18 incident ("longer availability incident") and this one share the structural property that a single change propagated to the entire network and broke nearly all customers. Cloudflare acknowledges the projects that would have prevented both are not yet complete.
- No per-service throughput / capacity / fleet-size numbers. The post is an RCA, not a capacity piece. The one scaled number given is the 28% traffic share of affected customers.
- The "28% of all HTTP traffic" figure is the affected intersection, not the FL1-vs-FL2 split. Cloudflare does not disclose the FL1/FL2 traffic split directly; one can infer FL1 share is ≥ 28% but no higher bound is given.
- Cloudflare's specific bug-class attribution — "languages with strong type systems" — is unusually explicit for a post-mortem. Most vendors either fix the bug without language-blame or announce a rewrite without pointing at a specific class. Cloudflare does both.
- The internal WAF testing tool is not a customer product — it's a production-traffic-evaluation environment for new rules before they're publicly rolled out. Its being load-bearing-on-the-hot-path is itself a finding: the request-processing code path structurally couples to the testing-tool configuration, so disabling the tool mutated the request path.
## Source
- Original: https://blog.cloudflare.com/5-december-2025-outage/
- Raw markdown: raw/cloudflare/2025-12-05-cloudflare-outage-on-december-5-2025-d9d30283.md
## Related
- systems/cloudflare-fl1-proxy — the affected legacy Lua-on-nginx proxy
- systems/cloudflare-fl2-proxy — the Rust-based successor where the bug doesn't occur
- systems/cloudflare-rulesets-engine — the rule / action / execute / killswitch subsystem that detonated
- systems/cloudflare-waf — the body-parsing layer whose buffer change was the benign first change
- systems/cloudflare-managed-ruleset — the customer-facing ruleset configuration required for impact
- concepts/killswitch-subsystem — the mechanism that ran the never-before-exercised code path
- concepts/nil-index-lua-bug — the dynamic-typing failure class that produced the exception
- concepts/global-configuration-system — the single-action global-change surface this incident rode on
- concepts/fail-open-vs-fail-closed — the stated remediation stance for data-plane errors
- patterns/global-configuration-push — the antipattern the global config system currently is
- patterns/progressive-configuration-rollout — the missing discipline Cloudflare states will be applied
- patterns/rust-replacement-of-dynamic-language-hot-path — the structural rewrite that made FL2 immune
- patterns/fast-rollback — the 09:11 → 09:12 revert was fast; the rollout was not
- patterns/staged-rollout — sibling to progressive-configuration-rollout for code
- sources/2025-07-16-cloudflare-1111-incident-on-july-14-2025 — prior Cloudflare post-mortem naming the same missing progressive-config discipline on a different legacy surface (addressing)
- concepts/latent-misconfiguration — the config-surface sibling of this code-surface dormant-bug class
- concepts/memory-safety / concepts/program-correctness — bug classes strong-type-system languages structurally prevent
- companies/cloudflare