
CLOUDFLARE 2025-12-05 Tier 1


Cloudflare outage on December 5, 2025

Summary

On 2025-12-05 at 08:47 UTC, a portion of Cloudflare's network began serving HTTP 500 errors for a subset of customers. The incident was resolved at 09:12 UTC, roughly 25 minutes of total impact. Approximately 28% of Cloudflare's total HTTP traffic was affected.

Root cause was not an attack. A WAF body-parsing change was being rolled out to protect customers against CVE-2025-55182 in React Server Components: the Cloudflare proxy's in-memory HTTP request-body buffer was being raised from 128 KB to 1 MB (the default Next.js limit). This first change was going out via gradual deployment — safely.

During the rollout, an internal WAF testing tool turned out not to support the larger buffer. Because the tool was not needed at that time and had no customer-traffic effect, a second change was made to turn it off — this one through Cloudflare's global configuration system, which does not perform gradual rollouts: changes propagate to the entire fleet within seconds.

In the legacy FL1 proxy (Lua-on-nginx / OpenResty), this second change hit a seven-year-old unnoticed bug in the rulesets engine's killswitch path. The killswitch correctly skipped evaluating the execute-action rule — but the post-processing step then unconditionally dereferenced rule_result.execute.results, which no longer existed because the rule had been skipped. Lua threw a nil-index exception:

[lua] Failed to run module rulesets callback late_routing:
/usr/local/nginx-fl/lua/modules/init.lua:314:
attempt to index field 'execute' (a nil value)

Every affected request returned HTTP 500. Customers on the newer Rust-based FL2 proxy were unaffected — "In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur." Customers on the China network were also unaffected.

The change was reverted at 09:11 UTC; all traffic was restored by 09:12 UTC. This is the second major self-inflicted outage in three weeks — the [[sources/2025-11-18-cloudflare-outage-on-november-18-2025|18 November 2025 incident]] was acknowledged as structurally similar (a single change propagating to the entire network), and the same remediation projects (staged rollouts for config, streamlined break-glass, fail-open data-plane error handling) are stated as still incomplete.

Key takeaways

  1. A seven-year-old dormant Lua bug detonated because a never-before-used code path finally ran. The bug was in the rulesets engine's killswitch subsystem — specifically the path that handles applying a killswitch to a rule with action=execute, which had never been exercised in production before. When the killswitch fired, the evaluation code correctly skipped the execute (so no sub-ruleset evaluation happened), but the post-processing code unconditionally did rule_result.execute.results = ruleset_results[...]. Because the rule had been skipped, rule_result.execute was nil — nil-index exception, HTTP 500. "This is a straightforward error in the code, which had existed undetected for many years. This type of code error is prevented by languages with strong type systems." Sibling wiki instance of concepts/latent-misconfiguration applied not to config but to code — dormant wrong code, gated by a precondition that happens not to hold, activated by the first time that precondition is released. (Canonical wiki instance of concepts/nil-index-lua-bug.)

  2. Strong-type-system languages structurally prevent this bug class; FL2 didn't break. The post states explicitly that the Rust rewrite in FL2 does not have the bug — not because it was re-found and fixed during the rewrite, but because "This type of code error is prevented by languages with strong type systems". In Rust, the equivalent of rule_result.execute.results = ... requires unwrapping an Option<_> or pattern-matching a variant; the compiler does not let the code compile without handling the absent case. Pairs with concepts/memory-safety / Aurora DSQL / Dropbox Nucleus as another data-point for new code → safe language at the margin (the Android-team framing); the 2025-10 Workers / V8 post describes the same reflex at a different layer. (Canonical wiki instance of patterns/rust-replacement-of-dynamic-language-hot-path.)

  3. Cloudflare runs two proxy generations side-by-side mid-migration and the fault surfaced on the legacy one. Only customers on FL1 + Cloudflare Managed Ruleset were impacted; FL2 customers and China customers were fine. Same shape as the 2025-07-14 legacy-vs-strategic addressing-system incident: a long-running dual-system migration where the legacy surface carries the latent hazards and the strategic surface has moved past them. The stated remediation again is to accelerate migration rather than hardening the legacy surface in place.

  4. The global configuration system is a second anycast-class single-action global-change surface. Distinct from the addressing / topology system that caused 2025-07-14 but with the same structural property: one edit → entire fleet in seconds, no canary, no health-mediated stages. The 2025-11- 18 incident post-mortem flagged exactly this system as "under review" — and two-and-a-half weeks later it was the delivery mechanism for this change. Cloudflare's stated remediation is staged rollouts with health validation and quick-rollback for all data used for rapid threat response and general configuration, not just BGP/topology. (Second canonical wiki instance of patterns/global-configuration-push as an antipattern surface; absence-of-pattern instance of patterns/progressive-configuration-rollout.)

  5. Turning a thing OFF through the fast path is dangerous even when the thing is "not needed" and "has no customer-traffic effect". The first change (128 KB → 1 MB buffer) was being rolled out gradually — the right way. The second change (turning off the internal WAF testing tool so it wouldn't choke on the new buffer size) went through the global configuration system because it was internal and believed harmless. It was not harmless. Internal-tool changes that interact with customer-traffic code paths must live under the same safety discipline as customer-facing code: "critical operations can still be achieved in the face of additional types of failures. This applies to internal services as well as all standard methods of interaction with the Cloudflare control plane used by all Cloudflare customers."

  6. Fail-open vs fail-closed is explicit in the stated remediation. Cloudflare names "Fail-Open" Error Handling as the general remediation: "we are replacing the incorrectly applied hard-fail logic across all critical Cloudflare data-plane components. If a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests. Some services will likely give the customer the option to fail open or closed in certain scenarios." The 12-05 bug was a hard-fail Lua exception; a fail-open stance on unexpected rules-engine errors would have degraded to serve-without-WAF-scoring rather than serve 500. (Canonical wiki instance of concepts/fail-open-vs-fail-closed.)

  7. Public post-mortems continue to name the missing discipline, not just the specific bug. Same pattern as the 1.1.1.1 and 11-18 posts: the RCA identifies the class of fix (progressive rollout, break-glass, fail-open) and names the legacy surfaces that lack it. The blog closes by locking down all changes to the network until better mitigation and rollback systems are in place — an unusually strong operational stance.

Timeline

Time (UTC) Status Description
08:47 INCIDENT start Configuration change deployed and propagated to the network
08:48 Full impact Change fully propagated
08:50 INCIDENT declared Automated alerts
09:11 Change reverted Configuration change reverted and propagation start
09:12 INCIDENT end Revert fully propagated, all traffic restored

Elapsed: 25 minutes customer-visible impact. 3 minutes start-to-alert, 21 minutes alert-to-revert-initiated, 1 minute revert-to-full-restoration.

Mechanism

1. The buffer-size change (benign)

Cloudflare's WAF buffers HTTP request bodies in memory for analysis. Buffer size had been 128 KB. To protect customers against CVE-2025-55182 in React Server Components, Cloudflare was rolling out a new buffer cap of 1 MB (the default Next.js limit). This rollout went through the gradual deployment system — canary, staged, health-monitored — and did not cause the incident.

2. The WAF-testing-tool disable (the trigger)

The internal WAF testing tool did not support the new buffer size. Because the tool was not needed at that moment and had no customer-traffic effect, it was disabled. This change went via the global configuration system, which propagates within seconds to the entire fleet — no canary, no staged rollout, no per-POP health gating. The same system is "under review following the outage we experienced on November 18".

3. The rulesets engine

Cloudflare's rulesets engine evaluates sets of (filter, action) rules against each request. Typical actions: block, log, skip. The execute action triggers evaluation of a sub-ruleset — an escape hatch used by Cloudflare's internal logging system to evaluate new (not-yet-public) test rules. The top-level ruleset runs execute → the sub-ruleset of test rules evaluates → results flow back up.
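The action model described above can be sketched in Rust terms. This is a hypothetical illustration, not Cloudflare's actual engine or API: the type and function names (Action, Rule, evaluate) are invented, and the real engine evaluates (filter, action) pairs against live requests.

```rust
use std::collections::HashMap;

// Hypothetical sketch of the rulesets-engine action model: block, log,
// skip, plus an `execute` escape hatch that recurses into a sub-ruleset.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    Block,
    Log,
    Skip,
    // `Execute` triggers evaluation of a named sub-ruleset; its results
    // flow back up into the parent ruleset's results.
    Execute { sub_ruleset: String },
}

struct Rule {
    id: String,
    action: Action,
}

/// Evaluate a ruleset top-down; `Execute` rules recurse into their sub-ruleset.
fn evaluate(rules: &[Rule], rulesets: &HashMap<String, Vec<Rule>>) -> Vec<String> {
    let mut results = Vec::new();
    for rule in rules {
        match &rule.action {
            Action::Block => results.push(format!("{}: block", rule.id)),
            Action::Log => results.push(format!("{}: log", rule.id)),
            Action::Skip => {} // skipped rule contributes no result
            Action::Execute { sub_ruleset } => {
                // Sub-ruleset results flow back into the parent's results.
                if let Some(sub) = rulesets.get(sub_ruleset) {
                    results.extend(evaluate(sub, rulesets));
                }
            }
        }
    }
    results
}
```

In this shape, the internal logging system's test rules live in the sub-ruleset that the top-level execute rule points at.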

4. The killswitch subsystem

The rulesets engine has a killswitch that can disable a misbehaving rule. The killswitch receives its input from the global configuration system. There is a well-defined SOP for using it, which was followed correctly here. The killswitch is the mechanism by which the internal test-rules ruleset was disabled.
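The killswitch's filtering role can be sketched as a gate over rule IDs. This is an assumption-laden illustration — the function and its signature are invented; the real killswitch receives its input from the global configuration system:

```rust
use std::collections::HashSet;

// Hypothetical killswitch gate: rule IDs present in the (globally
// propagated) killswitch set are skipped during evaluation.
fn active_rules<'a>(rules: &'a [&'a str], killswitch: &HashSet<&str>) -> Vec<&'a str> {
    rules
        .iter()
        .copied()
        .filter(|id| !killswitch.contains(id))
        .collect()
}
```

The subtlety on 12-05 was not this filtering step, which worked, but what downstream post-processing assumed about the rules that had been filtered out.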

5. The seven-year dormant bug

The killswitch had never before been applied to a rule with action=execute. When it was:

  • Rule evaluation code correctly skipped the execute action (did not evaluate the sub-ruleset).
  • Rule-result post-processing then ran:
    if rule_result.action == "execute" then
      rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
    end
    
  • Because the rule had been skipped, rule_result.execute (the object holding the execute-action metadata) did not exist. Lua threw a nil-index exception:
    [lua] Failed to run module rulesets callback late_routing:
    /usr/local/nginx-fl/lua/modules/init.lua:314:
    attempt to index field 'execute' (a nil value)
    

The code had existed for years, never fired this path, and was not caught by code review / testing / production traffic — a classic untriggered-path bug. Cloudflare explicitly attributes the class to dynamic typing: "This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur."
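A minimal Rust sketch shows why the equivalent code cannot compile without handling the skipped case. The types here (RuleResult, ExecuteMeta, post_process) are hypothetical stand-ins for the Lua structures above, not FL2's actual code:

```rust
struct ExecuteMeta {
    results_index: usize,
}

struct RuleResult {
    action: &'static str,
    // In Lua, `rule_result.execute` was an implicitly nilable field.
    // In Rust the possible absence is explicit in the type: an Option
    // cannot be dereferenced without handling the None case.
    execute: Option<ExecuteMeta>,
}

fn post_process(rule_result: &RuleResult, ruleset_results: &[String]) -> Option<String> {
    if rule_result.action == "execute" {
        // Writing `rule_result.execute.results_index` directly, as the Lua
        // code effectively did, will not compile; the match forces the
        // killswitched/skipped case to be handled.
        match &rule_result.execute {
            Some(meta) => ruleset_results.get(meta.results_index).cloned(),
            None => None, // rule was skipped by a killswitch; nothing to attach
        }
    } else {
        None
    }
}
```

The killswitched rule degrades to an empty result instead of crashing the request — the property the post attributes to "languages with strong type systems".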

6. The affected customer set

Both conditions had to hold for a customer to be impacted:

  • Traffic served by the legacy FL1 proxy (not the Rust-based FL2).
  • The Cloudflare Managed Ruleset enabled for the zone.

Requests for websites in this configuration returned HTTP 500 for everything except a small set of test endpoints like /cdn-cgi/trace. Customers not matching both conditions — including the entire China network — were unaffected. The impacted subset was ~28% of Cloudflare's HTTP traffic.
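The impact criterion is a simple conjunction, sketched here with illustrative field names (the real eligibility logic is internal to Cloudflare):

```rust
// Hypothetical sketch of the two-condition impact predicate.
struct Zone {
    on_fl1: bool,               // served by the legacy FL1 proxy
    uses_managed_ruleset: bool, // Cloudflare Managed Ruleset enabled
}

fn impacted(zone: &Zone) -> bool {
    // Both conditions must hold; FL2 zones and zones without the
    // Managed Ruleset were unaffected.
    zone.on_fl1 && zone.uses_managed_ruleset
}
```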

Stated remediation

Cloudflare names three resiliency project families (projects the company committed to after the 2025-11-18 incident that were not yet deployed on 12-05):

  1. Enhanced Rollouts & Versioning — apply code-style progressive-deployment + health validation + rollback to all data used for rapid threat response and general configuration. Covers the global configuration system.
  2. Streamlined break-glass capabilities — ensure critical operations remain available under additional types of failures, for internal services and all standard customer-facing control-plane entry points.
  3. "Fail-Open" Error Handling — replace hard-fail logic in critical data-plane components; corrupt / out-of-range config → log + fall back to known-good state, or pass traffic without scoring, rather than drop. Some services to offer per-customer fail-open vs fail-closed choice. Drift-prevention continuously enforces this.
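The fail-open stance can be sketched as a wrapper around the scoring path. This is a hedged illustration under invented names (Disposition, score_request, handle_request), not Cloudflare's implementation: an unexpected scoring error is logged and the request is served without a score, rather than failing with a 500.

```rust
#[derive(Debug, PartialEq)]
enum Disposition {
    Scored(u32),          // WAF evaluated the request and produced a score
    PassedWithoutScoring, // fail-open: scoring errored, traffic still served
}

// Placeholder scoring step that can fail, e.g. on an out-of-range input.
fn score_request(body_len: usize, buffer_cap: usize) -> Result<u32, String> {
    if body_len > buffer_cap {
        return Err(format!("body {} exceeds buffer cap {}", body_len, buffer_cap));
    }
    Ok(0) // placeholder score
}

fn handle_request(body_len: usize, buffer_cap: usize) -> Disposition {
    match score_request(body_len, buffer_cap) {
        Ok(score) => Disposition::Scored(score),
        Err(e) => {
            // Fail-open: log the error and default to serving the request
            // without WAF scoring, instead of dropping it with a 500.
            eprintln!("WAF scoring failed, passing without scoring: {}", e);
            Disposition::PassedWithoutScoring
        }
    }
}
```

Under this stance the 12-05 failure mode degrades to serve-without-scoring; the hard-fail alternative is what actually shipped on FL1.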

A fuller breakdown is promised within the week. In the interim, all changes to the network are locked down until better mitigation and rollback systems are in place.

Caveats / context

  • No customer data compromised. Availability incident, not confidentiality.
  • No attack / malicious activity involved. Neither the React-CVE patch effort that motivated the first change nor the outage itself involved an attacker. The CVE is an industry-wide React Server Components bug, disclosed that week; Cloudflare's WAF rule to mitigate it was an underlying-ecosystem response, not an attacker response.
  • Two global outages in three weeks. The 2025-11-18 incident ("longer availability incident") and this one share the structural property that a single change propagated to the entire network and broke nearly all customers. Cloudflare acknowledges the projects that would have prevented both are not yet complete.
  • No per-service throughput / capacity / fleet-size numbers. The post is an RCA, not a capacity piece. The one scaled number given is the 28% traffic share of affected customers.
  • The "28% of all HTTP traffic" figure is the affected intersection, not the FL1-vs-FL2 split. Cloudflare does not disclose the FL1/FL2 traffic split directly; one can infer FL1 share is ≥ 28% but no higher bound is given.
  • Cloudflare's specific bug-class attribution — "languages with strong type systems" — is unusually explicit for a post-mortem. Most vendors either fix the bug without language-blame or announce a rewrite without pointing at a specific class. Cloudflare does both.
  • The internal WAF testing tool is not a customer product — it's a production-traffic-evaluation environment for new rules before they're publicly rolled out. Its being load-bearing-on-the-hot-path is itself a finding: the request-processing code path structurally couples to the testing-tool configuration, so disabling the tool mutated the request path.
