
CLOUDFLARE 2025-11-18 Tier 1


Cloudflare outage on November 18, 2025

Summary

On 2025-11-18 at 11:20 UTC, Cloudflare's network began experiencing significant failures delivering core network traffic. Core traffic was largely flowing as normal by 14:30 UTC; all systems were fully restored by 17:06 UTC. That is ~3 hours 10 minutes of core-traffic impact and ~5 hours 46 minutes end-to-end.

Root cause was not an attack. A gradual permission-management migration on a ClickHouse database cluster — rolling out explicit grants so that distributed queries could run under the initiating user account rather than a shared system account — changed the result of an existing metadata query used by Bot Management's feature-file generator. The query (SELECT name, type FROM system.columns WHERE table = 'http_requests_features' ORDER BY name) did not filter by database name. Before the migration, users could only see columns from the default database; after the migration, users with the new explicit grants saw columns from both default and r0 (where the underlying shard tables live). The feature-file generator therefore produced a file with ~doubled rows — each feature duplicated once for default and once for r0.

The Bot Management module on Cloudflare's new Rust-based FL2 proxy enforces a fixed feature-file size limit of 200 features (well above the ~60 actually in use; the headroom exists because the memory is preallocated as an optimization). The doubled file blew past 200; a Rust .unwrap() on the bounds check panicked; the fl2_worker_thread died on every request that reached the bots module, returning HTTP 5xx. Customers on the legacy FL1 proxy did not see 5xx errors, but every request got a bot score of 0, so rules that blocked bot traffic saw a flood of false positives.

The feature file was regenerated every 5 minutes by a distributed ClickHouse query. Because the permission migration was being rolled out gradually across ClickHouse nodes, the file alternated good/bad depending on which node ran the query, producing ~5-minute on/off oscillations that mimicked a hyperscale DDoS attack. Cloudflare's own status page (hosted off Cloudflare infrastructure, with no dependencies on it) went down coincidentally at the same time, deepening the incorrect DDoS suspicion. Teams initially investigated the wrong hypothesis for about 40 minutes.

Additional impacted internal products: Workers KV (core-proxy-dependent), Access (dependent on Workers KV and the core proxy), Turnstile (challenge widget blocked most Dashboard logins), Email Security (reduced spam-detection accuracy, no critical customer impact), Cloudflare Dashboard (inaccessible to users without active sessions because of Turnstile).

Resolution: stopped the generation and propagation of the bad feature file; inserted a known-good file into the feature-file distribution queue; forced a core-proxy restart. Fix began taking effect at 14:30 UTC; remaining long tail of services restarted by 17:06 UTC.

This is Cloudflare's worst outage since 2019 — the first since 2019 to stop the majority of core traffic flowing through the network.

Key takeaways

  1. A permissions-management migration on an upstream database caused a downstream feature-file to double in size — a classical transitive-dependency-through-SQL-metadata bug. The ClickHouse migration was itself correct: it added explicit grants so that distributed-subquery privilege checks could run under the initiating user, not a shared system account (sensible least-privilege hardening). But a separate piece of code (Bot Management's feature-file generator) had an implicit assumption — that a system.columns WHERE table = ... query returned only rows from the default database. That assumption became false the moment the grant made r0's columns visible. Neither team could reasonably have caught this at review time: the ClickHouse team didn't own the feature-file generator, and the Bot Management team didn't own the ClickHouse grants. The bug surfaced only at the intersection of the two systems. Canonical wiki instance of concepts/database-permission-migration-risk.

  2. Preallocated-memory-budget size limits on internally-generated data are a fragile load-bearing invariant. The Bot Management module on FL2 uses a fixed-size in-memory feature file with a 200-feature cap, set "well above" the ~60 features actually in use for performance reasons (preallocated memory means no per-request allocation, no allocator churn, no out-of-cache surprises). This is a legitimate hot-path optimization. But it makes the size cap a load-bearing invariant that must always hold, and the data feeding it was internally generated — outside any ingestion-validation discipline Cloudflare applies to customer input. Canonical wiki instance of concepts/preallocated-memory-budget as a reliability-hazard surface.

  3. Internally-generated config files must be treated like user-generated input. Cloudflare's stated #1 remediation project: "Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input." The failure class is: "we trust this because we generate it" — and then a change three systems removed invalidates the trust. Canonical wiki instance of patterns/harden-ingestion-of-internal-config and concepts/internally-generated-untrusted-input.

  4. Rust .unwrap() on a bounds check panicked the worker thread; there was no fail-open path. The FL2 code path that actually crashed:

    thread fl2_worker_thread panicked:
      called Result::unwrap() on an Err value
    
    This is a Rust .unwrap() on an Option<T>/Result<T,E> where the Err path was not handled — behaviorally equivalent to the nil-index exception in the 2025-12-05 Lua FL1 outage but in the opposite proxy generation. Cloudflare's stated remediation: "Reviewing failure modes for error conditions across all core proxy modules." The 12-05 outage then names the same project as "Fail-Open" Error Handling — and frames it as still-incomplete. Canonical wiki instance of concepts/unhandled-rust-panic as a failure class that does not care whether you wrote the code in Lua or Rust: the absence of a fail-open path is the bug.

  5. Intermittent good/bad alternation made the incident look like an attack. The feature file regenerated every 5 minutes; the ClickHouse permission grants were rolled out gradually across cluster nodes; some runs landed on migrated nodes (bad file), others on non-migrated nodes (good file). The global network oscillated good/bad on a 5-minute cadence until every ClickHouse node had the grant (at which point the failure stabilized). This is very unusual for an internal bug — most internal bugs are monotone — and matches the attack-signature shape of a sophisticated DDoS probe. Cloudflare's status page going down coincidentally at the same time reinforced the attack hypothesis. The team spent roughly the first 40 minutes on the DDoS path before getting to Bot Management. Canonical wiki instance of concepts/intermittent-failure-signal-confusion.

  6. The "generated every 5 minutes, propagated to the entire network" feature-file delivery channel is a global configuration system — a single surface where one bad input reaches the entire fleet in seconds. Distinct-but-sibling to the addressing-system surface that caused 2025-07-14 and the global-config surface that caused 2025-12-05. Cloudflare's stated remediation: "Enabling more global kill switches for features" — i.e., accept that rapid global propagation is required for threat response, but give yourself an orthogonal fast-off path so a feature consuming a bad config can be disabled independently of trying to clean up the config. Canonical wiki instance of patterns/global-feature-killswitch as the structural companion to patterns/progressive-configuration-rollout — progressive rollout bounds how fast bad config reaches the fleet; global killswitch bounds how fast a feature consuming bad config can be taken out of the hot path.

  7. Debugging / observability systems amplified the blast radius. The post notes: "large amounts of CPU being consumed by our debugging and observability systems, which automatically enhance uncaught errors with additional debugging information" — latency spiked beyond the direct 5xx impact because every uncaught panic triggered enhanced-error-info generation, and the volume overwhelmed the nodes. Cloudflare's stated remediation: "Eliminating the ability for core dumps or other error reports to overwhelm system resources." Error-handling systems are not exempt from blast-radius discipline.

  8. CDN-scale operators have a structural obligation to the public Internet that most software companies don't. Cloudflare's framing: "Given Cloudflare's importance in the Internet ecosystem any outage of any of our systems is unacceptable. That there was a period of time where our network was not able to route traffic is deeply painful to every member of our team. We know we let you down today." The incident made the front page of every major newspaper because it affected Internet access for a significant fraction of the world's users. The non-technical blast radius — people couldn't check their email, log into banking, get answers from AI assistants — is the first-order cost of edge-CDN centralization. Canonical wiki instance of concepts/cdn-as-critical-internet-infrastructure.
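Takeaway 7's remediation (bounding error-report generation) can be sketched minimally. This is an assumed design, not the actual Cloudflare observability stack; `ReportBudget` and the per-window cap are illustrative. The idea: a cheap budget caps how many expensive "enhanced" error reports get built per window, so a panic storm cannot starve the data path of CPU.

```rust
// Assumed design sketch: cap expensive error-report generation per window
// so a flood of panics cannot consume the CPU the request path needs.
struct ReportBudget {
    used: u32,
    cap: u32,
}

impl ReportBudget {
    fn new(cap: u32) -> Self {
        ReportBudget { used: 0, cap }
    }

    /// True if this error may still produce a full (expensive) report;
    /// beyond the cap, errors are only counted, not enhanced.
    fn try_take(&mut self) -> bool {
        if self.used < self.cap {
            self.used += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Hypothetical cap per reset window; a real system would reset `used`
    // on a timer and export the overflow count as a metric.
    let mut budget = ReportBudget::new(100);
    let emitted = (0..10_000).filter(|_| budget.try_take()).count();
    assert_eq!(emitted, 100); // a 10k-panic storm yields only 100 full reports
}
```

The overflow count (errors seen minus reports emitted) is still worth exporting as a counter, so the cap bounds cost without hiding the failure rate.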

Timeline

Time (UTC) Status Description
11:05 Normal ClickHouse permission-management change deployed (gradual rollout begins)
11:20 Incident start Network begins experiencing failures delivering core traffic
11:28 Impact visible Deployment reaches customer environments; first customer-HTTP-traffic errors
11:31 Automated alert First automated test detects the issue
11:32 Manual investigation Team begins manual investigation
11:32–13:05 DDoS hypothesis investigated Team focuses on Workers KV elevated errors + status-page coincidence; suspects hyperscale attack
11:35 Incident call created
13:04–13:05 Workers KV + Access bypass deployed Workers KV + Access fall back to a prior core proxy version — reduces impact for downstream services
13:37 Root cause identified Team identifies Bot Management + feature file as the cause; begins preparing rollback
14:24 Bad file generation stopped Automatic deployment of new Bot Management configuration files halted
14:24 Good file deployed to test Test of known-good file completes successfully
14:30 Main impact resolved Correct Bot Management configuration file deployed globally; most services recover
14:40–15:30 Dashboard impact (second wave) Backlogged login retries overwhelm control plane; scaled concurrency, restored ~15:30
17:06 All services resolved Remaining long tail of services restarted; 5xx volume fully normal

Elapsed:

  • ~3 hours 10 min of core-traffic impact (11:20 → 14:30).
  • ~5 hours 46 min end-to-end (11:20 → 17:06).
  • ~2 hours 17 min from incident start to root-cause identification (11:20 → 13:37), a window dominated by the DDoS hypothesis.

Mechanism

1. The ClickHouse permission migration (upstream trigger)

Cloudflare's internal ClickHouse cluster hosts Bot Management's telemetry + feature data. Data is sharded; distributed queries target a default database's virtual Distributed tables, which fan out to r0.* underlying shard tables on each node.

Historically, distributed queries ran under a shared system account. Cloudflare was migrating to a model where distributed subqueries run under the initiating user's account — so that resource limits + access grants apply per-user correctly, and one bad subquery from one user cannot harm others. To support this, users received explicit grants on the r0 underlying tables (they already had implicit access through the distributed tables).

The migration was being rolled out gradually across ClickHouse cluster nodes; the grant rollout began at 11:05 UTC.

2. The metadata-visibility change (second-order effect)

Before the grants: SELECT ... FROM system.columns WHERE table = 'http_requests_features' returned only default-database columns (the user couldn't see r0).

After the grants: the same query returned columns from both default and r0. Same table name, two database namespaces both visible. Row count roughly doubled.

The query in question:

SELECT name, type
FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;

Note: it filters on table but not on database. This was fine for years because users could only see one database.
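A toy model of that visibility change (illustrative Rust, not a ClickHouse client; table and column names are invented): rows of `system.columns` are modeled as (database, table, column) tuples, and the filter mirrors the buggy query's table-only predicate versus a hardened table-plus-database predicate.

```rust
// Model: each metadata row is (database, table, column). `visible_dbs` is
// the set of databases the querying user can see; `db_filter` is the
// missing `WHERE database = ...` predicate.
fn visible_columns<'a>(
    rows: &'a [(&'a str, &'a str, &'a str)],
    visible_dbs: &[&str],
    table: &str,
    db_filter: Option<&str>,
) -> Vec<&'a str> {
    rows.iter()
        .filter(|(db, t, _)| visible_dbs.contains(db) && *t == table)
        .filter(|(db, _, _)| db_filter.map_or(true, |want| *db == want))
        .map(|(_, _, col)| *col)
        .collect()
}

fn main() {
    // The same table name exists in both namespaces, as in the incident.
    let rows = [
        ("default", "http_requests_features", "feat_a"),
        ("default", "http_requests_features", "feat_b"),
        ("r0", "http_requests_features", "feat_a"),
        ("r0", "http_requests_features", "feat_b"),
    ];

    // Before the grant: only `default` visible; table-only filter is fine.
    let before = visible_columns(&rows, &["default"], "http_requests_features", None);
    assert_eq!(before.len(), 2);

    // After the grant: both namespaces visible; table-only filter doubles.
    let after = visible_columns(&rows, &["default", "r0"], "http_requests_features", None);
    assert_eq!(after.len(), 4);

    // Adding the database predicate restores the old result.
    let fixed = visible_columns(&rows, &["default", "r0"], "http_requests_features", Some("default"));
    assert_eq!(fixed.len(), 2);
}
```

The same shape explains why the bug was latent for years: with one visible database, the table-only and table-plus-database predicates are extensionally identical.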

3. The feature-file generation (downstream consumer)

Bot Management's feature-file generator ran the above query every 5 minutes to discover available features, then built a binary feature file used by the FL2 proxy's bots module to score every incoming request. The feature file is propagated to the entire Cloudflare network minutes after generation.

The doubled-row result produced a feature file with doubled feature count.

4. The size-limit panic on FL2 (Rust, crash path)

The bots module on FL2 preallocates memory for features for performance reasons. The preallocation cap is 200 features, well above the ~60 actually used. Accepting more than 200 would require runtime allocation, branching code paths, and extra allocator work, so the implementation bounds-checks at load time.

The check is implemented as a Rust .unwrap() on a Result<T,E> (or equivalent bounds check that panics on over-limit):

thread fl2_worker_thread panicked:
  called Result::unwrap() on an Err value

Every FL2 worker thread that tried to load the oversized file panicked. Every request that hit the bots module returned HTTP 5xx.
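A minimal sketch of both halves of this section, assuming a design like the one described (the 200-feature cap and ~60 typical features come from the post-mortem; `FeatureTable`, `load`, and `bot_score` are invented names, not FL2 source). It shows the preallocation design (loading never allocates at request time), the load-time bounds check returning an Err instead of panicking, and the fail-open alternative to `.unwrap()`: serve the request unscored rather than killing the worker.

```rust
const CAP: usize = 200; // preallocation cap from the post-mortem

struct FeatureTable {
    values: [f32; CAP], // preallocated once; no per-request allocation
    len: usize,
}

impl FeatureTable {
    fn new() -> Self {
        FeatureTable { values: [0.0; CAP], len: 0 }
    }

    /// The bounds check: an oversized file is an Err, and the previously
    /// loaded good state is left intact. This is the exact spot where the
    /// incident path called .unwrap() instead of handling the Err.
    fn load(&mut self, features: &[f32]) -> Result<(), usize> {
        if features.len() > CAP {
            return Err(features.len());
        }
        self.values[..features.len()].copy_from_slice(features);
        self.len = features.len();
        Ok(())
    }
}

/// Fail-open: a failed load degrades to "no bot score" for the request
/// instead of a panic that becomes an HTTP 5xx.
fn bot_score(table: &FeatureTable, load_ok: bool) -> Option<f32> {
    if load_ok && table.len > 0 {
        Some(0.5) // stand-in for real scoring over table.values[..table.len]
    } else {
        None // request still flows, just unscored
    }
}

fn main() {
    let mut table = FeatureTable::new();

    let good = table.load(&vec![1.0; 60]).is_ok(); // normal file
    assert!(good && bot_score(&table, good).is_some());

    let bad = table.load(&vec![1.0; 240]).is_ok(); // oversized file
    assert!(!bad); // rejected, no panic
    assert_eq!(table.len, 60); // old file still loaded
    assert!(bot_score(&table, bad).is_none()); // degraded, not crashed
}
```

Note the second-order design choice available once the Err is handled: the caller can either skip scoring entirely (as sketched) or keep scoring with the stale-but-valid previous file.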

5. The bot-score-zero fallback on FL1 (Lua, silent path)

The legacy FL1 proxy was also affected but in a different way: FL1's bots module did not panic, but all requests received bot score = 0 (behaviorally equivalent to "this is 100% bot traffic"). For customers with rules that block requests with low bot scores, FL1 silently blocked everything. For customers not using bot score, FL1 was unaffected.

6. The oscillation that mimicked an attack

The ClickHouse permission migration was rolling out gradually across cluster nodes. Some query runs hit a node that already had the grant → bad feature file. Some query runs hit a node that did not yet have the grant → good feature file. The feature-file regeneration cadence (every 5 minutes) meant the network oscillated good/bad/good/bad/... on roughly a 5-minute period, with eventual stabilization at "bad" when every node had the grant.

This oscillation pattern is very uncharacteristic of internal bugs — internal bugs are usually monotonic (either always-failing or always-working). Monotonic failure + monotonic recovery is the debugging profile an engineer has seen a thousand times. Oscillating failure is the profile of an external attacker doing rate-controlled probing. Cloudflare's status page going down at the same time (hosted off-Cloudflare but coincidentally broken) deepened the confusion. Teams spent the first ~40 minutes on the DDoS hypothesis.
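A toy simulation of the oscillation (the 5-minute cadence and gradual rollout are from the post; the node count, scheduling, and migration times are invented): each run lands on one cluster node, runs on already-migrated nodes emit a bad file, and the network flips good/bad until the rollout completes, after which the failure stabilizes.

```rust
/// `migrated_after[i]` = run index at which node i received the grant.
/// Returns true if the feature file produced by this run is bad.
fn file_is_bad(run: usize, migrated_after: &[usize]) -> bool {
    let node = run % migrated_after.len(); // round-robin stand-in for scheduling
    run >= migrated_after[node]
}

fn main() {
    // 4 nodes, migrated at staggered times (hypothetical numbers).
    let migrated_after = [0, 4, 8, 12];
    let states: Vec<bool> = (0..16).map(|r| file_is_bad(r, &migrated_after)).collect();

    // Early runs oscillate: bad on the migrated node, good on the rest.
    assert_eq!(&states[0..4], &[true, false, false, false]);
    // After run 12 every node has the grant: the failure stabilizes at bad.
    assert!(states[12..].iter().all(|&b| b));
}
```

The point of the toy model is the signal shape: a dependency that is itself mid-rollout produces a square-wave failure pattern, which pattern-matches to external probing far better than to an internal config bug.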

7. The downstream product impact

  • Workers KV + Cloudflare Access (both core-proxy-dependent) started returning elevated 5xx. 13:05: teams bypassed the core proxy for both services, falling back to a prior proxy version — reduced impact on KV and on downstream services depending on KV (notably Access itself).
  • Turnstile (Cloudflare's challenge widget) failed to load. The Cloudflare Dashboard uses Turnstile on its login flow, so any user without an active dashboard session could not log in. Two dashboard-impact windows: 11:30–13:10 (due to Workers KV) and 14:40–15:30 (login-retry backlog after feature file was restored; scaled control-plane concurrency fixed this).
  • Email Security: temporary loss of access to an IP reputation source; reduced spam-detection accuracy, some Auto Move action failures. No critical customer impact.
  • Cloudflare Access authentication failures were widespread from incident start through 13:05 (when the Workers KV bypass was deployed). Existing Access sessions were unaffected — only new authentications failed.

8. Resolution path

At 13:37 UTC the team identified Bot Management + the feature file as the cause. Multiple workstreams proceeded in parallel:

  • Stop automatic deployment of new Bot Management files (14:24).
  • Test a known-good file on a subset of the network (14:24).
  • Push the known-good file globally and restart the core proxy (14:30 — main impact resolved).
  • Long tail of service restarts for services that had entered bad states (17:06 — all services resolved).

Stated remediation

Cloudflare names four resiliency project families:

  1. Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input. This is the core "internally-generated does not mean trusted" discipline — validate size, shape, value ranges, cross-field invariants before loading into the hot path. Canonical wiki instance of patterns/harden-ingestion-of-internal-config.

  2. Enabling more global kill switches for features. Accept that rapid global propagation of threat-response data is required (you don't want to delay a DDoS mitigation by canary-rollout hours). But give yourself an orthogonal fast-off path so that a feature consuming bad data can be disabled in seconds without needing to also clean up the data in parallel. Canonical wiki instance of patterns/global-feature-killswitch.

  3. Eliminating the ability for core dumps or other error reports to overwhelm system resources. The CPU spike from auto-enhanced-error-info generation amplified the blast radius. Observability systems operate on the hot path and must themselves be bounded.

  4. Reviewing failure modes for error conditions across all core proxy modules. The FL2 .unwrap() panic is the direct precedent for the [[sources/2025-12-05-cloudflare-outage-on-december-5-2025|12-05]] "Fail-Open" Error Handling project — which was named as still-incomplete ~17 days later.
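Remediation 2 (the global kill switch) can be sketched as a hot-path check on a cheap atomic flag. This implementation is assumed, not Cloudflare's; `BOTS_ENABLED`, `kill_bots_module`, and the return strings are illustrative. The key property: the off switch is orthogonal to the config-distribution channel, so a feature consuming a bad config can be disabled in seconds without first cleaning up the config.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Process-global switch; in a real fleet this would be fed by an
// out-of-band control channel independent of config rollout.
static BOTS_ENABLED: AtomicBool = AtomicBool::new(true);

/// Operator action: take the feature out of the hot path.
fn kill_bots_module() {
    BOTS_ENABLED.store(false, Ordering::Relaxed);
}

/// Hot path: when the switch is off, skip the module entirely rather
/// than loading (and possibly panicking on) the current feature file.
fn handle_request() -> &'static str {
    if BOTS_ENABLED.load(Ordering::Relaxed) {
        "scored" // would run the bots module here
    } else {
        "passed-through-unscored"
    }
}

fn main() {
    assert_eq!(handle_request(), "scored");
    kill_bots_module(); // the fast-off path during an incident
    assert_eq!(handle_request(), "passed-through-unscored");
}
```

A relaxed atomic load is close to free per request, which is what makes it acceptable to put the check on every request in the hot path.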

Caveats / context

  • No customer data compromised. Availability incident, not confidentiality.
  • No attack / malicious activity involved. Initially suspected a hyperscale DDoS for about 40 minutes; later confirmed internal root cause.
  • "Worst outage since 2019." Cloudflare's framing: prior outages have taken down the dashboard, or a new feature, or a subset of services — but 2019 was the last time the majority of core traffic stopped flowing.
  • Two global outages in three weeks. This (11-18) and the 12-05 outage share the structural property that a single change propagated to the entire network. Cloudflare acknowledges on 12-05 that the projects that would have prevented both are not yet complete.
  • FL2 panicked (5xx); FL1 returned bot-score-0 (silent overblock). Bug surface differs by proxy generation; the root cause (oversized feature file) is the same. Unlike the 12-05 outage — which was only FL1 — this outage hit both generations.
  • The ClickHouse permission migration was itself correct. The bug is not in the grant. The bug is in the downstream-consumer code that made an unwritten assumption about query semantics. This is exactly the shape of a transitive-dependency bug that code review cannot catch because the two systems have separate owners.
  • The status-page coincidence is called out explicitly. The Cloudflare status page is intentionally hosted off Cloudflare infrastructure with no Cloudflare dependencies precisely so that a Cloudflare outage doesn't cascade. It went down for independent reasons at the same time — the post explicitly notes this is coincidental and deepened the attack suspicion.
  • Internal WAF testing tool, 12-05's trigger, was the second-order response to 11-18's remediation. The 12-05 post-mortem says the global configuration system was "under review following the outage we experienced on November 18" — i.e., the review was in progress but incomplete, and the system that was under review was the delivery mechanism for the 12-05 trigger. Compounding structural hazard across the two incidents.
  • No per-service throughput / capacity / fleet-size numbers. The post is an RCA, not a capacity piece. The one scaled number is ~60 features in use vs 200 preallocated.
  • No quantitative feature-file size numbers. "Doubled" feature count is the disclosure; no actual byte count.
  • Cloudflare's explicit class-level attribution to preallocation is unusually candid. Most vendors would describe the bug but not name preallocation + fixed cap + internally generated input as a class.
