
CLOUDFLARE 2025-07-16 Tier 1


Cloudflare 1.1.1.1 incident on July 14, 2025

Summary

Cloudflare's post-mortem on the 62-minute global outage of the systems/cloudflare-1-1-1-1-resolver|1.1.1.1 public DNS Resolver on 2025-07-14, 21:52–22:54 UTC. The root cause was neither an attack nor a BGP hijack; it was an internal configuration error. On 2025-06-06, a release preparing a topology for a future Data Localization Suite (DLS) service accidentally linked the 1.1.1.1 Resolver's IP prefixes to the new, non-production DLS service topology. Because the DLS service wasn't live, the change made no routing difference at the time: no alerts fired, and the misconfiguration sat dormant in production for 38 days as a latent misconfig.

On 07-14 at 21:48 UTC a second, seemingly innocuous change (attaching a test location to that same non-production DLS topology) triggered a global refresh of network configuration. The dormant link meant the 1.1.1.1 Resolver's topology collapsed from "all locations" to "one offline location", and all 1.1.1.1 prefixes were withdrawn via BGP from Cloudflare data centers globally: 1.1.1.0/24, 1.0.0.0/24, 2606:4700:4700::/48, 162.159.36.0/24, 162.159.46.0/24, 172.64.36.0/24, 172.64.37.0/24, 172.64.100.0/24, 172.64.101.0/24, 2606:54c1:13::/48, and 2a06:98c1:54::/48.

UDP / TCP / DoT query rates dropped sharply; DoH traffic via cloudflare-dns.com was mostly unaffected because that hostname resolves to different IPs. Alerts fired 13 minutes after the 21:48 refresh. A revert at 22:20 UTC instantly restored ~77% of traffic as prefixes were re-announced, but ~23% of edge servers had already been reconfigured to drop the required IP bindings and had to go back through the change-management system (normally a multi-hour progressive rollout, accelerated here after validation in testing locations). Full restoration came at 22:54 UTC.

Unrelated but coincident: Tata Communications (AS4755) started advertising 1.1.1.0/24 at 21:54 UTC. That advertisement was visible only because Cloudflare had withdrawn the prefix; it was not a cause of the outage.
Remediation: deprecate the legacy IP-topology system (which does not use progressive/canary deployment) and migrate fully onto the strategic system, which does.

Key takeaways

  1. Internal config errors can produce anycast-scale global outages. The 2024-06-27 incident Cloudflare cross-links in the post was a BGP hijack; this one is the opposite class — Cloudflare itself withdrew the routes. Same end-user impact (1.1.1.1 unreachable globally) via a completely different failure mode. concepts/anycast is the reach multiplier in both directions: advertising from everywhere wins on latency and DDoS absorption; withdrawing from everywhere loses the service globally in one step. (Canonical wiki instance of concepts/bgp-route-withdrawal as a self-inflicted global-outage primitive.)
  2. Dormant misconfiguration is the quiet long-tail failure mode. The bad link between the 1.1.1.1 prefixes and the DLS topology sat for 38 days with zero observable impact — "This configuration error sat dormant in the production network as the new DLS service was not yet in use, but it set the stage for the outage on July 14. Since there was no immediate change to the production network there was no end-user impact, and because there was no impact, no alerts were fired." The second change — a test-location attachment to a non-production service — was the activation trigger, not the bug. (Canonical wiki instance of concepts/latent-misconfiguration.)
  3. Configuration changes need the same canary discipline as code changes. The blog is explicit: "This model also has a significant flaw in that updates to the configuration do not follow a progressive deployment methodology: Even though this release was peer-reviewed by multiple engineers, the change didn't go through a series of canary deployments before reaching every Cloudflare data center." The fix is to deprecate legacy hard-coded-IP-list configuration and move onto the strategic system that does stage and health-monitor. (Canonical wiki instance of patterns/progressive-configuration-rollout — the control-plane config analogue of patterns/staged-rollout for code.)
  4. "Legacy + strategic in parallel" is a real operational risk surface during long migrations. Cloudflare was in the middle of migrating address/topology management from a legacy system ("hard-coding explicit lists of data center locations and attaching them to particular prefixes") to a strategic one ("describe service topologies without needing to hard-code IP addresses") that supports staged deployment with health monitoring. During migration both systems exist and must be synchronized, which is where the June 6 bug was introduced. The post-mortem's remediation is to accelerate deprecation — you don't linger in the hybrid state. (Canonical wiki instance of patterns/dual-system-sync-during-migration.)
  5. Anycast IP address vs. hostname is a load-bearing availability-architecture choice. Users with 1.1.1.1 hardcoded as their resolver lost service. Users with cloudflare-dns.com configured (typical for DoH) were mostly unaffected — "cloudflare-dns.com uses a different set of IP addresses". A hostname gives you a layer of indirection (and a separate set of prefixes) that can survive a single-prefix mistake; a hardcoded IP makes your client's availability coterminous with that one announcement. The practical wiki lesson: public-resolver users should prefer the hostname-based DoH endpoint when possible.
  6. Progressive rollout of IP bindings normally takes hours, for safety — and the incident itself forces a choice about accelerating it. Even after the revert at 22:20 UTC, ~23% of edge servers needed their IP bindings re-added via change-management, "which is not an instantaneous process by default for safety… the network in individual locations is designed to be updated over a course of multiple hours." Cloudflare accelerated the rollout after validating in testing locations. patterns/fast-rollback in a network fleet is explicitly two phases — the BGP re-announcement is fast (~instant), the server-side binding restore is slow-by-design.
  7. A BGP hijack can become visible through an unrelated withdrawal. Tata Communications' 1.1.1.0/24 advertisement at 21:54 UTC was "from the perspective of the routing system… exactly like a prefix hijack" — Cloudflare Radar even flagged it as one. But it was not causal; it was only visible because Cloudflare's own withdrawal cleared the path for Tata's advertisement to be seen. Incident responders have to disambiguate multiple concurrent anomalies that look related but aren't. (Sibling lesson to the 7.3 Tbps writeup: Cloudflare Radar is load-bearing observability.)
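The dormant-link mechanism in takeaways 1–2 can be sketched as a toy model. All names and structures below are invented for illustration; Cloudflare's actual topology system is not public:

```python
def announced_from(prefix_links, live_locations):
    """For each prefix, the set of live locations that will announce it via BGP."""
    return {pfx: locs & live_locations for pfx, locs in prefix_links.items()}

LIVE = {"LHR", "SJC", "NRT"}  # stand-ins for "all locations"

# Before 2025-06-06: resolver prefixes attached to the production topology.
links = {"1.1.1.0/24": LIVE, "1.0.0.0/24": LIVE}

# 2025-06-06: the misconfig links the prefixes to the non-production DLS
# topology (one offline test location). No refresh runs, so the effective
# announcements don't change and no alert can fire.
links = {"1.1.1.0/24": {"TEST-OFFLINE"}, "1.0.0.0/24": {"TEST-OFFLINE"}}

# 2025-07-14 21:48: an unrelated change triggers a global config refresh,
# and every resolver prefix intersects down to zero live locations.
print(announced_from(links, LIVE))
# {'1.1.1.0/24': set(), '1.0.0.0/24': set()}  -> global BGP withdrawal
```

The point of the sketch: the bug was harmless until something recomputed the mapping, which is exactly the latent-misconfiguration pattern.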

Incident timeline

| Time (UTC) | Event |
| --- | --- |
| 2025-06-06 17:38 | Issue introduced (no impact): DLS-service topology config change accidentally references 1.1.1.1 Resolver prefixes; no routing change yet; no alerts |
| 2025-07-14 21:48 | Impact starts: test-location attachment to the same non-prod DLS service triggers a global config refresh; 1.1.1.1 prefixes begin to be withdrawn |
| 2025-07-14 21:52 | DNS traffic to 1.1.1.1 drops globally |
| 2025-07-14 21:54 | Tata Communications (AS4755) begins advertising 1.1.1.0/24 (non-causal BGP hijack, visible due to the withdrawal) |
| 2025-07-14 22:01 | Impact detected: resolver alerts fire for query / proxy / DC failures |
| 2025-07-14 22:01 | Incident declared |
| 2025-07-14 22:20 | Fix deployed: revert to previous config; accelerated after testing-location validation |
| 2025-07-14 22:54 | Impact ends: resolver alerts cleared, traffic on resolver prefixes back to normal |
| 2025-07-14 22:55 | Incident resolved |

Elapsed: 62 minutes of customer-visible impact (21:52–22:54); 13 minutes start-to-detection (21:48 config refresh to 22:01 alerts); 19 minutes detection-to-initial-revert; 34 minutes from initial-revert to full restoration.
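The elapsed figures can be checked directly from the timeline's timestamps, standard library only:

```python
from datetime import datetime

t = datetime.fromisoformat
impact_start = t("2025-07-14 21:48")  # config refresh, prefixes withdrawn
traffic_drop = t("2025-07-14 21:52")
detected     = t("2025-07-14 22:01")
reverted     = t("2025-07-14 22:20")
restored     = t("2025-07-14 22:54")

def mins(a, b):
    """Whole minutes between two timestamps."""
    return int((b - a).total_seconds() // 60)

print(mins(traffic_drop, restored))  # 62 customer-visible minutes
print(mins(impact_start, detected))  # 13 start-to-detection
print(mins(detected, reverted))      # 19 detection-to-initial-revert
print(mins(reverted, restored))      # 34 initial-revert to full restoration
```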

Affected prefixes

1.1.1.0/24
1.0.0.0/24
2606:4700:4700::/48
162.159.36.0/24
162.159.46.0/24
172.64.36.0/24
172.64.37.0/24
172.64.100.0/24
172.64.101.0/24
2606:54c1:13::/48
2a06:98c1:54::/48

Any traffic to Cloudflare via these IPs on the 1.1.1.1 Resolver service was impacted. UDP / TCP / DoT: significant drop. DoH via cloudflare-dns.com: mostly unaffected (different IP set). Some UDP traffic on unrelated IPs: also mostly unaffected.
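A quick way to test whether a configured resolver address sat inside the withdrawn set is the standard `ipaddress` module; the non-withdrawn example below uses a documentation-range address chosen only to show a miss:

```python
import ipaddress

WITHDRAWN = [ipaddress.ip_network(p) for p in (
    "1.1.1.0/24", "1.0.0.0/24", "2606:4700:4700::/48",
    "162.159.36.0/24", "162.159.46.0/24",
    "172.64.36.0/24", "172.64.37.0/24",
    "172.64.100.0/24", "172.64.101.0/24",
    "2606:54c1:13::/48", "2a06:98c1:54::/48",
)]

def was_withdrawn(ip: str) -> bool:
    """True if this address sat inside any of the withdrawn prefixes."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in WITHDRAWN if net.version == addr.version)

print(was_withdrawn("1.1.1.1"))               # True  -- hardcoded resolver IP
print(was_withdrawn("2606:4700:4700::1111"))  # True
print(was_withdrawn("203.0.113.1"))           # False -- documentation range, for contrast
```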

Mechanism: dual topology systems mid-migration

Cloudflare manages service topologies via both a legacy and a strategic system, kept in sync during the migration:

  • Legacy: hard-code explicit lists of data center locations and attach them to particular prefixes. Adding a DC means updating many lists consistently. Configuration updates do not follow a progressive deployment methodology: a peer-reviewed change goes to every data center at once — no canaries, no health-monitored stages, no automated rollback.
  • Strategic: describe service topologies without hard-coded IP addresses. Accommodates new locations and customer scenarios; allows staged deployment so changes propagate with health monitoring and can roll back on regression.
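Because both systems must agree, a natural guardrail is to render each down to the same prefix-to-locations map and diff them before any global refresh is allowed. A hypothetical sketch, with all names and structures invented:

```python
def render_legacy(hardcoded):
    """Legacy view: explicit per-prefix location lists."""
    return {pfx: frozenset(locs) for pfx, locs in hardcoded.items()}

def render_strategic(links, topologies):
    """Strategic view: prefix -> topology name -> location set."""
    return {pfx: frozenset(topologies[topo]) for pfx, topo in links.items()}

def sync_diff(legacy, strategic):
    """Prefixes whose effective location set disagrees between the two systems."""
    return sorted(pfx for pfx in legacy.keys() | strategic.keys()
                  if legacy.get(pfx) != strategic.get(pfx))

legacy = render_legacy({"1.1.1.0/24": ["LHR", "SJC", "NRT"]})
strategic = render_strategic({"1.1.1.0/24": "dls-preprod"},
                             {"dls-preprod": set(),
                              "resolver-prod": {"LHR", "SJC", "NRT"}})
print(sync_diff(legacy, strategic))  # ['1.1.1.0/24'] -- would block the refresh
```

A check of this shape would have flagged the 06-06 cross-link at review time rather than at activation time.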

During migration both must exist and be synchronized — which is where the 06-06 bug was introduced. The 07-14 remediations:

  • Stage addressing deployments. Replace blast-at-once with gradual, staged, health-mediated deployment so problems surface early.
  • Deprecate legacy systems. Accelerate migration off the legacy hard-coded-list approach; raise documentation/test-coverage standards during the transition window.
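What "stage addressing deployments" means mechanically can be sketched as expanding waves gated on health; the stage fractions and callbacks here are invented, not Cloudflare's tooling:

```python
def progressive_rollout(fleet, apply_change, is_healthy,
                        stages=(0.01, 0.05, 0.25, 1.0)):
    """Apply a config change in expanding waves; revert everything on regression."""
    changed = []
    for frac in stages:
        for dc in fleet[:max(1, int(len(fleet) * frac))]:
            if dc not in changed:
                apply_change(dc, revert=False)
                changed.append(dc)
        if not all(is_healthy(dc) for dc in changed):
            for dc in reversed(changed):      # automated rollback
                apply_change(dc, revert=True)
            return False                      # stopped at canary scale
    return True

# Demo: the new config breaks dc0, so the 1% wave catches it and reverts.
fleet = [f"dc{i}" for i in range(100)]
state = {dc: "old" for dc in fleet}

def apply_change(dc, revert=False):
    state[dc] = "old" if revert else "new"

def is_healthy(dc):
    return not (state[dc] == "new" and dc == "dc0")

print(progressive_rollout(fleet, apply_change, is_healthy))  # False
print(state["dc0"])                                          # back to "old"
```

Contrast with the legacy model: the same bad change applied blast-at-once reaches all 100 data centers before the first health signal arrives.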

Availability-architecture detail: DoH vs DoT/UDP/TCP

Cloudflare called out that DoH (DNS-over-HTTPS) traffic remained "relatively stable" during the incident because most DoH users use the hostname cloudflare-dns.com (configured manually or via browser), not an IP literal — and that hostname resolves to a different set of IP addresses that were not withdrawn. UDP / TCP / DoT users typically configure 1.1.1.1 / 1.0.0.1 / 2606:4700:4700::1111 / 2606:4700:4700::1001 directly, so they lost service.

Generalizable rule of thumb: for a public anycast service, traffic that passes through a hostname lookup before reaching the service has an extra layer of indirection that can survive a single-prefix mistake; traffic that's configured to hit an IP literal directly has its fate tied 1:1 to that announcement. This is a distinct axis from protocol security (DoH encrypts the DNS query; DoT does too; plain UDP doesn't) and happens to fall the same way in the 1.1.1.1 case, which is why "use DoH" is a fair answer here for more than one reason.
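The indirection argument in code terms: a client whose config entry is a hostname performs a fresh lookup and lands on whatever addresses currently back that name, while an IP literal is pinned to one announcement. A minimal stdlib sketch (the helper name is invented):

```python
import socket

def resolver_endpoints(config: str) -> list[str]:
    """Candidate addresses for one resolver config entry (hypothetical helper)."""
    for family in (socket.AF_INET, socket.AF_INET6):
        try:
            socket.inet_pton(family, config)
            return [config]  # IP literal: fate tied 1:1 to one announcement
        except OSError:
            pass
    # Hostname: a fresh lookup each time -- the answer set can change under you.
    return sorted({info[4][0] for info in socket.getaddrinfo(config, 443)})

print(resolver_endpoints("1.1.1.1"))  # ['1.1.1.1'] -- pinned
# resolver_endpoints("cloudflare-dns.com") would return whatever addresses
# currently back that name (needs network); that answer set survived 07-14.
```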

Caveats / context

  • This is the second major 1.1.1.1 outage Cloudflare has publicly written up. The 2024-06-27 incident was a BGP hijack by a different AS; this one (2025-07-14) was a self-inflicted withdrawal. Different mechanisms, same user-visible symptom.
  • No customer data was compromised. This is a reachability / availability incident, not a confidentiality one.
  • No serving-infrastructure numbers (global QPS, server count, POP count) are given — the post is an RCA, not a capacity piece.
  • The revert was fast; the fleet reconfiguration was not. The 62-minute number is misleading without the internal breakdown: 28 minutes of dropping ~all traffic, then 34 minutes at reduced (~77%) capacity until server-side bindings were restored.
  • The 06-06 → 07-14 latency period (38 days of dormant misconfig) is a lower bound on how long this class of bug can hide. The activation event is whatever second change touches the same configuration surface — unpredictable.
  • Cloudflare's remediation focuses on process, not the specific bug. No root-cause-analysis on why the 06-06 PR reviewers missed the cross-link; the fix is structural (deprecate the system that allowed blast-at-once, move to the one that stages).
