# Cloudflare — When DNSSEC goes wrong: how we responded to the .de TLD outage
## Summary
On 2026-05-05 ~19:30 UTC, DENIC — the registry
operator for the .de country-code top-level domain — started
publishing incorrect DNSSEC signatures for the .de zone
during a routine, scheduled key rollover.
Any DNSSEC-validating resolver receiving the
broken records was required by the spec to return
SERVFAIL to clients. Because
.de is consistently among the most broadly queried TLDs
globally, the failure had the potential to make millions of
domains unreachable. This post is Cloudflare's public write-up of
how its 1.1.1.1 public resolver (see
systems/cloudflare-1-1-1-1-resolver) — and its internal
origin-resolution service — absorbed and then mitigated the
event. Two structural behaviours are named
explicitly: serve-stale (RFC
8767) which cushioned user impact for the first 2–3 hours, and
the deployment of a Negative
Trust Anchor-equivalent override (RFC 7646) that ended impact
for 1.1.1.1 users at 22:17 UTC. The post also self-discloses a
bug in 1.1.1.1's Extended DNS
Error (EDE) code propagation — DNSSEC-Bogus errors were
surfacing as EDE 22 "No Reachable Authority" instead of EDE 6
"DNSSEC Bogus" — and commits to fixing it.
## Key takeaways
- A routine TLD key rollover misfired. DENIC's post-incident note (quoted in Cloudflare's write-up) states: "The outage is linked to a routine, scheduled key rollover. During this process, non-validatable signatures were generated and distributed. As a precautionary measure, future rollovers have been suspended until the exact technical causes have been identified." — DENIC (via this post). Canonical instance of signing-key rotation failing at the phase-3→phase-4 gate (activating a new signing key before validators can verify against the published DNSKEY). See also DNSSEC chain of trust: a signature break at a TLD parent zone cascades to every child zone under it.
- SERVFAIL climbed for 3 hours as caches expired. From the post: "After the immediate spike in SERVFAILs at 19:30 UTC, it climbed steadily over the following three hours as cached records slowly started expiring. As each domain's cached records expired and resolvers went back to DENIC for fresh copies, they got back broken signatures and started failing." Canonical illustration of TTL-gated recovery from an upstream break — the SERVFAIL rate is a function of how many cached records have expired, not of the break's severity. Also exposes DNS retry amplification: "a large increase in query volume … clients retry failed queries, often three or more times, inflating the raw numbers."
- Serve-stale (RFC 8767) absorbed the first two hours of impact. From the post: "What might be surprising is that the NOERROR rate stayed relatively stable throughout the incident. That's 'serve stale' at work." 1.1.1.1 implements RFC 8767: "When upstream resolution fails, a resolver may continue serving expired cached records rather than returning an error. This significantly cushions the impact of an upstream outage, buying time for operators to respond." This is the canonical wiki instance of fail-stale applied at the DNS recursive-resolver altitude — serve the last-known-good record past its TTL when the upstream can't answer. See patterns/serve-stale-over-servfail, and the serve-stale sketch after this list.
- Negative Trust Anchor (RFC 7646) ended impact at 22:17 UTC. A Negative Trust Anchor (NTA) tells a validating resolver to treat a specific zone "as if it were unsigned, bypassing validation for names under that zone". RFC 7646 explicitly names TLD misconfiguration as the primary use case. Cloudflare's 1.1.1.1 is built on an internal resolver called Big Pineapple, which lacks a native NTA: "at this time, we have not implemented a native NTA mechanism. Instead, we used an existing override rule mechanism to mark .de as an insecure zone, which causes all .de queries to be resolved as if they don't have DNSSEC enabled. This is functionally equivalent to an NTA, though it is not formally defined in any RFC." Canonical instance of the patterns/negative-trust-anchor-for-tld-outage pattern; see the NTA-override sketch after this list.
- Cloudflare's internal origin resolver got the same treatment. "Cloudflare operates a separate internal resolver for origin resolution, distinct from our publicly available 1.1.1.1 service. To mitigate impact we applied a similar NTA for .de on the internal resolver service, restoring origin connectivity for affected customers." Canonical example of the same mitigation applied at two resolver tiers simultaneously (public DNS for end-users + origin resolution for CDN customers).
- Deliberate security tradeoff documented explicitly. "The decision to bypass DNSSEC is a deliberate tradeoff. Without DNSSEC validation, .de domains become vulnerable to genuine attacks for the duration of the incident." Cloudflare named the tradeoff out loud, with the incident-room framing: "There is no user of 1.1.1.1 resolving a .de name right now who would prefer a SERVFAIL over an unvalidated response." Canonical articulation of why fail-open is the right failure-mode default for DNS validation when the break is widespread, publicly confirmed, and affecting every validating resolver equally.
- Cross-operator coordination via DNS-OARC. Cloudflare communicated the mitigation via the DNS-OARC Mattermost — a chat room where major DNS-resolver operators coordinate. "Incidents like this also highlight why relationships between operators matter … Forums like DNS-OARC provide exactly this: shared mailing lists and chat rooms where operators can coordinate quickly across organizational boundaries when something goes wrong." Canonical wiki instance of cross-organisational DNS incident coordination as a first-class operational substrate — resolver operators "across the Internet independently applied Negative Trust Anchors within an hour, restoring resolution while DENIC worked to fix the zone."
- Extended DNS Errors (RFC 8914) disclosed a bug in 1.1.1.1. EDEs give clients structured diagnostic codes alongside SERVFAIL. Correctly-behaving resolvers returned EDE 6 (DNSSEC Bogus) with a descriptive message pointing at the broken RRSIG; 1.1.1.1 returned EDE 22 (No Reachable Authority) instead. Root cause: "a bug in how we propagate DNSSEC EDE codes up from our trust chain verifier. When the verifier detects a bogus signature it creates the DNSSEC Bogus EDE code, but this is never inserted into the response. Instead, the outer layer of the resolver sees a problem with recursive resolution with no error code and falls back to reporting 'No Reachable Authority.'" Canonical wiki instance of EDE as a self-disclosure surface — the incident made a latent error-propagation bug visible at production altitude. See also redundant error signalling: when the primary error channel (RCODE) is opaque, the supplementary channel (EDE) must actually work. A sketch of the propagation fix appears after this list.
- TLD-level failure has structural blast radius. From the closing takeaways: "This incident highlights a structural reality of the DNS hierarchy: when a registry at the TLD level fails, every domain under that TLD is affected simultaneously, regardless of where it's hosted or which resolver is used. This isn't unique to DNSSEC; the same is true if a TLD's nameservers become unreachable. The hierarchy that makes the global DNS work is also what makes failures at the top propagate downward." Canonical framing of TLD-level failure blast radius as a hierarchy property, not a DNSSEC-specific vulnerability.
- DNSSEC is the enforcer, not the failure. Cloudflare explicitly rejects the "DNSSEC failed" reading: "any technology that is misconfigured will risk breaking for users that rely on it. Leaving critical fiber cables exposed on the seabed for sharks to chew on does not invalidate the important role underwater cables pose … It only highlights that we've sometimes failed to accurately protect it." The validator's job is to reject broken signatures; the failure was in the upstream signing pipeline. DNSSEC surfaced the failure rather than causing it.
## DNSSEC architecture walkthrough (in the post)
The post includes a concise DNSSEC primer that anchors several wiki concepts:
- What DNSSEC is: cryptographic authentication for DNS records — "each set of records is accompanied by a digital signature known as an RRSIG record that lets a resolver verify the records haven't been tampered with." About integrity, not privacy (contrasted with DoT/DoH which are encrypted transport but not authentication).
- Why signatures travel with records: "integrity can be verified regardless of how many caches or hops a response has passed through. A cached record is just as verifiable as a fresh one."
- Chain of trust: root zone → TLD → child zone, each delegation linked by a DS (Delegation Signer) record in the parent zone that contains a cryptographic hash of the child's public key. "A break anywhere in that chain causes validation to fail for everything below it, which is why a misconfiguration at a TLD like .de affects every domain under it." — see concepts/dnssec-chain-of-trust.
- Two key types per zone: Zone Signing Key (ZSK) signs record sets, Key Signing Key (KSK) signs the ZSK. The KSK's public half is what the parent's DS record points at. "Rotating a ZSK is relatively straightforward … Rotating a KSK is more involved, because the parent's DS record must also be updated, often requiring coordination with a registrar or registry."
- The rollover failure window: "During a key rotation, there is a critical window where the old key is being phased out and the new one phased in. If the signatures published in the zone are made with a key that resolvers cannot verify against the zone's published DNSKEY records … resolvers have no choice but to reject the responses and return SERVFAIL." The sketch below shows this check directly.
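The failure window can be checked from a client: fetch a zone's DNSKEY RRset plus its covering RRSIG, then verify the signature against the published keys. A sketch using dnspython (which needs the `cryptography` package for `dns.dnssec`; the server address is a placeholder). During the incident, this is exactly the check that failed for .de, leaving validators no option but SERVFAIL.

```python
import dns.dnssec
import dns.message
import dns.name
import dns.query
import dns.rdataclass
import dns.rdatatype

ZONE = dns.name.from_text("de.")
SERVER = "9.9.9.9"  # placeholder: any DNSSEC-aware resolver to ask

# Request the zone's DNSKEY RRset with the DO bit set so the RRSIG
# comes back too; TCP avoids truncation of the large key response.
query = dns.message.make_query(ZONE, dns.rdatatype.DNSKEY, want_dnssec=True)
response = dns.query.tcp(query, SERVER, timeout=5)

dnskeys = response.find_rrset(
    response.answer, ZONE, dns.rdataclass.IN, dns.rdatatype.DNSKEY
)
rrsigs = response.find_rrset(
    response.answer, ZONE, dns.rdataclass.IN,
    dns.rdatatype.RRSIG, dns.rdatatype.DNSKEY,
)

try:
    # The DNSKEY RRset is self-signed by the KSK. A signature made
    # with a key that cannot be matched against the published DNSKEYs
    # fails here, and a validating resolver must answer SERVFAIL.
    dns.dnssec.validate(dnskeys, rrsigs, {ZONE: dnskeys})
    print("DNSKEY RRset validates")
except dns.dnssec.ValidationFailure as exc:
    print("bogus:", exc)
```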
## Operational numbers
- Incident start: 2026-05-05 19:30 UTC.
- 1.1.1.1 mitigation deployed: 22:17 UTC (≈2h 47m from incident start).
- SERVFAIL ramp: 3 hours of steady climb as cached .de records expired.
- Query-volume amplification: clients retrying failed queries "often three or more times" inflate raw query numbers during DNS incidents; a toy model of this composition follows the list.
- .de scale: "consistently ranks among the most broadly queried TLDs globally" (per Cloudflare Radar).
- Resolver mitigation timing across the community: "resolver operators across the Internet independently applied Negative Trust Anchors within an hour" of the incident starting.
- Affected service surface:
- 1.1.1.1 public DNS (via Big Pineapple), which also powers 1.1.1.1 for Families, Gateway DNS, and DNS Firewall.
- Internal origin resolver used by Cloudflare's CDN for customer origin-name resolution.
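The TTL-gated ramp and retry amplification compose in a simple way. A back-of-envelope model, with every parameter invented for illustration rather than taken from the post: the failing fraction at time t is the share of cached records whose TTL has expired since the break, and observed query volume inflates by the retry factor for exactly that share.

```python
# Back-of-envelope model of the incident dynamics described above.
# All parameters are illustrative, not measured values from the post.

TTL = 3 * 3600   # assume uniformly-aged cache entries with a 3h TTL
RETRIES = 3      # "clients retry failed queries, often three or more times"

def failing_fraction(t: float) -> float:
    """Fraction of cached records expired t seconds after the break.

    With entries uniformly aged across [0, TTL), expiries arrive
    linearly until every record has rolled over: the "steady climb".
    """
    return min(t / TTL, 1.0)

def query_amplification(t: float) -> float:
    """Observed query volume relative to baseline at time t."""
    f = failing_fraction(t)
    return (1 - f) * 1.0 + f * (1 + RETRIES)  # failures get re-asked

for hour in (0.5, 1, 2, 3):
    t = hour * 3600
    print(f"t={hour}h fail={failing_fraction(t):.0%} "
          f"volume=x{query_amplification(t):.1f}")
```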
## Caveats
- Announcement + self-assessment altitude, not a full post-mortem. DENIC's detailed RCA had not yet been published; Cloudflare's post is structured around what they saw + what they did + two self-critiques (no native NTA implementation, EDE-propagation bug).
- No native NTA in Big Pineapple — Cloudflare used an existing "override rule mechanism to mark .de as an insecure zone" which is functionally equivalent to an NTA but not formally defined in any RFC. A native NTA implementation is implied future work.
- No numbers on affected user population. The post quotes SERVFAIL-rate graphs but does not publish absolute counts of users, domains, or queries impacted.
- Upstream RCA pending. DENIC's own investigation had suspended future rollovers and was still identifying the technical causes at publication time. Any deeper mechanism claim about which part of the rollover pipeline misfired is not available in this post.
- Attack-window exposure during NTA is real. Bypassing DNSSEC validation means .de domains were vulnerable to genuine DNS attacks for the duration of the mitigation. The tradeoff is acknowledged explicitly but not quantified.
## Source
- Original: https://blog.cloudflare.com/de-tld-outage-dnssec/
- Raw markdown: raw/cloudflare/2026-05-06-when-dnssec-goes-wrong-how-we-responded-to-the-de-tld-outage-2bf3061d.md
- DENIC incident note: https://blog.denic.de/en/technical-issue-with-de-domains-resolved/
- RFC 7646 (Negative Trust Anchors): https://datatracker.ietf.org/doc/html/rfc7646
- RFC 8767 (Serving Stale Data): https://datatracker.ietf.org/doc/html/rfc8767
- RFC 8914 (Extended DNS Errors): https://datatracker.ietf.org/doc/html/rfc8914
- Big Pineapple intro: https://blog.cloudflare.com/big-pineapple-intro/
- DNS-OARC: https://dns-oarc.net/
## Related
- systems/cloudflare-1-1-1-1-resolver
- systems/big-pineapple
- systems/denic
- systems/dns-oarc
- concepts/dnssec
- concepts/dnssec-chain-of-trust
- concepts/negative-trust-anchor
- concepts/extended-dns-errors
- concepts/tld-level-failure-blast-radius
- concepts/dns-servfail-response
- concepts/dns-resolver-caching
- concepts/signing-key-rotation-lifecycle
- concepts/fail-stale
- concepts/stale-while-revalidate-cache
- patterns/negative-trust-anchor-for-tld-outage
- patterns/serve-stale-over-servfail
- companies/cloudflare