
CLOUDFLARE 2026-01-19 Tier 1


What came first: the CNAME or the A record?

Summary

Cloudflare post-mortem on the ~40-minute partial global outage of systems/cloudflare-1-1-1-1-resolver|1.1.1.1 on 2026-01-08, 17:40–19:55 UTC. The root cause was not an attack and not BGP — a memory optimisation to the 1.1.1.1 cache implementation reordered resource records inside DNS responses, placing CNAME records after the A/AAAA records they aliased instead of before.

Most DNS clients handle either order, but a subset of widely-deployed stub resolvers — notably glibc's getaddrinfo() (via its getanswer_r implementation) and Cisco's DNSC process in three Catalyst switch models — parse the answer section sequentially, keeping track of an expected name: when a CNAME appears first, the expected name is updated and the subsequent A record matches; when the A record comes first, it is discarded as non-matching and resolution fails (or, on the Cisco switches, the DNSC process crashes and the switch enters a reboot loop — Cisco CSCvv99999 service advisory).

The structural cause is 40-year-old ambiguity in RFC 1034 (1987): the spec uses the word "preface" ("the recursive response to a query will be one of the following: the answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer") — not the RFC 2119 normative MUST/SHOULD language introduced ten years later, in 1997. RFC 1034 §3.6 explicitly says "the order of RRs in a set is not significant", and §6.2.1 shows RR-order examples only for two A records in the same RRset — it never clarifies relative order between RRsets in a message section. RFC 4035 (DNSSEC) uses explicit MUST language about which records to include but still does not mandate a particular message-section order. Even with CNAMEs first, sequential parsing breaks if the CNAME chain itself is out of order, because each CNAME belongs to a different RRset (different owner name) and the "RRset order is insignificant" statement doesn't constrain inter-RRset ordering.

Cloudflare began reverting the change at 18:27 UTC and completed the revert by 19:55 UTC. Remediation is (1) keep CNAMEs first for compatibility with the long tail of deployed stub resolvers, forever; (2) add tests asserting the order invariant in the 1.1.1.1 cache code path (the team had originally implemented CNAME-first but had no test, which is why the optimisation silently broke it); (3) file an IETF Internet-Draft at DNSOP to formalise the behaviour as a new RFC.

The 14-message timeline in the raw shows the change entered the 1.1.1.1 codebase 2025-12-02, reached the testing environment 2025-12-10, began global release 2026-01-07 23:48 UTC, reached 90 % of servers 2026-01-08 17:40 UTC (impact onset), incident declared 18:19 UTC (39 min later — slow by Cloudflare's usual standard, reflecting that most traffic was unaffected), revert started 18:27, full impact ended 19:55.
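The sequential expected-name parse described above can be sketched in a few lines. This is a hedged illustration with hypothetical types — not glibc's actual getanswer_r code, which is C — but it reproduces the failure mode:

```rust
// Sketch of a glibc-style sequential answer parse
// (hypothetical Rust types; the real logic is C in getanswer_r).

#[derive(Clone)]
enum Rdata {
    Cname(String), // alias target
    A(String),     // IPv4 address as text, for simplicity
}

#[derive(Clone)]
struct Rr {
    owner: String,
    rdata: Rdata,
}

/// One linear pass over the answer section, tracking the single name
/// we currently expect. Records owned by any other name are dropped
/// and never revisited.
fn sequential_parse(qname: &str, answers: &[Rr]) -> Option<String> {
    let mut expected = qname.to_string();
    let mut addr = None;
    for rr in answers {
        if rr.owner != expected {
            continue; // discarded as non-matching
        }
        match &rr.rdata {
            Rdata::Cname(target) => expected = target.clone(),
            Rdata::A(ip) => addr = Some(ip.clone()),
        }
    }
    addr
}

fn main() {
    let cname = Rr {
        owner: "www.example.com.".into(),
        rdata: Rdata::Cname("cdn.example.com.".into()),
    };
    let a = Rr {
        owner: "cdn.example.com.".into(),
        rdata: Rdata::A("198.51.100.1".into()),
    };

    // CNAME first: the expected name is rewritten before the A arrives.
    assert_eq!(
        sequential_parse("www.example.com.", &[cname.clone(), a.clone()]),
        Some("198.51.100.1".to_string())
    );

    // A first: the A record's owner doesn't match the QNAME yet, so it
    // is discarded and resolution fails — the outage mechanism.
    assert!(sequential_parse("www.example.com.", &[a, cname]).is_none());
}
```

The asymmetry is the whole incident in miniature: the parser is correct for every response the spec's "preface" hint produces, and silently wrong for every response the spec arguably also permits.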

Key takeaways

  1. The spec allows more behaviour than the install-base tolerates. RFC 1034's lack of normative ordering language is technically permissive, but a meaningful population of deployed stub resolvers (glibc getaddrinfo, the Cisco DNSC process) requires CNAME-first. "What the spec permits" and "what you can safely ship" are different sets; the second is smaller and determined by the long tail of unreplaced clients, some of which will never be updated (backward compat). (Source: "we believe it's best to require CNAME records to appear in-order before any other records".)

  2. Age of spec ≠ normativity. RFC 1034 predates RFC 2119's MUST/SHOULD conventions by 10 years. Interpreting old RFCs with modern normative intuitions misreads them in both directions: statements that read as firm requirements may be advisory, and advisory-sounding statements may carry decades of implementer convention that is load-bearing in practice. Cloudflare's original implementation chose CNAME-first consistent with the "preface" hint but did not test the invariant (the test-the-ambiguous-invariant pattern), because the spec wasn't unambiguous enough to mandate the test — exactly the gap the memory optimisation fell through.

  3. RRset vs RR in message sections is a separate axis from RRset-internal order. RFC 1034 §3.6 is clear that within an RRset (same name, type, class) order is insignificant and the A-records-for-same-name example in §6.2.1 backs this up — but the spec treats message sections as sequences of RRs and never says whether different RRsets in a section have a defined relative order. Modern usage (DNSSEC: RRSIG RRsets and their covered RRsets; CNAME chains: CNAME RRsets with different owner names) lives entirely on the axis the spec skips. The Cloudflare piece is the clearest public disentangling of these two ordering axes.

  4. CNAME-first alone doesn't save you — the chain itself must be in order too. Sequential parsers track a single expected name and update it each time they encounter a CNAME whose owner matches; if cdn.example.com. CNAME server.cdn-provider.com. appears before www.example.com. CNAME cdn.example.com., the first CNAME is skipped because its owner doesn't match the initial QNAME, the second CNAME updates the expected name, and the first CNAME is never revisited. Implementations that parse into an ordered set first (e.g. systemd-resolved's DnsAnswerItem) handle this correctly regardless of on-wire order. RFC 1034 does not require CNAME chains to be self-ordered either; the Internet-Draft proposes requiring both.

  5. Stub resolvers ≠ recursive resolvers. RFC 1034's "resolver simply restarts the query at the new name when it encounters a CNAME" guidance (§5.2.2) was written with full recursive resolvers in mind (1.1.1.1 itself, BIND, Unbound), not the stub resolvers (glibc getaddrinfo, Cisco switch DNSC, embedded DNS clients) that most end-user applications actually go through. Stubs don't restart queries — they just parse the response from their configured recursive and extract addresses. The stub/recursive split is load-bearing for understanding which half of the DNS population implements which half of the RFC.

  6. A "zero-behaviour-change" memory optimisation wasn't. The patch replaced `Vec::with_capacity(…) + extend_from_slice + extend_from_slice` with a direct `entry.answer.extend(self.records)`, saving one allocation and one copy per partially-expired CNAME cache hit. Functionally equivalent under RFC 1034's "order is not significant" reading, observably different on the wire — the chain that used to be CNAMEs-then-A is now A-then-CNAMEs. This is a latent-defect-shaped bug in the same family as the 2025-07-14 1.1.1.1 outage: a seemingly-safe refactor on a surface whose invariant isn't captured by automated tests ships into production and detonates on the long tail of strict downstream consumers.

  7. Same-day fix, single-cause revert. 39 min to declare, 8 min from declaration to revert (18:19 → 18:27 UTC), 88 min from revert to full impact-end (fleet-wide redeploy). The fast-rollback path worked cleanly because the change was a single commit on a single code path; compare the 2025-07-14 outage (same surface, different defect class), where ~23 % of edge servers had already lost IP bindings by revert time and required a second, slower pass through the change-management system.

  8. RFC 4035 (DNSSEC) shows the alternative: MUST + higher priority. "When placing a signed RRset in the Answer section, the name server MUST also place its RRSIG RRs in the Answer section. The RRSIG RRs have a higher priority for inclusion than any other RRsets that may have to be included." This is the normative shape Cloudflare is proposing for CNAME ordering. The Internet-Draft (draft-jabley-dnsop-ordered-answer-section) at IETF DNSOP is the stated vehicle.
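The shape of takeaway 6's refactor — identical record multiset, different sequence — can be miniaturised as follows. These are hypothetical record strings and variable names, not the actual cache types:

```rust
// Two ways of assembling an answer buffer that are "functionally
// equivalent" as sets but differ in the order that goes on the wire.
// Hypothetical data; not Cloudflare's actual cache code.

fn main() {
    let cached_cnames = vec!["www. CNAME cdn.", "cdn. CNAME server."];
    let fresh_a = vec!["server. A 198.51.100.1"];

    // Old path: explicit two-step assembly, cached CNAMEs first.
    let mut old: Vec<&str> = Vec::with_capacity(cached_cnames.len() + fresh_a.len());
    old.extend_from_slice(&cached_cnames);
    old.extend_from_slice(&fresh_a);

    // New path: a single extend over a cache entry whose internal
    // layout happens to put the refreshed A record first.
    let records = vec!["server. A 198.51.100.1", "www. CNAME cdn.", "cdn. CNAME server."];
    let mut new: Vec<&str> = Vec::new();
    new.extend(records.iter().copied());

    // Same multiset of records...
    let (mut a, mut b) = (old.clone(), new.clone());
    a.sort();
    b.sort();
    assert_eq!(a, b);

    // ...but observably different on the wire: CNAME-first vs A-first.
    assert_ne!(old, new);
    assert!(old[0].contains("CNAME"));
    assert!(new[0].starts_with("server. A"));
}
```

A test asserting the old/new buffers were element-for-element equal (not just set-equal) is exactly the missing test the remediation names.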

Systems named

  • systems/cloudflare-1-1-1-1-resolver — the affected service.
  • systems/glibc-getaddrinfo — affected stub resolver implementation on Linux; getanswer_r's sequential expected-name parse drops the A record when the CNAME appears after it.
  • systems/systemd-resolved — stub resolver that does handle the reordered response, because it parses records into an ordered set (DnsAnswerItem entries in DnsAnswer.items) and searches the whole set for CNAME matches rather than walking linearly.
  • Cisco Catalyst — three switch models (exact SKUs in Cisco's service document) whose DNSC process crashes on the reordered response, causing spontaneous reboot loops when configured to use 1.1.1.1 as resolver.
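The systemd-resolved approach above — collect the whole answer section first, then search the set for each chain link — can be sketched as follows (hypothetical Rust types standing in for the C implementation):

```rust
// Order-insensitive sketch: parse everything into a set, then follow
// the CNAME chain by searching the entire set at each step, so
// on-wire order is irrelevant. Hypothetical types, not resolved's code.

enum Rdata {
    Cname(String), // alias target
    A(String),     // IPv4 address as text, for simplicity
}

struct Rr {
    owner: String,
    rdata: Rdata,
}

/// Follow the chain from qname, searching the whole set for each hop.
/// The loop is bounded by the answer count to survive CNAME cycles.
fn set_based_parse(qname: &str, answers: &[Rr]) -> Option<String> {
    let mut name = qname.to_string();
    for _ in 0..=answers.len() {
        // An A record owned by the current name resolves the query.
        for rr in answers {
            if rr.owner == name {
                if let Rdata::A(ip) = &rr.rdata {
                    return Some(ip.clone());
                }
            }
        }
        // Otherwise follow a CNAME owned by the current name, wherever
        // it sits in the section.
        let next = answers.iter().find_map(|rr| match &rr.rdata {
            Rdata::Cname(t) if rr.owner == name => Some(t.clone()),
            _ => None,
        })?;
        name = next;
    }
    None
}

fn main() {
    // Fully scrambled section: A record first, chain links out of order.
    let answers = vec![
        Rr { owner: "server.cdn-provider.com.".into(), rdata: Rdata::A("198.51.100.1".into()) },
        Rr { owner: "cdn.example.com.".into(), rdata: Rdata::Cname("server.cdn-provider.com.".into()) },
        Rr { owner: "www.example.com.".into(), rdata: Rdata::Cname("cdn.example.com.".into()) },
    ];
    assert_eq!(
        set_based_parse("www.example.com.", &answers),
        Some("198.51.100.1".to_string())
    );
}
```

This handles both ordering axes the summary distinguishes: A-before-CNAME and an out-of-order chain, at the cost of a second pass over the section per hop.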

Concepts named

  • concepts/cname-chain — DNS alias-traversal primitive: www.example.com → cdn.example.com → server.cdn-provider.com → 198.51.100.1; each hop is a CNAME with its own TTL, cached independently and expiring independently, so a chain can be partially expired in cache.
  • concepts/dns-message-ordering — the specific ambiguity axis: RRset-internal order (RFC 1034 §3.6: insignificant) vs. inter-RRset order in a message section (RFC 1034: unspecified).
  • concepts/resource-record-set — collection of records sharing (name, type, class); the atomic unit over which RFC 1034 defines ordering.
  • concepts/stub-vs-recursive-resolver — the full-resolver-vs-stub dichotomy that makes RFC 1034 §5.2.2's "restart on CNAME" language inapplicable to the clients that actually run on end-user devices.
  • concepts/rfc-normative-language — RFC 2119 MUST/SHOULD/MAY (1997); why pre-1997 RFCs can't be read with post-1997 normative intuitions.
  • concepts/latent-misconfiguration — the "shipped but undetected until downstream parser hits it" bug shape; sibling of the 2025-07-14 1.1.1.1 outage's latent-misconfig shape.
  • concepts/backward-compatibility — the long-tail-of-deployed-clients discipline that promotes "the spec allows it" into "we still can't ship it".

Patterns named

  • patterns/test-the-ambiguous-invariant — when a spec is ambiguous but deployed clients rely on one reading, the correct reading is the one you should test for; Cloudflare named the missing test as a remediation.
  • patterns/fast-rollback — 8 min from incident-declaration to revert-start; enabled by the single-commit single-path nature of the change.
  • patterns/staged-rollout — the change had been rolled through a testing environment (2025-12-10) and a global 90 %-fleet phased rollout (2026-01-07 23:48 → 2026-01-08 17:40 UTC), but the defect was invisible at every pre-90 % checkpoint because the broken clients are a minority of traffic; the sibling lesson is that staged rollouts catch crashes that scale with traffic volume but not the subset of client-implementation-specific crashes where the broken population is small and uncorrelated with POP selection.

Operational numbers and timeline (UTC)

Time (UTC) Event
2025-12-02 Record-reordering change introduced to 1.1.1.1 codebase
2025-12-10 Change released to testing environment
2026-01-07 23:48 Global release starts
2026-01-08 17:40 Release reaches 90 % of servers (impact onset)
2026-01-08 18:19 Incident declared (+39 min from impact onset)
2026-01-08 18:27 Release reverted (+8 min from declaration)
2026-01-08 19:55 Revert completed — full impact ends
  • Total impact window: ~135 minutes (17:40 → 19:55 UTC); severe impact ~47 min (17:40 → 18:27 UTC before revert started).
  • Rollout window: 37 days from code commit, 29 days from testing-environment release, ~18 hours from global-release start to the 90 % fleet mark — typical multi-stage Cloudflare cadence.
  • Affected client classes: glibc getaddrinfo stubs (Linux userspace, broad surface), Cisco Catalyst DNSC (three SKUs, configured to use 1.1.1.1 as resolver — reboot-looped).
  • Unaffected client classes: systemd-resolved, systems/cloudflare-1-1-1-1-resolver|1.1.1.1 itself as upstream for other recursives, BIND, Unbound, modern browsers' builtin DoH clients.

Caveats

  • The raw post does not enumerate how many end users / sessions / queries were affected — Cloudflare only describes the client-implementation classes that are broken and the rollout timeline. Magnitude is inferable (glibc getaddrinfo is the default Linux userspace resolver path → any Linux app not using its own resolver library was exposed) but the raw doesn't give a percentage.
  • The Internet-Draft has not yet gone through IETF DNSOP consensus as of publication; the post is advocating, not reporting on, a new RFC.
  • The raw gives the getanswer_r code pattern but not the corresponding systemd-resolved source path in full; the systemd-resolved code shown is DnsAnswerItem / DnsAnswer type definitions, with the actual CNAME-chain search logic described prose-wise as "search the entire answer set" without the function name.
  • The Cisco service document is linked from the raw but is behind cisco.com; the three specific Catalyst SKUs are not named in the raw body (only in the linked advisory).
  • Temporal framing: this is the second named 1.1.1.1 outage on the wiki, after 2025-07-14. Both are internal-config / internal-code causes rather than attacks or BGP hijacks, and both detonate through a latent-defect-on-a-non-test-covered-surface mechanism. The pairing makes 1.1.1.1 the canonical wiki instance of anycast-scale services failing from within through latent defects that pre-deployment gates don't catch.
