What came first: the CNAME or the A record?¶
Summary¶
Cloudflare post-mortem on the ~40-minute partial global outage of
systems/cloudflare-1-1-1-1-resolver|1.1.1.1 on 2026-01-08,
17:40–19:55 UTC. Root cause not an attack, not BGP — a
memory optimisation to the 1.1.1.1 cache implementation that
reordered resource records inside DNS responses, placing CNAME
records after the A/AAAA records they aliased instead of before.
Most DNS clients handle either order, but a subset of widely-deployed
stub resolvers — notably glibc's getaddrinfo() (via its
getanswer_r implementation) and Cisco's DNSC process in three
Catalyst switch models — parse the answer section sequentially
keeping track of an expected name: when a CNAME appears first, the
expected name is updated and the subsequent A record matches; when
the A record comes first, it is discarded as not-matching, and
resolution fails (or, on the Cisco switches, the DNSC process crashes
and the switch enters a reboot loop —
Cisco CSCvv99999 service advisory).
The structural cause is 40-year-old ambiguity in
RFC 1034 (1987):
the spec uses the word "preface" ("the recursive response to a
query will be one of the following: the answer to the query, possibly
preface by one or more CNAME RRs that specify aliases encountered on
the way to an answer") — not the
RFC 2119 normative MUST/SHOULD language introduced in 1997, ten
years later. RFC 1034 §3.6 explicitly says "the order of RRs in a
set is not significant" and §6.2.1 shows RR-order examples only for
two A records in the same RRset — it never clarifies relative
order between RRsets in a message section. RFC 4035 (DNSSEC) uses
explicit MUST language about which records to include but still does
not mandate a particular message-section order. Even with CNAMEs
first, sequential parsing breaks if the CNAME chain itself is out
of order, because each CNAME belongs to a different RRset (different
owner name) and the "RRset order is insignificant" statement doesn't
constrain inter-RRset ordering. Cloudflare reverted the change at
18:27 UTC and completed the revert by 19:55 UTC; remediation is
(1) keep CNAMEs first for compatibility with the long tail of
deployed stub resolvers, forever; (2) add tests asserting the order
invariant in the 1.1.1.1 cache code path (the team had originally
implemented CNAME-first but had no test, which is why the
optimisation silently broke it); (3) file an
IETF Internet-Draft
at DNSOP to formalise the behaviour as a new RFC. The 14-message
timeline in the raw shows the change entered the 1.1.1.1 codebase
2025-12-02, reached testing environment 2025-12-10, began global
release 2026-01-07 23:48 UTC, reached 90 % of servers 2026-01-08
17:40 UTC (impact onset), incident declared 18:19 UTC (39 min later
— slow by Cloudflare's usual standard, reflecting that most traffic
was unaffected), reverted by 18:27, full impact ended 19:55.
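The failing sequential parse described above can be sketched in a few lines (illustrative Python, not glibc's actual `getanswer_r` source; record tuples and names are made up for the example):

```python
# A stub resolver that, like glibc's getanswer_r, walks the answer
# section once, tracking a single "expected name" that each matching
# CNAME rewrites. Records that don't match the expected name are dropped.
def resolve_sequential(qname, answers):
    """answers: list of (owner, rtype, rdata) tuples in on-wire order."""
    expected = qname
    addresses = []
    for owner, rtype, rdata in answers:
        if owner != expected:
            continue                 # mismatch -> silently discarded
        if rtype == "CNAME":
            expected = rdata         # follow the alias
        elif rtype in ("A", "AAAA"):
            addresses.append(rdata)
    return addresses

cname_first = [
    ("www.example.com.", "CNAME", "cdn.example.com."),
    ("cdn.example.com.", "A", "198.51.100.1"),
]
a_first = list(reversed(cname_first))

print(resolve_sequential("www.example.com.", cname_first))  # ['198.51.100.1']
print(resolve_sequential("www.example.com.", a_first))      # [] -> resolution fails
```

With the A record first, its owner doesn't match the still-unrewritten query name, so it is discarded before the CNAME gets a chance to update the expected name — the exact failure mode of the outage.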
Key takeaways¶
- The spec allows more behaviour than the install-base tolerates. RFC 1034's lack of normative ordering language is technically permissive, but a meaningful population of deployed stub resolvers (glibc `getaddrinfo`, the Cisco DNSC process) requires CNAME-first. "What the spec permits" and "what you can safely ship" are different sets; the second is smaller and determined by the long tail of unreplaced clients, some of which will never be updated (backward compat). (Source: "we believe it's best to require CNAME records to appear in-order before any other records".)
- Age of spec ≠ normativity. RFC 1034 predates RFC 2119's MUST/SHOULD conventions by 10 years. Interpreting old RFCs with modern normative intuitions mis-reads them in both directions: statements that read as firm requirements may be advisory, and advisory-sounding statements may carry decades of implementer convention that is load-bearing in practice. Cloudflare's original implementation chose CNAME-first consistent with the "preface" hint but did not test the invariant (see patterns/test-the-ambiguous-invariant), because the spec wasn't unambiguous enough to mandate the test — exactly the gap the memory optimisation fell through.
- Inter-RRset order in message sections is a separate axis from RRset-internal order. RFC 1034 §3.6 is clear that within an RRset (same name, type, class) order is insignificant, and the A-records-for-same-name example in §6.2.1 backs this up — but the spec treats message sections as sequences of RRs and never says whether different RRsets in a section have a defined relative order. Modern usage (DNSSEC: RRSIG RRsets and their covered RRsets; CNAME chains: CNAME RRsets with different owner names) lives entirely on the axis the spec skips. The Cloudflare piece is the clearest public disentangling of these two ordering axes.
- CNAME-first alone doesn't save you — the chain itself must be in order too. Sequential parsers track a single expected name and update it each time they encounter a CNAME whose owner matches; if `cdn.example.com. CNAME server.cdn-provider.com.` appears before `www.example.com. CNAME cdn.example.com.`, the first CNAME is skipped because its owner doesn't match the initial QNAME, the second CNAME updates the expected name, and the first CNAME is never revisited. Implementations that parse into an ordered set first (e.g. systemd-resolved's `DnsAnswerItem`) handle this correctly regardless of on-wire order. RFC 1034 does not require CNAME chains to be self-ordered either; the Internet-Draft proposes requiring both.
- Stub resolvers ≠ recursive resolvers. RFC 1034's "resolver simply restarts the query at the new name when it encounters a CNAME" guidance (§5.2.2) was written with full recursive resolvers in mind (1.1.1.1 itself, BIND, Unbound), not the stub resolvers (glibc `getaddrinfo`, Cisco switch DNSC, embedded DNS clients) that most end-user applications actually go through. Stubs don't restart queries — they just parse the response from their configured recursive and extract addresses. The stub/recursive split is load-bearing for understanding which half of the DNS population implements which half of the RFC.
- A "zero-behaviour-change" memory optimisation wasn't. The patch replaced `Vec::with_capacity(…)` + `extend_from_slice` + `extend_from_slice` with a direct `entry.answer.extend(self.records)`, saving one allocation and one copy per partially-expired CNAME cache hit. Functionally equivalent under RFC 1034's "order is not significant" reading, observably different on the wire — the chain that used to be CNAMEs-then-A is now A-then-CNAMEs. This is a latent-defect-shaped bug in the same family as the 2025-07-14 1.1.1.1 outage: a seemingly-safe refactor on a surface whose invariant isn't captured by automated tests ships into production and detonates on the long tail of strict downstream consumers.
- Same-day-fix, single-cause revert. 39 min to declare, 8 min from declaration to revert (18:19 → 18:27 UTC), 88 min from revert to full impact-end (fleet-wide redeploy). The fast-rollback path worked cleanly because the change was a single commit on a single code path; compare to the 2025-07-14 outage (same surface, different defect class) where ~23 % of edge servers had already lost IP bindings by revert time and required a second slower pass through the change-management system.
- RFC 4035 (DNSSEC) shows the alternative: MUST + higher priority. "When placing a signed RRset in the Answer section, the name server MUST also place its RRSIG RRs in the Answer section. The RRSIG RRs have a higher priority for inclusion than any other RRsets that may have to be included." This is the normative shape Cloudflare is proposing for CNAME ordering. The Internet-Draft (draft-jabley-dnsop-ordered-answer-section) at IETF DNSOP is the stated vehicle.
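The order-insensitive alternative can be sketched as well (illustrative Python in the spirit of systemd-resolved's whole-answer search; function and record names are made up, not the actual systemd code):

```python
# Treat the answer section as a set: re-scan it from the start each
# time the target name changes, so neither CNAME position nor chain
# order on the wire matters.
def resolve_set_based(qname, answers):
    """answers: list of (owner, rtype, rdata) tuples, any order."""
    target, seen = qname, set()
    while True:
        cname = next((r for r in answers
                      if r[0] == target and r[1] == "CNAME"), None)
        if cname is None or target in seen:
            break                    # end of chain (or an alias loop)
        seen.add(target)
        target = cname[2]
    return [rdata for owner, rtype, rdata in answers
            if owner == target and rtype in ("A", "AAAA")]

# Chain deliberately shuffled: address first, second hop before first.
shuffled = [
    ("server.cdn-provider.com.", "A", "198.51.100.1"),
    ("cdn.example.com.", "CNAME", "server.cdn-provider.com."),
    ("www.example.com.", "CNAME", "cdn.example.com."),
]
print(resolve_set_based("www.example.com.", shuffled))  # ['198.51.100.1']
```

The cost is repeated scans over the answer set, which is why memory-lean sequential parsers exist in the first place — but it is immune to both ordering defects the takeaways above describe.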
Systems named¶
- systems/cloudflare-1-1-1-1-resolver — the affected service.
- systems/glibc-getaddrinfo — affected stub resolver implementation on Linux; `getanswer_r`'s sequential expected-name parse drops the A record when the CNAME appears after it.
- systems/systemd-resolved — stub resolver that *does* handle the reordered response, because it parses records into an ordered set (`DnsAnswerItem` → `DnsAnswer.items`) and searches the whole set for CNAME matches rather than walking linearly.
- Cisco Catalyst — three switch models (exact SKUs in Cisco's service document) whose DNSC process crashes on the reordered response, causing spontaneous reboot loops when configured to use 1.1.1.1 as resolver.
Concepts named¶
- concepts/cname-chain — DNS alias-traversal primitive: www.example.com → cdn.example.com → server.cdn-provider.com → 198.51.100.1; each hop is a CNAME with its own TTL, cached independently, so chains can partially expire.
- concepts/dns-message-ordering — the specific ambiguity axis: RRset-internal order (RFC 1034 §3.6: insignificant) vs. inter-RRset order in a message section (RFC 1034: unspecified).
- concepts/resource-record-set — collection of records sharing (name, type, class); the atomic unit over which RFC 1034 defines ordering.
- concepts/stub-vs-recursive-resolver — the full-resolver-vs-stub dichotomy that makes RFC 1034 §5.2.2's "restart on CNAME" language inapplicable to the clients that actually run on end-user devices.
- concepts/rfc-normative-language — RFC 2119 MUST/SHOULD/MAY (1997); why pre-1997 RFCs can't be read with post-1997 normative intuitions.
- concepts/latent-misconfiguration — the "shipped but undetected until downstream parser hits it" bug shape; sibling of the 2025-07-14 1.1.1.1 outage's latent-misconfig shape.
- concepts/backward-compatibility — the long-tail-of-deployed-clients discipline that promotes "the spec allows it" into "we still can't ship it".
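The partially-expired-chain cache path where the defect lived can be sketched (hypothetical Python standing in for Cloudflare's Rust; function and variable names are illustrative, not the actual code):

```python
# On a cache hit where the CNAME hops are still valid but the terminal
# A record has expired, the answer is rebuilt from cached hops plus a
# freshly fetched tail. The safe rebuild keeps cached CNAMEs first.
def rebuild_answer(cached_cnames, fresh_records):
    # Safe order: cached CNAME chain, then the refetched records.
    return list(cached_cnames) + list(fresh_records)

def rebuild_answer_buggy(cached_cnames, fresh_records):
    # The defect shape: extending the fresh answer with the cached
    # records appends the CNAMEs *after* the A record.
    answer = list(fresh_records)
    answer.extend(cached_cnames)
    return answer

cached = [("www.example.com.", "CNAME", "cdn.example.com.")]
fresh = [("cdn.example.com.", "A", "198.51.100.1")]
print(rebuild_answer(cached, fresh)[0][1])        # CNAME -> chain intact
print(rebuild_answer_buggy(cached, fresh)[0][1])  # A -> the outage shape
```

Both variants contain identical records, which is exactly why the change looked behaviour-preserving under the "order is not significant" reading.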
Patterns named¶
- patterns/test-the-ambiguous-invariant — when a spec is ambiguous but deployed clients rely on one reading, the correct reading is the one you should test for; Cloudflare named the missing test as a remediation.
- patterns/fast-rollback — 8 min from incident-declaration to revert-start; enabled by the single-commit single-path nature of the change.
- patterns/staged-rollout — the change had been rolled through a testing environment (2025-12-10) and a global 90 %-fleet phased rollout (2026-01-07 23:48 → 2026-01-08 17:40 UTC), but the defect was invisible at every pre-90 % checkpoint because the broken clients are a minority of traffic; the sibling lesson is that staged rollouts catch crashes that scale with traffic volume but not the subset of client-implementation-specific crashes where the broken population is small and uncorrelated with POP selection.
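The missing test that patterns/test-the-ambiguous-invariant names can be sketched as an assertion over an assembled answer section (illustrative Python; record tuples are (owner, rtype, rdata) and the helper name is made up):

```python
# Check both invariants the Internet-Draft would formalise:
# (1) every CNAME precedes every address record, and
# (2) the CNAME chain is self-ordered starting from the query name.
def check_cname_first(qname, answers):
    types = [rtype for _, rtype, _ in answers]
    cname_idx = [i for i, t in enumerate(types) if t == "CNAME"]
    addr_idx = [i for i, t in enumerate(types) if t in ("A", "AAAA")]
    # Invariant 1: no address record before the last CNAME.
    if cname_idx and addr_idx and max(cname_idx) > min(addr_idx):
        return False
    # Invariant 2: walk the chain; each CNAME owner must match the
    # name the previous hop pointed at.
    expected = qname
    for i in cname_idx:
        owner, _, rdata = answers[i]
        if owner != expected:
            return False
        expected = rdata
    return True

good = [("www.example.com.", "CNAME", "cdn.example.com."),
        ("cdn.example.com.", "A", "198.51.100.1")]
assert check_cname_first("www.example.com.", good)
assert not check_cname_first("www.example.com.", list(reversed(good)))
```

Had an assertion of this shape guarded the cache code path, the memory optimisation would have failed CI instead of shipping.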
Operational numbers and timeline (UTC)¶
| Time (UTC) | Event |
|---|---|
| 2025-12-02 | Record-reordering change introduced to 1.1.1.1 codebase |
| 2025-12-10 | Change released to testing environment |
| 2026-01-07 23:48 | Global release starts |
| 2026-01-08 17:40 | Release reaches 90 % of servers (impact onset) |
| 2026-01-08 18:19 | Incident declared (+39 min from impact onset) |
| 2026-01-08 18:27 | Release reverted (+8 min from declaration) |
| 2026-01-08 19:55 | Revert completed — full impact ends |
- Total impact window: ~135 minutes (17:40 → 19:55 UTC); severe impact ~47 min (17:40 → 18:27 UTC before revert started).
- Rollout window: 37 days from code commit to impact, 29 days from testing environment, ~18 hours from global release start to 90 % of the fleet — typical multi-stage Cloudflare cadence.
- Affected client classes: glibc `getaddrinfo` stubs (Linux userspace, broad surface), Cisco Catalyst DNSC (three SKUs, configured to use 1.1.1.1 as resolver — reboot-looped).
- Unaffected client classes: systemd-resolved, systems/cloudflare-1-1-1-1-resolver|1.1.1.1 itself as upstream for other recursives, BIND, Unbound, modern browsers' built-in DoH clients.
Caveats¶
- The raw post does not enumerate how many end users / sessions / queries were affected — Cloudflare only describes the client-implementation classes that are broken and the rollout timeline. Magnitude is inferable (glibc `getaddrinfo` is the default Linux userspace resolver path → any Linux app not using its own resolver library was exposed) but the raw doesn't give a percentage.
- The Internet-Draft has not yet gone through IETF DNSOP consensus as of publication; the post is advocating, not reporting on, a new RFC.
- The raw gives the `getanswer_r` code pattern but not the corresponding systemd-resolved source path in full; the systemd-resolved code shown is `DnsAnswerItem`/`DnsAnswer` type definitions, with the actual CNAME-chain search logic described prose-wise as "search the entire answer set" without the function name.
- The Cisco service document is linked from the raw but is behind cisco.com; the three specific Catalyst SKUs are not named in the raw body (only in the linked advisory).
- Temporal framing: this is the second named 1.1.1.1 outage on the wiki, after 2025-07-14. Both are internal-config / internal-code causes rather than attacks or BGP hijacks, and both detonate through a latent-defect-on-a-non-test-covered-surface mechanism. The pairing makes 1.1.1.1 the canonical wiki instance of anycast-scale services failing from within through latent defects that pre-deployment gates don't catch.
Source¶
- Original: https://blog.cloudflare.com/cname-a-record-order-dns-standards/
- Raw markdown:
raw/cloudflare/2026-01-19-what-came-first-the-cname-or-the-a-record-ef430b56.md
Related¶
- systems/cloudflare-1-1-1-1-resolver
- sources/2025-07-16-cloudflare-1111-incident-on-july-14-2025 — the other 1.1.1.1 latent-defect outage; different defect class (config link vs. code refactor), same failure-shape family.
- companies/cloudflare
- concepts/cname-chain
- concepts/dns-message-ordering
- concepts/resource-record-set
- concepts/stub-vs-recursive-resolver
- concepts/rfc-normative-language
- concepts/backward-compatibility
- patterns/test-the-ambiguous-invariant
- patterns/fast-rollback
- patterns/staged-rollout