CONCEPT Cited by 2 sources
DNS SERVFAIL response
SERVFAIL (RCODE 2) is the generic "server failure" response code
that a DNS server returns when resolution fails: an upstream
timeout, a policy rejection, an internal failure, or an unreachable
authoritative nameserver. It tells the client that something went
wrong and nothing more; the error is opaque.
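As a minimal sketch of where the code lives on the wire: the RCODE is the low four bits of the 16-bit flags word in the 12-byte DNS header (RFC 1035). The helper name `rcode_of` and the hand-built sample header are illustrative assumptions, not taken from either source.

```python
import struct

# RCODE values defined in RFC 1035, section 4.1.1.
RCODE_NAMES = {
    0: "NOERROR",
    1: "FORMERR",
    2: "SERVFAIL",
    3: "NXDOMAIN",
    4: "NOTIMP",
    5: "REFUSED",
}


def rcode_of(message: bytes) -> str:
    """Return the RCODE name carried in a raw DNS message (hypothetical helper)."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 bytes; message is too short")
    _ident, flags = struct.unpack("!HH", message[:4])
    rcode = flags & 0x000F  # low 4 bits of the flags word
    return RCODE_NAMES.get(rcode, f"RCODE{rcode}")


# Hand-built response header: QR=1, RD=1, RA=1, RCODE=2, all section counts zero.
sample = struct.pack("!HHHHHH", 0x1234, 0x8182, 0, 0, 0, 0)
print(rcode_of(sample))  # SERVFAIL
```

Nothing else in the header distinguishes one cause of SERVFAIL from another, which is why the code alone is not enough to diagnose a failure.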
Investigation implication
Because SERVFAIL is an opaque response code, it cannot be the
bottom-of-stack signal for a root-cause investigation. The
engineer must pivot from "we're seeing SERVFAIL" to other
signals:
- Resolver internal metrics. Unbound's request-list depth metric shows whether queries are queuing locally.
- Packet-rate observation. A packet-level rate metric (see patterns/iptables-packet-counter-for-rate-metric) can reveal whether the outbound side is hitting a cap.
- Manual reproduction via dig. Running the failing query by hand against each resolver in the chain localises which layer is slow or failing.
- Request-list dump. unbound-control dump_requestlist shows pending queries and what they're waiting on.
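The manual-reproduction step can also be scripted. The sketch below sends the same query to each resolver in a chain over UDP and records the response code and elapsed time, so the layer that is slow or failing stands out. It is a rough stand-in for running dig by hand, not the method either source used. The function names are hypothetical, and the sketch does not validate the transaction ID, retry, or fall back to TCP.

```python
import socket
import struct
import time

RCODE_NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qtype=1, ident=0x1234):
    """Build a minimal DNS query: RD=1, one question, class IN (qtype 1 = A)."""
    header = struct.pack("!HHHHHH", ident, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        struct.pack("!B", len(label)) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
        if label
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def probe(resolver, name, timeout=2.0):
    """Send one UDP query to `resolver`; return (outcome, seconds elapsed)."""
    query = build_query(name)
    start = time.monotonic()
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(query, (resolver, 53))
            reply, _addr = sock.recvfrom(4096)
    except socket.timeout:
        return "TIMEOUT", time.monotonic() - start
    except OSError as exc:
        return f"ERROR: {exc}", time.monotonic() - start
    elapsed = time.monotonic() - start
    if len(reply) < 4:
        return "MALFORMED", elapsed
    rcode = struct.unpack("!H", reply[2:4])[0] & 0x000F
    return RCODE_NAMES.get(rcode, f"RCODE{rcode}"), elapsed


def probe_chain(resolvers, name):
    """Probe each resolver in order; return (resolver, outcome, seconds) tuples."""
    return [(resolver, *probe(resolver, name)) for resolver in resolvers]
```

Calling `probe_chain(["127.0.0.1", "10.0.0.2"], "example.com")`, where both addresses are placeholders for a local Unbound instance and its upstream, returns one row per layer. A SERVFAIL or timeout that appears only at one hop narrows the search to that hop.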
Seen in
- Cloudflare, "When DNSSEC goes wrong: how we responded to the .de TLD outage" (2026-05-06). Canonical wiki instance of SERVFAIL as the spec-mandated response to a DNSSEC validation failure. When DENIC published non-validatable signatures during the 2026-05-05 .de TLD rollover, every validating resolver on the Internet, 1.1.1.1 (systems/cloudflare-1-1-1-1-resolver) included, "was required by the DNSSEC specification to reject them and return SERVFAIL to clients." The SERVFAIL rate on 1.1.1.1 "climbed steadily over the following three hours as cached records slowly started expiring", a canonical illustration of the SERVFAIL rate being TTL-gated when upstream is broken. The post also disclosed a 1.1.1.1 bug in Extended DNS Error (EDE) propagation: DNSSEC-Bogus errors surfaced as EDE 22 ("No Reachable Authority") rather than EDE 6 ("DNSSEC Bogus"), compounding the opacity of the SERVFAIL itself. Mitigation paired serve-stale (RFC 8767), which kept NOERROR rates stable for hours, with a Negative Trust Anchor equivalent (deployed at 22:17 UTC). (Source: sources/2026-05-06-cloudflare-when-dnssec-goes-wrong-de-tld-outage.)
- Stripe, "The secret life of DNS packets" (2024-12-12). Stripe's initial alert signal was a small-percentage SERVFAIL rate for internal requests during hourly spikes. The post explicitly calls out the opacity: "SERVFAIL is a generic response that DNS servers return when an error occurs, but it doesn't tell us much about what caused the error." The investigation pivoted to Unbound's request-list depth metric to localise the problem.
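The EDE codes mentioned in the Cloudflare entry travel in an EDNS0 option (option code 15, per RFC 8914) inside the OPT record of the response. The sketch below assumes the OPT record's RDATA has already been extracted from the message and only walks the options within it. The function name and the two-entry name table are illustrative, not a full registry.

```python
import struct

EDE_OPTION_CODE = 15  # EDNS0 option code for Extended DNS Errors (RFC 8914)

# Only the two INFO-CODE values discussed above; RFC 8914 defines many more.
EDE_NAMES = {6: "DNSSEC Bogus", 22: "No Reachable Authority"}


def extended_errors(opt_rdata: bytes):
    """Return (info_code, name, extra_text) for each EDE option in OPT RDATA."""
    found = []
    offset = 0
    while offset + 4 <= len(opt_rdata):
        code, length = struct.unpack("!HH", opt_rdata[offset:offset + 4])
        data = opt_rdata[offset + 4:offset + 4 + length]
        if len(data) < length:
            break  # truncated option; stop rather than guess
        offset += 4 + length
        if code != EDE_OPTION_CODE or length < 2:
            continue
        info_code = struct.unpack("!H", data[:2])[0]
        extra_text = data[2:].decode("utf-8", errors="replace")
        found.append((info_code, EDE_NAMES.get(info_code, "unknown"), extra_text))
    return found


# Hand-built RDATA holding one EDE option: INFO-CODE 6 with extra text "bogus".
sample = struct.pack("!HHH", EDE_OPTION_CODE, 2 + 5, 6) + b"bogus"
print(extended_errors(sample))  # [(6, 'DNSSEC Bogus', 'bogus')]
```

A client that logs these tuples alongside the SERVFAIL count can separate validation failures from reachability failures, provided the resolver populates the option correctly.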
Related
- concepts/dns-request-amplification-via-retries
- concepts/request-queue-depth-metric
- concepts/redundant-error-signalling — the design lesson inverse: if your failure signal is opaque, complement it with independent observability signals rather than rely on retry-and-escalate.
- concepts/extended-dns-errors — RFC 8914; the supplementary error channel that disambiguates SERVFAIL subtypes when implemented correctly.
- concepts/dnssec · concepts/negative-trust-anchor — the validator whose refusal produces SERVFAIL, and the surgical bypass mitigation.
- patterns/serve-stale-over-servfail — the primary failure-mode alternative; serve the cached answer instead of surfacing the SERVFAIL.