CONCEPT Cited by 2 sources

DNS SERVFAIL response

SERVFAIL (RCODE 2) is the generic "server failure" response code a DNS server returns when it cannot complete resolution: an upstream timeout, a policy rejection, an internal failure, an unreachable authoritative nameserver, or a DNSSEC validation failure. It tells the client only that something went wrong; the cause itself is opaque.
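To make "RCODE 2" concrete, the sketch below pulls the response code out of a raw DNS message: the RCODE is the low four bits of the 16-bit flags word in the 12-byte header. The function name `rcode_of` and the sample bytes are illustrative assumptions, not taken from either cited post.

```python
import struct

# Basic RCODE names (values 0-5); anything else is reported numerically.
RCODE_NAMES = {
    0: "NOERROR",
    1: "FORMERR",
    2: "SERVFAIL",
    3: "NXDOMAIN",
    4: "NOTIMP",
    5: "REFUSED",
}


def rcode_of(message: bytes) -> str:
    """Return the RCODE name carried in a raw DNS message header."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 bytes; message is too short")
    _txid, flags = struct.unpack("!HH", message[:4])
    rcode = flags & 0x000F  # low four bits of the flags word
    return RCODE_NAMES.get(rcode, f"RCODE{rcode}")


# Hypothetical response header: QR=1, RD=1, RA=1, RCODE=2, one question.
sample = struct.pack("!HHHHHH", 0x1234, 0x8182, 1, 0, 0, 0)
print(rcode_of(sample))  # prints SERVFAIL
```

Nothing else in the header distinguishes one cause from another, which is why the code alone cannot say what failed.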

Investigation implication

Because SERVFAIL is an opaque response code, it cannot serve as the bottom-of-stack signal in a root-cause investigation. The engineer must pivot from "we're seeing SERVFAIL" to other signals:

  • Resolver internal metrics. Unbound's request-list depth metric shows whether queries are queuing locally.
  • Packet-rate observation. A packet-level rate metric (see patterns/iptables-packet-counter-for-rate-metric) can reveal whether the outbound side is hitting a cap.
  • Manual reproduction via dig. Running the failing query by hand against each resolver in the chain localises which layer is slow or failing.
  • Request-list dump. unbound-control dump_requestlist shows pending queries and what they're waiting on.
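The manual-reproduction step can also be scripted. The following is a minimal sketch, assuming plain UDP on port 53 and IPv4 resolver addresses; `build_query`, `classify`, `probe`, and `probe_chain` are hypothetical helper names, and the sketch does not check the transaction ID or retry over TCP the way dig does.

```python
import socket
import struct
import time

NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, qtype: int = 1, txid: int = 0x1234) -> bytes:
    """Encode a one-question query (qtype 1 = A, class IN) with RD set."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = [label for label in name.split(".") if label]
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def classify(reply: bytes) -> str:
    """Map a raw reply to an RCODE name, or MALFORMED if it is too short."""
    if len(reply) < 12:
        return "MALFORMED"
    rcode = struct.unpack("!H", reply[2:4])[0] & 0x000F
    return NAMES.get(rcode, f"RCODE{rcode}")


def probe(resolver: str, name: str, timeout: float = 2.0, port: int = 53):
    """Send one UDP query to a resolver; return (outcome, elapsed_seconds)."""
    start = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (resolver, port))
            reply, _addr = sock.recvfrom(4096)
        except socket.timeout:
            return "TIMEOUT", time.monotonic() - start
    return classify(reply), time.monotonic() - start


def probe_chain(chain, name):
    """Probe every hop, closest first, so the failing layer stands out."""
    return [(hop, *probe(hop, name)) for hop in chain]
```

Calling `probe_chain(["127.0.0.1", "10.0.0.2"], "example.com")` (both addresses are placeholders) returns one `(hop, outcome, seconds)` tuple per resolver. A hop that answers NOERROR quickly while the next one times out or returns SERVFAIL points to where the problem sits.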

Seen in

  • Cloudflare, "When DNSSEC goes wrong: how we responded to the .de TLD outage" (2026-05-06). Canonical wiki instance of SERVFAIL as the spec-mandated response to a DNSSEC validation failure. When DENIC published non-validatable signatures during the 2026-05-05 .de TLD rollover, every validating resolver on the Internet, 1.1.1.1 included, "was required by the DNSSEC specification to reject them and return SERVFAIL to clients." The SERVFAIL rate on 1.1.1.1 "climbed steadily over the following three hours as cached records slowly started expiring", a canonical illustration of the SERVFAIL rate being TTL-gated when the upstream is broken: failures appear only as each cached record expires. The post also disclosed a 1.1.1.1 bug in Extended DNS Error (EDE) propagation: DNSSEC Bogus errors surfaced as EDE 22 ("No Reachable Authority") rather than EDE 6 ("DNSSEC Bogus"), compounding the opacity of the SERVFAIL itself. Mitigation paired serve-stale (RFC 8767, which kept NOERROR rates stable for hours) with a Negative Trust Anchor equivalent (deployed at 22:17 UTC). (Source: sources/2026-05-06-cloudflare-when-dnssec-goes-wrong-de-tld-outage.)

  • Stripe, "The secret life of DNS packets" (2024-12-12). Stripe's initial alert signal was a low-percentage SERVFAIL rate on internal requests during hourly spikes. The post explicitly calls out the opacity: "SERVFAIL is a generic response that DNS servers return when an error occurs, but it doesn't tell us much about what caused the error." The investigation pivoted to Unbound's request-list depth metric to localise the problem.
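The EDE codes mentioned in the Cloudflare entry travel as an EDNS option inside the OPT record of the response. As a rough sketch, assuming the OPT RDATA has already been extracted from the message, the option list can be walked like this; `extended_errors` is a hypothetical helper, and only the two info-codes named above are given labels.

```python
import struct

EDE_OPTION_CODE = 15  # EDNS option code for Extended DNS Errors (RFC 8914)

# Only the two info-codes discussed in this page are named here.
EDE_NAMES = {6: "DNSSEC Bogus", 22: "No Reachable Authority"}


def extended_errors(opt_rdata: bytes):
    """Return (info_code, name, extra_text) for each EDE option found."""
    results = []
    offset = 0
    # Each option is: 2-byte code, 2-byte length, then `length` bytes of data.
    while offset + 4 <= len(opt_rdata):
        code, length = struct.unpack_from("!HH", opt_rdata, offset)
        offset += 4
        data = opt_rdata[offset:offset + length]
        offset += length
        if code == EDE_OPTION_CODE and len(data) >= 2:
            info = struct.unpack_from("!H", data)[0]
            text = data[2:].decode("utf-8", errors="replace")
            results.append((info, EDE_NAMES.get(info, f"EDE {info}"), text))
    return results


# Hypothetical OPT RDATA: one EDE option, info-code 22, no extra text.
sample = struct.pack("!HHH", EDE_OPTION_CODE, 2, 22)
print(extended_errors(sample))  # prints [(22, 'No Reachable Authority', '')]
```

A client that logs this tuple alongside the SERVFAIL sees whatever cause the resolver reported, so a mislabelled code such as 22 in place of 6 can send the investigation in the wrong direction.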
