PATTERN
Serve-stale over SERVFAIL¶
Intent¶
When a DNS recursive resolver cannot fetch a fresh authoritative answer — because the upstream nameserver is down, returning broken DNSSEC signatures, or otherwise failing — serve the last-known-good cached record past its TTL rather than returning SERVFAIL. Codified for DNS in RFC 8767 (Serving Stale Data to Improve DNS Resiliency, 2020).
The pattern is the DNS-resolver realisation of fail-stale — prefer bounded-staleness-on-good-data to an error that makes the name unreachable.
Problem¶
A DNS recursive resolver has exactly two states per cached record under plain TTL semantics:
- Fresh: TTL not yet elapsed, serve from cache, no upstream call.
- Expired: TTL elapsed, re-fetch from upstream.
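The binary fresh/expired decision can be sketched as follows (a simplified model; the cache shape, `upstream` callable, and names are illustrative, not any real resolver's API):

```python
import time

def lookup_plain_ttl(cache, upstream, name):
    """Plain TTL semantics: a cached record is either fresh or unusable."""
    entry = cache.get(name)  # entry is (record, expires_at) or None
    if entry is not None:
        record, expires_at = entry
        if time.monotonic() < expires_at:
            return record  # fresh: serve from cache, no upstream call
    # Expired or absent: must re-fetch. If upstream raises, the client
    # gets SERVFAIL -- there is no third option under plain TTL rules.
    record, ttl = upstream(name)
    cache[name] = (record, time.monotonic() + ttl)
    return record
```

Note that the cached copy is discarded the instant the TTL elapses, even if it is the only answer available.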
When upstream is broken, every TTL-expiry moment converts into a SERVFAIL. The SERVFAIL rate climbs as more records age out — this is exactly what the 2026-05-05 .de DNSSEC break produced:
"After the immediate spike in SERVFAILs at 19:30 UTC, it climbed steadily over the following three hours as cached records slowly started expiring. As each domain's cached records expired and resolvers went back to DENIC for fresh copies, they got back broken signatures and started failing."
— sources/2026-05-06-cloudflare-when-dnssec-goes-wrong-de-tld-outage
Without serve-stale, a 3-hour cache-age distribution converts into 3 hours of steadily-climbing user impact.
Solution¶
RFC 8767 introduces a third state between fresh and expired:
- Stale but servable: TTL elapsed, but upstream attempts have failed. Serve the old record to the client; retry upstream in the background.
The validating resolver:
- Attempts the upstream fetch on TTL expiry.
- On upstream failure (SERVFAIL, timeout, broken DNSSEC), keeps the expired record in cache.
- Serves the expired record to the client with a bounded staleness window (RFC 8767 suggests up to 1–3 days).
- Continues to retry upstream in the background until either upstream recovers or the staleness bound is exceeded.
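The steps above can be sketched as a three-state lookup (a simplified, synchronous model: the retry here happens on the request path, whereas a real resolver would retry asynchronously in the background; names and cache shape are illustrative):

```python
import time

STALE_CEILING = 3 * 24 * 3600  # staleness bound; RFC 8767 suggests on the order of 1-3 days

def resolve(cache, upstream, name):
    """Serve-stale resolution per RFC 8767 (simplified sketch)."""
    now = time.monotonic()
    entry = cache.get(name)  # (record, expires_at) or None
    if entry is not None and now < entry[1]:
        return entry[0]                      # state 1: fresh
    try:
        record, ttl = upstream(name)
        cache[name] = (record, now + ttl)
        return record                        # state 2: expired, re-fetch succeeded
    except Exception:
        # state 3: stale but servable -- upstream failed, keep and serve
        # the last-known-good record instead of returning SERVFAIL
        if entry is not None and now < entry[1] + STALE_CEILING:
            return entry[0]
        raise  # past the staleness bound: SERVFAIL after all
```

The key behavioural change is in the `except` branch: an upstream failure no longer evicts or ignores the expired record, it promotes it to "stale but servable" until the ceiling is reached.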
The effect on the 2026-05-05 .de incident:
"What might be surprising is that the NOERROR rate stayed relatively stable throughout the incident. That's 'serve stale' at work."
"This significantly cushions the impact of an upstream outage, buying time for operators to respond."
Structure¶
Client query for record R
│
▼
┌─────────────────────┐
│ Resolver cache has R│
│ (fresh or stale) │
└─────────────────────┘
│
├─── Fresh → serve, no upstream call
│
├─── Expired, upstream fetch succeeds → refresh cache, serve fresh
│
└─── Expired, upstream fetch fails → serve stale, retry in background
When it fits¶
- Recursive DNS resolver (public like 1.1.1.1, or internal origin-resolution for a CDN).
- The authenticity of cached records is reliably higher than the freshness of any possible replacement during an upstream outage (true for DNSSEC-validated records: the signature is in cache too).
- The records don't change very often. A-record + NS-record data typically has hours-to-days stability even when the upstream experiences minutes of turbulence; stale answers are usually still correct.
- Complementary to NTA: serve-stale absorbs the first hours; NTA ends impact once scope is confirmed and community coordination has aligned.
When it doesn't fit¶
- Records that flip often. Low-TTL A records that point at short-lived cloud instances are a poor fit — the stale answer is likely wrong.
- Strict-freshness workloads. Service-discovery-style hostnames where staleness might route traffic to a decommissioned endpoint.
- Non-DNS caches with different consistency requirements. HTTP caches have their own stale-while-revalidate semantics (RFC 5861) that predate RFC 8767 and target different failure modes.
Substrate requirements¶
- Cache retention past TTL. The resolver must keep expired records in cache rather than evicting them at TTL-expiry time.
- Background-fetch capability. On a serve-stale response, the resolver should attempt an async upstream refresh so the cache isn't only serving stale.
- Bounded staleness. Per RFC 8767, typically 1–3 days; the ceiling ensures stale doesn't become permanent.
- Correct DNSSEC behaviour — cached records still carry their signatures; a validating resolver serving stale is serving previously-validated data, not bypassing DNSSEC. (This is the property that makes serve-stale compatible with strict validation.)
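The background-fetch requirement can be sketched like this (hypothetical helper: serve the stale record immediately, then refresh off the request path; a production resolver would use its own event loop and de-duplicate concurrent refreshes rather than spawn a thread per query):

```python
import threading
import time

def serve_stale_with_refresh(cache, upstream, name, stale_record):
    """Return the stale answer now; attempt an async cache refresh."""
    def refresh():
        try:
            fresh, ttl = upstream(name)
            cache[name] = (fresh, time.monotonic() + ttl)  # upstream recovered
        except Exception:
            pass  # still broken; a later query triggers the next retry
    threading.Thread(target=refresh, daemon=True).start()
    return stale_record  # the client is never blocked on the retry
```

This is what keeps serve-stale from becoming a one-way door: the cache converges back to fresh data as soon as upstream recovers, without any client ever waiting on the retry.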
Failure modes¶
- Stale record is actually wrong. If the operator rotated records during the outage, the stale answer sends clients to dead endpoints. Accepted cost; usually preferable to SERVFAIL.
- Unbounded staleness. A resolver that never stops serving stale will keep returning very old records long after upstream recovery if background fetches also continue to fail. Bounded staleness + eventual eviction resolves this.
- Upstream recovers but clients keep seeing stale answers. Resolvers vary on whether to prefer a fresh re-fetch over a still-servable stale entry once upstream is reachable again; the right setting is to re-fetch aggressively as soon as upstream is healthy.
Canonical instance¶
2026-05-05 DENIC .de DNSSEC break absorbed by 1.1.1.1. Per the 2026-05-06 Cloudflare post, serve-stale (RFC 8767) implemented in Big Pineapple kept the NOERROR rate "relatively stable throughout the incident" for ~3 hours, during which Cloudflare investigated and deployed the NTA-equivalent override at 22:17 UTC. (Source: sources/2026-05-06-cloudflare-when-dnssec-goes-wrong-de-tld-outage.)
Seen in¶
- sources/2026-05-06-cloudflare-when-dnssec-goes-wrong-de-tld-outage — canonical wiki instance of serve-stale as the DNS-resolver realisation of fail-stale at recursive-resolver altitude. RFC 8767 quoted directly. The stable NOERROR-rate graph + Cloudflare's "serving stale" framing makes this the cleanest production illustration of the pattern on the wiki.
Related¶
- concepts/fail-stale — the general principle this pattern specialises to DNS-resolver altitude.
- concepts/dns-resolver-caching — the cache substrate serve-stale extends.
- concepts/stale-while-revalidate-cache — the HTTP-cache sibling (RFC 5861); same failure-mode posture, different protocol altitude.
- patterns/negative-trust-anchor-for-tld-outage — the complementary mitigation; serve-stale absorbs hours, NTA ends impact.
- concepts/dns-servfail-response — the error that serve-stale replaces with a successful-but-stale answer.