DNS outage recovery¶
Definition¶
DNS outage recovery is the operational category of restoring service after DNS entries themselves have been lost, corrupted, or made unresolvable — distinct from ordinary service outages where DNS continues to work. The defining property: the tooling you normally use to diagnose and fix production (consoles, CI/CD, observability dashboards, cloud APIs) also depends on DNS and is simultaneously unavailable.
Why DNS outage is a distinct class¶
Three properties make DNS outage recovery different from regular incident response:
- The control plane for recovery is affected too. Internal tools reach the cloud provider through hostnames; those hostnames resolve through the same DNS whose entries are missing. The cloud infrastructure team is usually the only group still able to reach the provider — typically via IAM roles that work against raw endpoints without DNS.
- Recovery starts from cache. Running processes that resolved a hostname minutes before the failure still have the answer in their local DNS cache. Until that cache expires (TTL), they continue to work. This creates a race against the TTL clock: whoever copies cached DNS entries into a restoration file before the TTL expires holds a recovery artifact; afterwards, no one does.
- Recovery is ordered by dependency. Essential tooling (authentication, consoles, deploy pipelines) is restored first; then core infrastructure (shared databases, queues, ingress routers); then customer-facing services. Out-of-order recovery hits a circular-dependency failure: you need X to restore Y, but X depends on Y.
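The cache-harvesting idea above can be sketched in a few lines: snapshot whatever the local resolver can still answer for a known-critical hostname list, before the cached entries expire. This is a minimal illustration, not Zalando's tooling; the hostname list and output path are hypothetical.

```python
import json
import socket
import time

# Hypothetical list — in practice: consoles, CI/CD, auth, ingress hostnames.
CRITICAL_HOSTS = ["localhost"]

def harvest(hosts):
    """Return {hostname: [ip, ...]} for every name that still resolves locally."""
    snapshot = {}
    for host in hosts:
        try:
            infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
            # sockaddr[0] is the resolved address; dedupe and sort for stability.
            snapshot[host] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            # Already expired from cache (or never cached): nothing to save.
            snapshot[host] = []
    return snapshot

if __name__ == "__main__":
    # Persist the snapshot as a restoration artifact before TTLs run out.
    with open("dns-restoration.json", "w") as f:
        json.dump({"harvested_at": time.time(),
                   "entries": harvest(CRITICAL_HOSTS)}, f, indent=2)
```

Note the non-determinism the caveats section describes: this only captures names the local resolver happens to still hold, so each operator's snapshot differs.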
Zalando's playbook from the metadpata incident¶
From the 2024-01 postmortem:
"Some helpful and lucky souls hastily started to copy their cached DNS entries before they expired. It was an all hands on deck situation with everyone focused on the single goal of restoring service for our customers ASAP. What followed in the incident call was a controlled disaster recovery with colleagues manually restoring DNS entries starting with essential tooling, followed by core infrastructure, and the services powering our on-site experiences to restore service for our customers." — sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools
Three named steps:
- Cache harvesting under TTL pressure. Individuals with local caches of the deleted zones manually export their cached resolutions before the TTL expires.
- Tiered restoration. Essential tooling → core infrastructure → on-site services. Explicitly ordered, driven by the Incident Commander on the call.
- Rotating Incident Commanders by expertise area. "Throughout the incident response we switched the Incident Commanders according to their areas of expertise which kept the incident response focused and efficient." DNS recovery's phases need different skillsets (DNS, IAM, load balancers, service health); rotating the IC matches the phase to the skillset.
Recovery time disclosed: "we were back online in a few hours." No precise MTTR published.
Who stays online during DNS collapse¶
The asymmetry Zalando names directly: "all of us except for the cloud infrastructure team were locked out of accessing AWS accounts and internal tools due to missing DNS entries." The cloud infrastructure team retains access because:
- They often have IP-based or direct-endpoint access paths.
- Their tooling is designed to survive DNS loss (they are the team that built the DNS in the first place).
- They typically hold IAM roles that work against region-endpoint URLs without zone-specific CNAMEs.
The implication: only the cloud infrastructure team can recover. Every other team is blocked, which is why "60+ others are in the call" trying to help from the outside rather than recover themselves.
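The surviving access path rests on the fact that AWS service API hostnames follow a fixed, publicly resolvable pattern, so a client pinned to them needs none of the deleted internal zones. A minimal sketch of that pattern (the URL scheme is AWS's documented convention; treating it as the team's actual access method is an assumption):

```python
def aws_endpoint(service, region=None):
    """Return the well-known endpoint URL for an AWS service.

    Global services (e.g. Route 53, IAM) use a single endpoint with no
    region component; regional services embed the region in the hostname.
    Both resolve via public DNS, independent of any internal hosted zones.
    """
    if region is None:
        return f"https://{service}.amazonaws.com"
    return f"https://{service}.{region}.amazonaws.com"
```

For example, pointing a CLI or SDK at `aws_endpoint("route53")` works even while every zone-specific CNAME in the deleted internal zones is gone.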
Preventative architectures (downstream of this concept)¶
The metadpata incident's full remediation set (catalogued on concepts/supertool and concepts/destructive-automation-blast-radius) aims at preventing a DNS outage from occurring in the first place:
- patterns/scream-test-before-destructive-delete — a one-week reversible-isolation period.
- concepts/cost-weighted-deletion-deferral — DNS zones move out of automated-cleanup scope entirely.
- patterns/pr-preview-of-cloudformation-changeset — the PR would have shown "deleting N hosted zones" and been rejected.
- patterns/phased-rollout-across-release-channels — would have hit playground only, not production.
- patterns/jsonschema-validated-config-both-local-and-ci — would have flagged the metadpata typo itself at commit time.
Caveats¶
- TTL races are non-deterministic. Whose cache has which entries at the moment of failure depends on access patterns in the preceding TTL window. Recovery from cache is a best-effort operation.
- Cached entries may be stale. If DNS records were rotated in the hours before the outage, the cache has the old values.
- No backup of DNS records is treated as authoritative here. Zalando's post is silent on whether there's a standard DNS-records backup; the recovery story is cache-based.
- This concept page is specific to recovery. Prevention and DNS-as-SPOF framings are different axes; see concepts/blast-radius and the Stripe DNS rate-limit source for related DNS infrastructure failure modes.
Seen in¶
- sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools — canonical instance. A supertool metadpata typo caused fleet-wide Route 53 hosted-zone deletion; recovery ran on cached entries harvested before TTL expiry plus tiered restoration ordered essential tooling → core infrastructure → on-site services.