PATTERN Cited by 1 source
DNS health check cutover¶
Pattern¶
Cut over production traffic from the original (compromised / failed) environment to the rebuilt/recovered environment using DNS records with health checks so traffic shifts only when the new environment passes a readiness signal. The health check acts as a readiness gate between "infrastructure is built" and "traffic is flowing".
Verbatim from the canonicalising source:
"Use DNS records with health checks so traffic only shifts when the new environment is ready to serve it." (Source: sources/2026-05-20-aws-cyber-resilience-on-aws-a-reference-approach-for-recovery-from-ransomware-and-destructive-events)
Why DNS health checks for cutover¶
DNS-based traffic shifting has properties that make it well-suited to the cyber-event cutover (Stage 5 of the parallel recovery workflow):
- Health check decouples readiness from cutover trigger. The recovery team initiates the cutover, but traffic only shifts when the new environment affirms it's ready via the health check response. Avoids the "we cut over and then realised it wasn't ready" failure mode.
- Granularity. DNS records can be split per service / per endpoint, so cutover can be incremental rather than all-or-nothing.
- Reversibility. If the rebuilt environment has a problem, reverting the DNS records shifts traffic back to the original (still-isolated) environment once cached answers expire, without code changes.
- No application changes. The pattern works with any application that resolves endpoints by DNS, which covers most web services, database clients, and API callers.
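As a minimal sketch of how these properties map onto records, the following builds a Route 53 style change batch with a health-checked PRIMARY record for the rebuilt environment and a SECONDARY fallback. It assumes failover routing is the chosen policy, which the source does not prescribe; the function name, record name, IP addresses, and health check ID are hypothetical placeholders.

```python
from typing import Any, Optional


def failover_change_batch(
    record_name: str,
    rebuilt_ip: str,
    fallback_ip: str,
    rebuilt_health_check_id: str,
    ttl: int = 60,
) -> dict[str, Any]:
    """Build a ChangeBatch dict: health-checked PRIMARY plus SECONDARY fallback."""

    def change(set_id: str, role: str, ip: str,
               health_check_id: Optional[str] = None) -> dict[str, Any]:
        rrset: dict[str, Any] = {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": set_id,   # distinguishes records sharing a name
            "Failover": role,          # "PRIMARY" or "SECONDARY"
            "TTL": ttl,                # short TTL keeps the shift (and rollback) fast
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id is not None:
            # The readiness gate: the PRIMARY is only served while this check passes.
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Cutover to rebuilt environment, gated on health check",
        "Changes": [
            change("rebuilt", "PRIMARY", rebuilt_ip, rebuilt_health_check_id),
            change("fallback", "SECONDARY", fallback_ip),
        ],
    }


# Hypothetical values (documentation IP ranges, made-up health check ID).
batch = failover_change_batch(
    "app.example.com.", "203.0.113.10", "198.51.100.20", "hc-example-1234"
)
```

The resulting dict has the shape expected by the `ChangeBatch` argument of boto3's `change_resource_record_sets`; the hosted zone ID and the actual API call are left out because they depend on the deployment. Building one batch per service name is one way to get the per-endpoint granularity described above.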
The readiness signal¶
The health check is the readiness signal, so the first design question is what "ready to serve" means for this workload.
Common readiness checks:
- HTTP endpoint returns 200 with expected response body.
- Database returns successful query for a known sentinel record.
- Synthetic transaction completes end-to-end (e.g. login + simple read).
- Multi-region health propagation — health check from multiple geos all pass.
The readiness check should be comprehensive enough to catch "infrastructure is up but data isn't ready" (a common cyber-event failure mode where IaC rebuild succeeded but data restore is still in progress).
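One way to make the check comprehensive is to aggregate several probes behind a single readiness result and fail closed. The sketch below is illustrative only: `evaluate_readiness`, `ProbeResult`, and the probe names are hypothetical, and the stub probes stand in for real HTTP, database, and synthetic-transaction checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProbeResult:
    name: str
    passed: bool
    detail: str = ""


def evaluate_readiness(
    probes: dict[str, Callable[[], bool]],
) -> tuple[bool, list[ProbeResult]]:
    """Run every probe; report ready only if all of them pass."""
    results: list[ProbeResult] = []
    for name, probe in probes.items():
        try:
            passed = bool(probe())
            detail = "" if passed else "probe returned a falsy result"
        except Exception as exc:  # a crashing probe counts as not ready
            passed, detail = False, f"{type(exc).__name__}: {exc}"
        results.append(ProbeResult(name, passed, detail))
    # Fail closed: an empty probe set is treated as not ready.
    ready = bool(results) and all(r.passed for r in results)
    return ready, results


# Stub probes: infrastructure is up, but the data restore has not finished.
ready, results = evaluate_readiness({
    "http_200_expected_body": lambda: True,
    "sentinel_record_query": lambda: False,   # sentinel row not restored yet
    "synthetic_login_and_read": lambda: True,
})
failed = [r.name for r in results if not r.passed]
```

A health endpoint in the rebuilt environment could return 200 only when `ready` is true, so the DNS health check sees a single signal that covers data as well as infrastructure.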
Composition with cross-account-reference updates¶
Stage 5 of the cyber-recovery workflow has a critical pre-cutover step: update cross-account references. Verbatim:
"Before cutover, identify and update cross-account references that point to the original Production Account, IAM role trust policies, resource-based policies, AWS KMS key grants, and service integrations. IAM Access Analyzer and AWS Config can help identify these dependencies."
The DNS cutover is the last step of Stage 5, performed only after cross-account references are updated. Otherwise the rebuilt environment fails to reach dependent services in other accounts and the health check fails.
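As a small illustration of what "identify cross-account references" can look like for one policy type, the sketch below scans an IAM-style policy document for `Principal.AWS` entries that name a given account. It is not a substitute for IAM Access Analyzer or AWS Config, and it does not cover KMS grants or service integrations; the function name and the account IDs are hypothetical.

```python
def principals_referencing_account(policy: dict, account_id: str) -> list[str]:
    """Return Principal.AWS entries in a policy document that name account_id."""
    hits: list[str] = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):       # a single statement may be a bare object
        statements = [statements]
    for statement in statements:
        principal = statement.get("Principal", {})
        if isinstance(principal, str):     # e.g. the "*" wildcard
            continue
        aws = principal.get("AWS", [])
        if isinstance(aws, str):           # one principal may be a bare string
            aws = [aws]
        for entry in aws:
            # Matches a bare account ID or an ARN such as arn:aws:iam::<id>:root.
            if entry == account_id or f":{account_id}:" in entry:
                hits.append(entry)
    return hits


# Hypothetical trust policy still pointing at the original Production Account.
ORIGINAL_ACCOUNT = "111122223333"
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "sts:AssumeRole",
         "Principal": {"AWS": "arn:aws:iam::111122223333:root"}},
        {"Effect": "Allow", "Action": "sts:AssumeRole",
         "Principal": {"AWS": ["arn:aws:iam::444455556666:role/ci", "111122223333"]}},
    ],
}
stale = principals_referencing_account(trust_policy, ORIGINAL_ACCOUNT)
```

Any non-empty result marks a policy that would need updating to the rebuilt account before the DNS cutover proceeds.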
Post-cutover monitoring¶
The canonicalising source's cutover guidance:
"Monitor the transition and keep the affected Production Account isolated until the investigation is complete."
Two monitoring requirements:
- Transition monitoring — verify traffic actually shifts; watch for clients pinned to the old DNS record (long TTLs); watch for error-rate increases as load shifts.
- Old account isolation — keep the original Production Account isolated for investigation continuity. Don't decommission the old account during cutover — it has forensic value, and a quick rollback may still be needed.
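The transition-monitoring requirement can be sketched as a check over one observation window. This assumes per-environment request and error counts are available from somewhere such as load balancer metrics or flow logs; `WindowStats`, `assess_transition`, and the thresholds are hypothetical and would need tuning per workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    old_requests: int   # requests still arriving at the original endpoint
    new_requests: int   # requests served by the rebuilt environment
    new_errors: int     # failed requests in the rebuilt environment


def assess_transition(
    window: WindowStats,
    max_old_share: float = 0.05,
    max_error_rate: float = 0.01,
) -> list[str]:
    """Return human-readable findings; an empty list means the window looks clean."""
    total = window.old_requests + window.new_requests
    if total == 0:
        return ["no traffic observed in window"]
    findings: list[str] = []
    old_share = window.old_requests / total
    if old_share > max_old_share:
        # Typical cause: clients pinned to the old record by long TTLs or caches.
        findings.append(f"{old_share:.1%} of requests still target the old endpoint")
    if window.new_requests > 0:
        error_rate = window.new_errors / window.new_requests
        if error_rate > max_error_rate:
            findings.append(f"error rate {error_rate:.1%} in rebuilt environment")
    return findings


clean = assess_transition(WindowStats(old_requests=0, new_requests=1000, new_errors=2))
noisy = assess_transition(WindowStats(old_requests=300, new_requests=700, new_errors=70))
```

Running this on successive windows gives a simple trend: the old-endpoint share should fall toward zero as caches expire, while the error rate should stay flat as load shifts.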
When to use this pattern¶
Use this pattern when:
- Cutting over to a rebuilt cyber-recovery environment.
- The workload uses DNS-resolvable endpoints (most do).
- A meaningful readiness check can be defined (most workloads can).
Weaker fit when:
- The workload uses hard-coded IP addresses or non-DNS endpoint resolution.
- DNS TTLs are very long (cutover can take hours to propagate); consider AWS Global Accelerator or service-mesh routing as alternatives.
- The workload's clients aggressively cache DNS resolutions (mobile apps with stale resolver caches, JVM applications with long DNS-cache TTLs).
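To judge whether TTLs and client caching make the pattern a weak fit, a rough upper bound on the shift time helps. The sketch below assumes a simple model: the health check must pass a number of consecutive times before the record flips, resolvers then hold the old answer for up to the record TTL, and a client-side cache (for example a JVM DNS cache) may hold it for its own TTL on top of that. The function and parameter names are hypothetical.

```python
def worst_case_shift_seconds(
    record_ttl: int,
    check_interval: int,
    consecutive_checks: int,
    client_cache_ttl: int = 0,
) -> int:
    """Rough upper bound on time from 'environment becomes healthy' to 'client moves'."""
    # Time for the health check to observe enough consecutive passes.
    detection = check_interval * consecutive_checks
    # A resolver may have cached the old answer just before the record changed.
    resolver_staleness = record_ttl
    # A client may cache the resolver's answer just before it expires there.
    return detection + resolver_staleness + client_cache_ttl


short_ttl = worst_case_shift_seconds(record_ttl=60, check_interval=30, consecutive_checks=3)
long_ttl = worst_case_shift_seconds(record_ttl=86400, check_interval=30, consecutive_checks=3)
jvm_client = worst_case_shift_seconds(60, 30, 3, client_cache_ttl=300)
```

With a 60 second TTL the bound is a few minutes; with a one-day TTL it is over 24 hours, which is the situation where non-DNS routing alternatives become more attractive. Clients that cache indefinitely fall outside this model entirely.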
Composition with other patterns¶
- patterns/parallel-investigation-validation-rebuild — Stage 5 of this workflow uses DNS-health-check cutover.
- concepts/parallel-recovery-stages — the parent workflow concept.
- concepts/cyber-resilience — the recovery posture.
Failure modes¶
- Long client-side DNS caching. Even after DNS update, clients keep resolving the old endpoint. Mitigation: short TTLs prior to cutover; communicate cutover to clients with stale-cache risk.
- Health check passes but workload isn't actually ready. Health check is too narrow (e.g. just HTTP 200 on /health) and misses data-not-restored or downstream-dependency-failed scenarios. Mitigation: comprehensive synthetic transaction as health check.
- Health check uses production-account dependency. Health check itself requires a service in the (still-isolated) original Production Account; check fails for the wrong reason. Mitigation: health check infrastructure must live in the rebuilt environment and must not depend on anything in the original account.
- Cross-account references not updated before cutover. Rebuilt environment can't reach dependencies; health check fails. Mitigation: update references first; verify dependencies; then cut over.
- No rollback plan. If rebuilt environment has a latent issue that surfaces post-cutover, no procedure to shift traffic back. Mitigation: keep the original Production Account isolated but not decommissioned; document rollback DNS update.
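The mitigations above imply an ordering, which can be sketched as a pre-cutover gate that reports the first unmet precondition. The step keys, their order, and the `first_blocker` helper are hypothetical; a real runbook would back each flag with evidence rather than a boolean.

```python
from typing import Optional

# Ordered preconditions, one per failure mode listed above.
PRE_CUTOVER_STEPS: tuple[tuple[str, str], ...] = (
    ("ttls_shortened", "shorten DNS TTLs ahead of cutover"),
    ("references_updated", "update cross-account references"),
    ("dependencies_verified", "verify the rebuilt environment reaches its dependencies"),
    ("health_check_isolated", "remove health check dependencies on the original account"),
    ("synthetic_transaction_passing", "pass the end-to-end synthetic transaction"),
    ("rollback_documented", "document the rollback DNS update"),
)


def first_blocker(status: dict[str, bool]) -> Optional[str]:
    """Return the first unmet precondition, or None when cutover may proceed."""
    for key, description in PRE_CUTOVER_STEPS:
        if not status.get(key, False):   # a missing entry counts as not done
            return description
    return None


status = {key: True for key, _ in PRE_CUTOVER_STEPS}
status["rollback_documented"] = False
blocker = first_blocker(status)
```

Treating a missing entry as not done keeps the gate fail-closed, in the same spirit as the health check itself.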
Generalisation beyond AWS¶
The pattern applies wherever:
- DNS-based endpoint resolution is in use.
- Health checks can be configured as a readiness gate.
- Multiple endpoints can be defined for primary + secondary.
GCP / Azure / on-prem equivalents:
- GCP Cloud DNS with health-check-based routing policies.
- Azure Traffic Manager with health-probe-based endpoint selection.
- On-prem: BIND / dnsmasq / corporate DNS with monitoring-driven record updates; alternatives like HAProxy with health-check-driven backend selection.
The structural property is traffic flow gated on health-check affirmation, decoupled from the cutover trigger.
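Stripped of any provider, that property can be modelled as a small state holder: the operator's trigger and the health-check result are separate inputs, and the target changes only when both agree. This is a provider-agnostic sketch; `CutoverGate`, its fields, and the consecutive-pass threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CutoverGate:
    required_passes: int = 3      # consecutive passing checks needed
    triggered: bool = False       # set by the recovery team
    consecutive_passes: int = 0   # maintained by the health checker

    def trigger(self) -> None:
        """Recovery team initiates the cutover; this alone moves no traffic."""
        self.triggered = True

    def record_check(self, passed: bool) -> None:
        """Feed in one health-check result; any failure resets the streak."""
        self.consecutive_passes = self.consecutive_passes + 1 if passed else 0

    @property
    def target(self) -> str:
        healthy = self.consecutive_passes >= self.required_passes
        return "rebuilt" if self.triggered and healthy else "original"


gate = CutoverGate(required_passes=2)
gate.trigger()
before_checks = gate.target      # triggered, but not yet affirmed healthy
gate.record_check(True)
gate.record_check(True)
after_checks = gate.target       # trigger plus affirmation
gate.record_check(False)
after_failure = gate.target      # a failed check moves traffic back
```

The same shape applies whether the gate is enforced by a managed DNS service, a traffic manager, or a load balancer's backend selection.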
Seen in¶
- sources/2026-05-20-aws-cyber-resilience-on-aws-a-reference-approach-for-recovery-from-ransomware-and-destructive-events — canonical wiki reference; "Use DNS records with health checks so traffic only shifts when the new environment is ready to serve it"; cross-account-reference update before cutover; old account isolation during transition.
Related¶
- concepts/cyber-resilience — the parent posture.
- concepts/parallel-recovery-stages — the workflow this pattern participates in.
- patterns/parallel-investigation-validation-rebuild — Stage 5 uses this pattern.
- systems/amazon-route53 — the DNS health-check substrate.