AWS 2026-05-20 Tier 1

AWS Architecture Blog — Cyber resilience on AWS: A reference approach for recovery from ransomware and destructive events

Summary

A 2026-05-20 AWS Architecture Blog post that lays out a complete cyber-resilience reference architecture for recovering AWS workloads after ransomware, data extortion, or other destructive events. The piece is structured around a recovery posture in which the production environment, the backups, and the recovery path itself may all be untrustworthy after a confirmed cyber event, and it walks through five architectural primitives that the recovery design has to supply:

  1. A three-account isolation topology (Production / Recovery / Isolated Recovery Environment) inside one AWS Organization, where the Recovery Account owns deletion-protected backups and the IRE owns the rebuild surface.
  2. The AWS Backup logically air-gapped vault in Compliance mode as the service-enforced deletion-protection primitive: no principal, including the account root user or a compromised administrator, can shorten retention or delete recovery points within their retention period.
  3. A multi-layer validation pipeline combining AWS Backup Restore Testing, Amazon GuardDuty Malware Protection, AWS Marketplace partner content scanners, workload-specific consistency / invariant / configuration-diff checks, and log and audit review across the backup window, all running inside the IRE so that a tainted restore stays contained.
  4. A five-stage recovery workflow in which Stages 1–2 (investigation timeline + candidate validation) and Stage 4 (infrastructure rebuild from IaC in a separate version-controlled repository) run in parallel, gated by a Stage 3 Multi-party approval (MPA) authorization step before validated data is restored, with Stage 5 doing the cross-account-reference update + DNS health-check cutover.
  5. The Rebuild-Restore-Rotate framework as the decision rubric for what comes from where ("Infrastructure is code. Data is backup. Credentials are new"), with the explicit caveat that the framework assumes the IaC source itself wasn't the attack target.

The post also frames recovery-point selection as a reverse-chronological scan: candidates are run through the validation pipeline starting from the most recent one before the event boundary, and the approver's rationale is documented on acceptance.

The post is single-authored AWS architecture content with no partner co-marketing layer — every architectural disclosure is a first-party AWS reference design. Combined with the wiki's existing [[sources/2026-03-31-aws-streamlining-access-to-dr-capabilities|2026-03-31 streamlining-DR post]] (which canonicalised AWS Backup, AWS DRS, and the cross-Region/cross-account axes) and the 2026-01-30 sovereign-failover post (which canonicalised the DR ladder across partition boundaries), this post completes the wiki's coverage of AWS-native DR by adding the cyber-event-specific layer on top of the general DR primitives — the part where backups, credentials, and the source environment can no longer be trusted.

Key takeaways

  1. Cyber resilience is the recovery leg of a three-leg posture. The post opens with the explicit decomposition: "Cyber resilience is the ability to recover workloads to a known-good state after an adversary has affected the environment. Prevention works to keep threat actors out and detection works to find them quickly. Cyber resilience focuses on recovery: restoring a trustworthy environment when backups, credentials, or parts of the infrastructure can no longer be assumed to be safe." This gives the wiki a named posture distinct from generic DR (concepts/disaster-recovery-tiers) and distinct from prevention/detection — the whole architecture exists because the source environment is no longer trusted. Canonical wiki entry for concepts/cyber-resilience.

  2. Three-account isolation topology inside one AWS Organization. Recovery design uses separate AWS accounts as the trust boundary — "the recovery environment, including its identities, keys, and network paths, shouldn't share a trust boundary with the environment being recovered. If production identity is compromised, recovery must be able to proceed without depending on it." Three roles:

     • Production Accounts: where workloads run; isolated for investigation when a cyber event is confirmed ("Recovery work doesn't happen in production, because in some scenarios remediation in place may not fully restore trust.")
     • Recovery Account: owns the AWS Backup logically air-gapped vault and configures who can share it, who initiates a restore, and who approves a restore via Multi-party approval; SCPs restrict the account to backup operations so a compromised production identity can't modify these controls.
     • Isolated Recovery Environment (IRE): where backups are restored, validated, and the new production environment is rebuilt before cutover. "Has no trust relationship to the Production Account, no VPC peering to it, and no internet-facing resources, so a tainted restore discovered during validation stays contained inside the IRE instead of reaching back into production or out to the internet." Uses VPC endpoints (AWS PrivateLink) for AWS service APIs without internet or VPC peering.

Canonical wiki entries for concepts/isolated-recovery-environment and patterns/three-account-cyber-recovery-topology.
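
The "no trust relationship to the Production Account" property can be checked mechanically against IAM role trust policies in the IRE. The following is a minimal sketch, not something the post provides: it assumes trust policies are available as parsed IAM policy JSON, and the function names and account IDs are hypothetical.

```python
def trusted_accounts(trust_policy):
    """Return the account IDs (or "*") named as AWS principals in Allow statements."""
    accounts = set()
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        if principal == "*":
            accounts.add("*")
            continue
        aws = principal.get("AWS", [])
        if isinstance(aws, str):
            aws = [aws]
        for entry in aws:
            if entry.startswith("arn:"):
                # arn:partition:service:region:account-id:resource
                accounts.add(entry.split(":")[4])
            else:
                accounts.add(entry)  # bare account ID or "*"
    return accounts


def violates_isolation(trust_policy, production_account_ids):
    """True if the policy trusts any production account or a wildcard principal."""
    trusted = trusted_accounts(trust_policy)
    return "*" in trusted or bool(trusted & set(production_account_ids))


# Hypothetical IRE role that (wrongly) trusts a production account.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111111111111:role/app"},
        "Action": "sts:AssumeRole",
    }],
}
flagged = violates_isolation(example_policy, ["111111111111"])
```

A check like this only covers identity trust; the network side of the isolation (no VPC peering, no internet-facing resources) would need its own verification.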

  3. AWS Backup logically air-gapped vault as service-enforced deletion protection. "A logically air-gapped vault is always locked in Compliance mode. The service itself enforces retention, so recovery points can't be deleted by any principal, including the account root user or a compromised administrator, within the retention period." Key architectural properties:

     • Recovery points live in AWS service-owned accounts, not customer accounts. The vault object in the customer's Recovery Account is the governance and access boundary (sharing, restore authorization, MPA), but the actual data lives behind a service boundary the customer can't reach to delete. "This separation is what makes the air-gap logical rather than network-based." (Source: this post.)
     • Encryption choice: service-owned key or customer-managed KMS key.
     • Sharing via AWS RAM: restores can be initiated from the owning account or a shared account.
     • MPA via IAM Identity Center: predefined approvers required before restore; "particularly valuable when the source account might no longer be trusted."
     • Direct backup target for fully managed resources (S3, DynamoDB, EFS), with no staging in a standard vault first. Resources that are not fully managed (EBS, Aurora, FSx) use "intelligent orchestration" with temporary snapshots.
     • Out-of-vault fallback: "For S3 data outside the vault's supported resource set, Amazon S3 Object Lock in Compliance mode paired with S3 Versioning provides equivalent deletion protection at the S3 layer."

Canonical wiki entries for systems/aws-backup-logically-air-gapped-vault, systems/aws-multi-party-approval, systems/amazon-s3-object-lock, and concepts/service-enforced-deletion-protection.
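
As an illustration of what creating such a vault might look like, here is a hedged sketch built around the boto3 AWS Backup client method `create_logically_air_gapped_backup_vault`. The vault name, retention values, and helper names are assumptions for the example, not values from the post, and the call would need to run with credentials in the Recovery Account.

```python
import uuid


def air_gapped_vault_request(name, min_retention_days, max_retention_days, tags=None):
    """Build keyword arguments for create_logically_air_gapped_backup_vault."""
    if not 0 < min_retention_days <= max_retention_days:
        raise ValueError("need 0 < min_retention_days <= max_retention_days")
    return {
        "BackupVaultName": name,
        "MinRetentionDays": min_retention_days,
        "MaxRetentionDays": max_retention_days,
        "BackupVaultTags": dict(tags or {}),
        "CreatorRequestId": str(uuid.uuid4()),  # idempotency token for retries
    }


def create_vault(request):
    """Send the request. boto3 is third-party, so it is imported lazily here."""
    import boto3
    return boto3.client("backup").create_logically_air_gapped_backup_vault(**request)


# Hypothetical values: retention sized to cover the detection window (see takeaway on retention).
request = air_gapped_vault_request("cyber-recovery-vault", 30, 365, {"purpose": "cyber-recovery"})
```

Separating the request builder from the API call keeps the retention bounds reviewable and testable without AWS access. Multi-party approval and RAM sharing are configured separately and are not shown.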

  4. Validation = recoverable + safe. "A successful restore confirms that the backup was readable. Validation confirms that it's safe to use. No single check catches everything, which is why validation combines several layers." The five layers:

| Layer | Capability | What it provides |
|---|---|---|
| AWS native | AWS Backup Restore Testing | Automated verification that backups are recoverable, with custom hooks via the PutRestoreValidationResult API |
| AWS native | Amazon GuardDuty Malware Protection | Malware scanning on restored volumes |
| AWS Partner | AWS Marketplace partner solutions | Content-level ransomware scanning inside backup contents, without requiring a full restore first |
| Workload-specific | Integrity / consistency checks | Database consistency, application invariants, configuration diffs vs known-good baseline |
| Cross-cutting | Log / audit review | Identify unexpected identity / config changes across the backup window via CloudTrail + workload logs |

"Both AWS-native validation and workload-specific validation should pass before a recovery point is approved. Validation happens in the IRE so that if any check detects a problem, the affected restore is contained inside the IRE and doesn't reach production." The post is also explicit about recovery-point time skew as a real concern: "AWS backup mechanisms operate independently per service, so recovery points for different services might not be precisely time-synchronized. Aligning backup schedules as tightly as possible and including cross-service consistency checks in the validation pipeline reduces this gap." Canonical wiki entry for concepts/multi-layer-restore-validation-pipeline.

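The "every layer must pass" rule and the cross-service skew check can be sketched as a small pipeline. This is an illustrative model only: the candidate dictionary, its keys, and the 30-minute skew tolerance are assumptions, and in practice each flag would come from the corresponding scanner or restore test run inside the IRE.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CheckResult:
    layer: str
    passed: bool
    detail: str = ""


def flag_check(layer, key):
    """Wrap a precomputed boolean (the outcome of an external scan) as a check."""
    def check(candidate):
        return CheckResult(layer, bool(candidate.get(key, False)))
    return check


def skew_check(max_skew):
    """Fail if per-service recovery points are further apart than max_skew."""
    def check(candidate):
        times = list(candidate["service_timestamps"].values())
        skew = max(times) - min(times)
        return CheckResult("cross-service skew", skew <= max_skew, f"skew={skew}")
    return check


def validate(candidate, checks):
    """Run every layer (no short-circuit) so the approver sees all results."""
    results = [check(candidate) for check in checks]
    return all(r.passed for r in results), results


t0 = datetime(2026, 5, 1, 2, 0)
candidate = {
    "restore_test_ok": True,
    "malware_scan_clean": True,
    "service_timestamps": {"rds": t0, "efs": t0 + timedelta(minutes=20)},
}
checks = [
    flag_check("restore testing", "restore_test_ok"),
    flag_check("malware scan", "malware_scan_clean"),
    skew_check(timedelta(minutes=30)),
]
approved, results = validate(candidate, checks)
```

Running all layers instead of stopping at the first failure is a design choice here: it gives the Stage 3 approvers a complete record for the documented rationale.
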
  5. Recovery-point selection: reverse-chronological scan from before the event boundary. "For most operational recoveries, the most recent backup is the right one. For cyber events and for data corruption more generally, the most recent working copy is often a better target. If an adversary was present in the environment before detection, backups taken during that window might carry the same issues." The four-step selection algorithm:

     1. Build an investigation timeline from CloudTrail + VPC Flow Logs + GuardDuty + Security Hub + workload logs to identify the earliest plausible indicator of the event. That timestamp is the event boundary.
     2. Evaluate recovery-point candidates in reverse chronological order, starting from the most recent backup that predates the event window.
     3. Run the validation pipeline against each candidate. If validation fails, step back to the next candidate.
     4. Approve the chosen recovery point with documentation of the approver and rationale.

The post is also explicit on retention: "Backup retention should include recovery points that predate realistic detection windows in your organization. Detection timing varies widely by organization and by threat type, so this is a number to set based on your own investigation capabilities and to revisit as those mature." Canonical wiki entries for concepts/compromise-boundary-recovery-point-selection and patterns/event-boundary-driven-recovery-point-selection.
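
The selection steps can be sketched as a short loop. This is a minimal illustration under stated assumptions: recovery points are modelled as `(created_at, recovery_point_id)` tuples, `validate` stands in for the full validation pipeline, and all names and dates are hypothetical.

```python
from datetime import datetime


def select_recovery_point(recovery_points, event_boundary, validate):
    """Scan candidates newest-first, considering only those before the event boundary.

    Returns (chosen_id_or_None, rejected_ids) so the rationale can be documented.
    """
    candidates = sorted(
        (rp for rp in recovery_points if rp[0] < event_boundary),
        reverse=True,
    )
    rejected = []
    for created_at, rp_id in candidates:
        if validate(rp_id):
            return rp_id, rejected
        rejected.append(rp_id)  # step back to the next (older) candidate
    return None, rejected


# Daily backups on 1-5 May; the investigation puts the event boundary at noon on 4 May.
points = [(datetime(2026, 5, day), f"rp-{day}") for day in range(1, 6)]
boundary = datetime(2026, 5, 4, 12, 0)

# Pretend the pipeline rejects rp-4 (for example, a failed malware scan).
chosen, rejected = select_recovery_point(points, boundary, lambda rp_id: rp_id != "rp-4")
# chosen == "rp-3", rejected == ["rp-4"]
```

If the loop returns `None`, retention did not reach far enough back to include a clean candidate, which is the failure mode the retention guidance above is meant to prevent.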

  6. Five-stage recovery workflow with three stages running in parallel. "Recovery has five stages. Three of them run at the same time because the slowest path through recovery is what determines how long the business is down. Investigation and validation run in parallel with infrastructure rebuild so the new environment is being built while the recovery point is being chosen. We wait to restore data because restoring untrusted data into a new environment defeats the purpose of the validation." The stages:

     • Stage 1: Establish the timeline. Query CloudTrail / VPC Flow Logs / GuardDuty / Security Hub / workload logs for the earliest indicator. "AWS Security Incident Response (SIR) can provide coordinated triage and response support for this stage."
     • Stage 2: Validate candidates in reverse chronological order against the validation pipeline; runs in parallel with Stage 1 because investigation and validation checks don't depend on each other.
     • Stage 3: Approval (the gate). MPA approvers authorize; "the approval action is automatically recorded as an AWS CloudTrail management event." Document rationale: investigation findings + validation results + decision basis. "If validation fails on the chosen candidate, return to Stage 2 with an earlier one."
     • Stage 4: Rebuild and restore. Rebuild infrastructure in the IRE from IaC templates stored in a separate, version-controlled repository; runs in parallel with Stages 1 and 2. After Stage 3 approval, restore validated data from the vault into rebuilt infrastructure. Apply credential rotation during this stage following Rebuild-Restore-Rotate.
     • Stage 5: Cutover. Move production traffic via DNS records with health checks so traffic shifts only when the new environment is ready. "Before cutover, identify and update cross-account references that point to the original Production Account, IAM role trust policies, resource-based policies, AWS KMS key grants, and service integrations." IAM Access Analyzer and AWS Config help identify these. "Monitor the transition and keep the affected Production Account isolated until the investigation is complete."

Canonical wiki entries for concepts/parallel-recovery-stages and patterns/parallel-investigation-validation-rebuild and patterns/mpa-gated-restore-authorization.
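
The "slowest path determines downtime" argument is easy to see with a small timing model. The durations below are invented for illustration; the post discloses no timings. The model treats the Stage 4 rebuild as running alongside Stages 1 and 2, with approval, data restore, and cutover following in sequence.

```python
def recovery_duration(hours, parallel=True):
    """Total elapsed hours for the five-stage workflow under a simple model."""
    # Steps that can only start once the gate (Stage 3) is reached.
    tail = hours["approval"] + hours["restore"] + hours["cutover"]
    front = (hours["timeline"], hours["validation"], hours["rebuild"])
    if parallel:
        # Stages 1, 2 and the rebuild half of Stage 4 overlap: the slowest one dominates.
        return max(front) + tail
    # Fully sequential execution for comparison.
    return sum(front) + tail


# Hypothetical durations in hours, not taken from the post.
example = {
    "timeline": 6,    # Stage 1
    "validation": 8,  # Stage 2
    "rebuild": 5,     # Stage 4, infrastructure from IaC
    "approval": 1,    # Stage 3
    "restore": 4,     # Stage 4, data restore after approval
    "cutover": 1,     # Stage 5
}
parallel_hours = recovery_duration(example)                    # 8 + 6 = 14
sequential_hours = recovery_duration(example, parallel=False)  # 19 + 6 = 25
```

Under these assumed numbers, shortening the rebuild would not reduce downtime at all, because validation is the slowest of the three parallel paths.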

  7. The Rebuild-Restore-Rotate framework. "Cyber recovery requires sorting what gets rebuilt from code, restored from backup, and generated fresh: Infrastructure is code. Data is backup. Credentials are new."

| Category | Examples | Why |
|---|---|---|
| Rebuild from code | IAM policies + roles, Security Groups, EC2, VPC, Lambda, CI/CD pipeline definitions | Configurations come from reviewed, version-controlled templates rather than from a backup that may have been affected |
| Restore from backup | RDS, Aurora, EFS, EBS, FSx | Business data cannot be recreated from code and must come from validated, immutable backups |
| Rotate or re-issue | IAM access keys, database passwords, API keys, certificates, OAuth tokens, SSH keys | Any secret that may have been exposed during the event window is replaced, not carried forward from backup |

Sub-cases the post calls out explicitly:

  • Two-category services: "Some services sit across two categories. For example, Amazon S3 buckets and Amazon DynamoDB tables have both configuration (rebuilt from code) and data inside them (restored from backup), so recovery treats the two layers separately."
  • AWS-issued credentials: "Some credentials are re-issued by AWS rather than rotated by you. For example, consider service-linked roles and STS session tokens. The framework still applies, it's just AWS that issues them fresh."
  • Derived data stores skipped from backup: "Other data stores aren't backed up at all because they are derived from sources that are backed up. Search indexes, analytics tables, caches, and materialized views are common examples. These regenerate from restored data, so they are a recovery dependency rather than a separate recovery category but they must be included in the recovery runbook and sequenced after the data they depend on has been restored."
  • Upstream-source compromise caveat: "The framework assumes that your source of configuration, including IaC templates, pipelines, and source repositories, wasn't itself the target of the attack. If it was, recovery starts further upstream with a trusted copy of source before rebuild can begin." This is the highest-leverage caveat in the framework: the "trusted source of truth for code" is itself a recovery dependency.

Operational prerequisite for the rotate leg: "a rotation process that already exists and is exercised. AWS Secrets Manager rotation, IAM Identity Center session revocation, AWS Certificate Manager renewal, and workload-specific rotation hooks are components most customers already have in some form. The cyber recovery capability is the ability to invoke that rotation comprehensively and verify that nothing was missed."

Canonical wiki entries for concepts/rebuild-restore-rotate-framework and patterns/iac-rebuild-from-separate-version-control.
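
A runbook generator built on the framework might classify each resource into its treatment, with two-category services appearing in both the rebuild and restore lists and derived stores sequenced after their sources. The sketch below is an assumption-laden illustration: the resource-type labels and the mapping are hypothetical and deliberately incomplete.

```python
# Hypothetical resource-type labels mapped to treatments from the framework.
TREATMENTS = {
    "iam_role": {"rebuild"},
    "security_group": {"rebuild"},
    "vpc": {"rebuild"},
    "lambda_function": {"rebuild"},
    "rds_database": {"restore"},
    "efs_file_system": {"restore"},
    "ebs_volume": {"restore"},
    "s3_bucket": {"rebuild", "restore"},       # configuration from code, objects from backup
    "dynamodb_table": {"rebuild", "restore"},  # same two-layer treatment
    "iam_access_key": {"rotate"},
    "database_password": {"rotate"},
    "tls_certificate": {"rotate"},
}

# Derived stores are not a fourth category: they regenerate after restore.
DERIVED = {"search_index", "cache", "materialized_view"}


def recovery_plan(resource_types):
    """Group resource types by treatment; refuse unknown types so nothing is missed."""
    plan = {"rebuild": [], "restore": [], "rotate": [], "regenerate_after_restore": []}
    for rtype in resource_types:
        if rtype in DERIVED:
            plan["regenerate_after_restore"].append(rtype)
        elif rtype in TREATMENTS:
            for treatment in sorted(TREATMENTS[rtype]):
                plan[treatment].append(rtype)
        else:
            raise ValueError(f"no treatment defined for {rtype!r}")
    return plan


plan = recovery_plan(["vpc", "s3_bucket", "rds_database", "iam_access_key", "search_index"])
```

Raising on unknown resource types mirrors the post's point about the rotate leg: the capability lies in invoking the process comprehensively and verifying that nothing was missed.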

  8. IaC from separate version control. Implicit but load-bearing: the rebuild leg requires that IaC source live in a repository separate from the workload it deploys, with its own access controls, so a compromise of production credentials doesn't reach the source. The framework comment about "if [the IaC source] itself was the target of the attack, recovery starts further upstream with a trusted copy of source" is the explicit acknowledgment that the source repository is itself a recovery target in adversary modelling.

  9. Coverage gap fallback for unsupported services. "For services not currently supported by the logically air-gapped vault, Cross-Region Replication to a locked bucket or service-native point-in-time recovery can serve as interim options. These are recovery-oriented copies rather than tamper-proof storage and should be treated accordingly when designing around them." This is an honest acknowledgment that the vault doesn't cover every AWS service yet, and it frames the PITR / CRR-to-locked-bucket fallbacks as providing weaker tamper-resistance: recovery-oriented but not service-enforced.

  10. Seven-step starting checklist for teams building cyber recovery capability. Quoted in full below as the post's explicit operational checklist:

    1. "Create a logically air-gapped vault in a dedicated Recovery Account, and configure Multi-party approval for restore operations."
    2. "Establish an Isolated Recovery Environment in advance, with no trust relationship to production and no network path into the production environment. Pre-configure the networking, monitoring, and access controls required for recovery operations. Use SCPs to enforce isolation."
    3. "Enable AWS Backup Restore Testing on a regular schedule, and enable Amazon GuardDuty Malware Protection for backup and volume scanning."
    4. "Define workload-specific integrity checks for business-critical data (database consistency, application invariants, configuration diffs)."
    5. "Confirm the credential rotation process works end-to-end and can be invoked as part of recovery, not only on a routine schedule. AWS Secrets Manager rotation provides the automation framework for database passwords and API keys."
    6. "Map cross-account dependencies (IAM role trust policies, resource-based policies, AWS KMS key grants, and service integrations) and maintain the inventory in your recovery runbook."
    7. "Exercise the full workflow, including investigation, validation, rebuild, restore, and cutover, on a regular schedule."

    Item 7 is the load-bearing one: the workflow is only as useful as the drills it has been exercised in, because cyber events are rare and the muscle memory has to come from practice.
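
For checklist item 6, one simple way to seed the cross-account dependency inventory is to search exported policy documents for the original Production Account ID. The sketch below assumes policies and grants have already been exported as parsed JSON; the function name, account IDs, and example policy are hypothetical, and IAM Access Analyzer and AWS Config remain the tools the post points to.

```python
def find_account_references(document, account_id, path="$"):
    """Recursively list (json_path, value) pairs whose string value mentions account_id."""
    hits = []
    if isinstance(document, dict):
        for key, value in document.items():
            hits += find_account_references(value, account_id, f"{path}.{key}")
    elif isinstance(document, list):
        for index, value in enumerate(document):
            hits += find_account_references(value, account_id, f"{path}[{index}]")
    elif isinstance(document, str) and account_id in document:
        hits.append((path, document))
    return hits


# Hypothetical resource-based policy that still points at the original production account.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111111111111:root"},
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::shared-reports/*",
    }],
}
references = find_account_references(bucket_policy, "111111111111")
```

Each hit carries the JSON path of the reference, which can go straight into the runbook as an item to update before cutover.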

Architectural patterns / numbers

  • Three-account topology: Production Accounts (where workloads run, isolated on confirmed event); Recovery Account (vault owner + governance); IRE (rebuild + validation surface, no trust to production, no internet, VPC endpoints only).
  • Compliance-mode vault lock: service-enforced retention; no principal can shorten or delete within retention period.
  • MPA: predefined approvers via IAM Identity Center; CloudTrail management event records approval.
  • Five validation layers: AWS Backup Restore Testing + GuardDuty Malware Protection + AWS Marketplace partner content scanners + workload-specific checks + log/audit review.
  • Five recovery stages: Stages 1, 2, 4 run in parallel; Stage 3 approval is the gate; Stage 5 cutover via DNS health checks.
  • Rebuild-Restore-Rotate: code / data / credentials → three treatment categories.

Caveats and open questions

  • No quantified production numbers. The post is a reference architecture without disclosed customer-deployment metrics (no MTTR numbers, no validation-pipeline-pass rate, no cost-of-vault-storage numbers).
  • Vault service coverage gap is acknowledged but not enumerated. "For services not currently supported by the logically air-gapped vault" — the post does not list which services are unsupported as of 2026-05-20.
  • Trusted-IaC-source assumption. The framework explicitly assumes the IaC repository wasn't the attack target, but the post does not prescribe how to architect that source's protection; it only acknowledges its centrality. "Knowing where your known-good source of configuration lives, and how it is protected, is worth thinking through in advance."
  • Recovery-point time skew across services. Acknowledged as a real issue ("recovery points for different services might not be precisely time-synchronized") with two mitigations (align schedules + cross-service consistency checks) but no quantified bound on the residual skew.
  • Detection-window-vs-retention-period sizing. The post explicitly defers retention sizing to the customer ("a number to set based on your own investigation capabilities and to revisit as those mature") — no industry-typical number disclosed.

Cross-source continuity

  • Direct sequel to 2026-03-31 streamlining-DR: the 2026-03-31 post canonicalised AWS Backup, AWS DRS, the cross-Region/cross-account axes, and the clean-room recovery account framing as general DR; this 2026-05-20 post extends those primitives with the cyber-event-specific layer (logically air-gapped vault as the primary backup target, MPA-gated restore authorization, IRE as a third account, Rebuild-Restore-Rotate as the recovery framework, parallel five-stage workflow). Same architectural lineage; this post is the cyber-resilience-specific deepening.
  • Sibling at the multi-account-as-isolation-boundary axis to 2026-05-11 single-vs-multiple Organizations — both treat the AWS account boundary as a load-bearing isolation primitive, but at different altitudes (organizational structure vs cyber-recovery topology) and with different drivers (regulatory / billing vs ransomware / compromise isolation).
  • Sibling at the IaC-as-foundation-for-rebuild axis to 2026-02-25 6000 accounts — both rely on IaC-driven account provisioning + centralised governance via SCPs as the substrate for the architecture; the cyber-resilience post is the destructive-event use case of the same IaC-everywhere posture.
  • Sibling at the concepts/blast-radius axis to numerous posts in the corpus — but this is the wiki's first treatment of blast radius applied to the post-compromise-recovery direction: containment of the recovery itself so a tainted restore stays contained inside the IRE.
  • Sibling at the concepts/clean-room-recovery-account axis to the 2026-03-31 post that named the concept — this post adds the third account (IRE) as the rebuild/validate surface on top of the clean-room recovery account.
