PATTERN Cited by 1 source
Parallel investigation / validation / rebuild¶
Pattern¶
Run three recovery activities in parallel (investigation timeline construction, recovery-point validation, and infrastructure rebuild), then gate the data restore behind a single multi-party approval (MPA) step. The pattern shortens the critical path of a cyber-event recovery by exploiting the independence of the three streams.
The full five-stage workflow:
Stage 1 (timeline) ──────────┐
                             │
Stage 2 (validate) ──────────┼──► Stage 3 (MPA approval) ──► Stage 4 restore ──► Stage 5 (cutover)
                             │    [SYNC GATE]                data                DNS health checks
Stage 4 rebuild from IaC ────┘                                                   + cross-account refs
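A minimal sketch of the workflow shape, assuming each stage can be modelled as an async task. The stage names, sleep durations, and the `approve` callback are hypothetical stand-ins, not part of the source; a real orchestration would call investigation tooling, the validation pipeline, and IaC deployment instead of sleeping.

```python
import asyncio
from typing import Callable, List


async def run_stage(name: str, seconds: float, log: List[str]) -> None:
    """Stand-in for a real workstream: wait, then record completion."""
    await asyncio.sleep(seconds)
    log.append(name)


async def recover(approve: Callable[[], bool]) -> List[str]:
    log: List[str] = []
    # Stages 1 and 2 and the rebuild leg of Stage 4 run concurrently.
    await asyncio.gather(
        run_stage("stage1_timeline", 0.03, log),
        run_stage("stage2_validate", 0.02, log),
        run_stage("stage4_rebuild", 0.01, log),
    )
    # Stage 3 is the sync gate: nothing below runs without approval.
    if not approve():
        log.append("stage3_rejected")
        return log
    log.append("stage3_approved")
    # Restore and cutover are strictly sequential after approval.
    await run_stage("stage4_restore", 0.01, log)
    await run_stage("stage5_cutover", 0.01, log)
    return log


order = asyncio.run(recover(approve=lambda: True))
rejected = asyncio.run(recover(approve=lambda: False))
print(order)
```

The three parallel stages may finish in any order, but the restore and cutover entries can only ever appear after the approval entry.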
Verbatim from the canonicalising source:
"Recovery has five stages. Three of them run at the same time because the slowest path through recovery is what determines how long the business is down. Investigation and validation run in parallel with infrastructure rebuild so the new environment is being built while the recovery point is being chosen." (Source: sources/2026-05-20-aws-cyber-resilience-on-aws-a-reference-approach-for-recovery-from-ransomware-and-destructive-events)
The five stages in detail¶
Stage 1 — Establish the timeline¶
Runs in parallel with Stages 2 and 4. Build the investigation timeline by querying:
- CloudTrail — control-plane API activity, identity changes, IAM modifications.
- VPC Flow Logs — data-plane network activity, lateral movement, unusual egress.
- GuardDuty findings — threat-detection signals.
- Security Hub — aggregated finding view.
- Workload logs — application-altitude indicators.
Outputs the event boundary — the timestamp of the earliest plausible indicator. "AWS Security Incident Response (SIR) can provide coordinated triage and response support for this stage."
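As an illustration of the Stage 1 output, the event boundary can be thought of as the minimum timestamp over the indicators judged plausible. The sketch below assumes indicators have already been collected from the log sources and triaged by an analyst; the `Indicator` type, its fields, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Indicator:
    source: str            # e.g. "cloudtrail", "guardduty" (labels only)
    observed_at: datetime  # when the activity happened
    plausible: bool        # analyst judgement: linked to the event?


def event_boundary(indicators: List[Indicator]) -> Optional[datetime]:
    """Return the timestamp of the earliest plausible indicator, if any."""
    plausible = [i.observed_at for i in indicators if i.plausible]
    return min(plausible) if plausible else None


utc = timezone.utc
indicators = [
    Indicator("cloudtrail", datetime(2026, 3, 2, 9, 15, tzinfo=utc), True),
    Indicator("guardduty", datetime(2026, 3, 1, 22, 40, tzinfo=utc), True),
    # Earlier in time, but judged benign, so it does not move the boundary.
    Indicator("vpc-flow-logs", datetime(2026, 2, 27, 3, 0, tzinfo=utc), False),
]
boundary = event_boundary(indicators)
print(boundary.isoformat())
```

The boundary moves earlier whenever a new plausible indicator predates the current minimum, which is why Stage 2 may later have to discard candidates it has already validated.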
Stage 2 — Validate candidates¶
Runs in parallel with Stages 1 and 4. Run the validation pipeline against recovery-point candidates in reverse chronological order starting from the most recent backup that predates the event window.
The pipeline runs inside the isolated recovery environment (IRE) so that "if any check detects a problem, the affected restore is contained inside the IRE and doesn't reach production."
If the most recent candidate fails validation, step back to the next most recent one (reverse-chronological selection). Stage 2 does not block on Stage 1, because the validation pipeline checks (malware scan, integrity check, audit review) do not require the timeline to have been finalised.
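The step-back rule can be sketched as a loop over candidates sorted newest-first. This assumes candidates are identified by their backup timestamp and that `validate` wraps the whole pipeline into a single pass/fail result; both are simplifications for illustration.

```python
from datetime import datetime
from typing import Callable, List, Optional


def select_recovery_point(
    candidates: List[datetime],
    boundary: datetime,
    validate: Callable[[datetime], bool],
) -> Optional[datetime]:
    """Return the newest candidate that predates the boundary and validates."""
    eligible = sorted((c for c in candidates if c < boundary), reverse=True)
    for candidate in eligible:
        if validate(candidate):
            return candidate
        # Validation failed: step back to the next older candidate.
    return None


backups = [datetime(2026, 2, d) for d in (26, 27, 28)] + [
    datetime(2026, 3, 1),
    datetime(2026, 3, 2),
]
boundary = datetime(2026, 3, 1, 22, 40)
# Hypothetical outcome: the 1 March backup fails a check, older ones pass.
failing = {datetime(2026, 3, 1)}
chosen = select_recovery_point(backups, boundary, lambda c: c not in failing)
print(chosen)
```

Returning `None` when every eligible candidate fails makes the "keeps stepping back" failure mode explicit instead of silently falling through to an unvalidated backup.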
Stage 3 — Approval (the SYNC GATE)¶
Sequential. The MPA approvers authorise a specific recovery point with documented rationale (investigation findings + validation results + decision basis). The approval is recorded as a CloudTrail management event.
This is the only synchronisation point in the workflow — Stages 1, 2, and 4 must all have produced enough output to make an approval decision possible.
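A sketch of the gate as a data structure, under stated assumptions: the approval threshold of two, the field names, and the `ApprovalRequest` class are hypothetical and do not model any AWS API. It only shows that authorisation needs both a complete rationale and enough distinct approvers.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ApprovalRequest:
    recovery_point_id: str
    investigation_findings: str
    validation_results: str
    decision_basis: str
    required_approvals: int = 2  # assumed threshold, not from the source
    approvers: Set[str] = field(default_factory=set)

    def approve(self, approver: str) -> None:
        # A set means repeated approvals by one person count once.
        self.approvers.add(approver)

    def is_authorised(self) -> bool:
        rationale_complete = all(
            text.strip()
            for text in (
                self.investigation_findings,
                self.validation_results,
                self.decision_basis,
            )
        )
        return rationale_complete and len(self.approvers) >= self.required_approvals


request = ApprovalRequest(
    recovery_point_id="rp-example",
    investigation_findings="Earliest indicator 2026-03-01 22:40 UTC",
    validation_results="Malware scan, integrity check, audit review passed",
    decision_basis="Newest validated candidate predating the boundary",
)
request.approve("alice")
request.approve("alice")  # duplicate, ignored
single = request.is_authorised()
request.approve("bob")
double = request.is_authorised()
print(single, double)
```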
Stage 4 — Rebuild and restore¶
The rebuild leg runs in parallel with Stages 1 and 2, because IaC templates don't depend on which recovery point (RP) is chosen. The restore leg runs sequentially after Stage 3 approval.
Verbatim:
"Rebuild infrastructure in the IRE from infrastructure as code (IaC) templates stored in a separate, version-controlled repository. Rebuild runs in parallel with Stages 1 and 2. After Stage 3 approves a recovery point, restore the validated data from the logically air-gapped vault into the rebuilt infrastructure. Apply credential rotation during this stage following the Rebuild-Restore-Rotate framework."
The Rotate leg of Rebuild-Restore-Rotate also happens here — every secret in the rebuilt environment is rotated/re-issued.
Stage 5 — Cutover¶
Sequential after Stage 4. Use DNS records with health checks so traffic only shifts when the new environment is ready.
Critical: update cross-account references before cutover — "identify and update cross-account references that point to the original Production Account, IAM role trust policies, resource-based policies, AWS KMS key grants, and service integrations." IAM Access Analyzer + AWS Config help identify these.
After cutover, "Monitor the transition and keep the affected Production Account isolated until the investigation is complete."
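The cross-account reference update before cutover can be partly pre-checked offline. The sketch below walks policy documents that have already been exported as JSON and reports every string value mentioning the original account ID. It is an illustrative helper, not a substitute for IAM Access Analyzer or AWS Config; the account IDs and the sample policy are made up.

```python
from typing import Any, List, Tuple


def find_account_refs(node: Any, account_id: str, path: str = "$") -> List[Tuple[str, str]]:
    """Recursively collect (json_path, value) pairs that mention account_id."""
    hits: List[Tuple[str, str]] = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_account_refs(value, account_id, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits += find_account_refs(value, account_id, f"{path}[{index}]")
    elif isinstance(node, str) and account_id in node:
        hits.append((path, node))
    return hits


ORIGINAL_ACCOUNT = "111111111111"  # placeholder for the affected account
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::111111111111:role/app",
                    "arn:aws:iam::222222222222:role/other",
                ]
            },
            "Action": "sts:AssumeRole",
        }
    ],
}
refs = find_account_refs(trust_policy, ORIGINAL_ACCOUNT)
for json_path, value in refs:
    print(json_path, value)
```

Because the walk is structure-agnostic, the same function can be pointed at exported resource-based policies or key grant listings, as long as they are plain JSON.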
The independence argument¶
The three parallel streams are structurally independent:
| Stream | Inputs | Outputs | Independent of |
|---|---|---|---|
| Stage 1 timeline | CloudTrail / VPC Flow Logs / GuardDuty / Security Hub / workload logs | Event boundary timestamp | Stages 2, 4 |
| Stage 2 validate | Recovery point candidates from vault | Validated candidate(s) | Stages 1, 4 (but uses event boundary from Stage 1 to filter candidates) |
| Stage 4 rebuild | IaC templates from separate repository | Rebuilt empty IRE infrastructure | Stages 1, 2 |
Stage 2 has a soft dependency on Stage 1's event boundary: the candidates Stage 2 validates are the ones predating the boundary. The dependency is loose, though. Stage 2 can validate the most recent candidates first while Stage 1 narrows the boundary, then discard already-validated candidates that turn out to post-date the finalised boundary.
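The "validate early, discard later" behaviour can be sketched as a filter applied once the boundary is final. The dictionary of validation results and the timestamps are hypothetical; the point is only that a passing result is necessary but not sufficient.

```python
from datetime import datetime
from typing import Dict, List


def usable_candidates(
    validated: Dict[datetime, bool], final_boundary: datetime
) -> List[datetime]:
    """Keep candidates that passed validation AND predate the final boundary."""
    return sorted(
        (ts for ts, passed in validated.items() if passed and ts < final_boundary),
        reverse=True,
    )


# Results gathered by Stage 2 while Stage 1 was still working.
validated = {
    datetime(2026, 3, 2): True,    # passed, but will post-date the boundary
    datetime(2026, 3, 1): True,    # passed, but will post-date the boundary
    datetime(2026, 2, 28): True,
    datetime(2026, 2, 27): False,  # failed validation
}
final_boundary = datetime(2026, 2, 28, 18, 0)
usable = usable_candidates(validated, final_boundary)
print(usable)
```

The validation effort spent on the discarded candidates is the cost of not waiting for Stage 1; it is paid off the critical path.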
Why restore can't run in parallel with rebuild¶
A subtle but load-bearing architectural point:
"We wait to restore data because restoring untrusted data into a new environment defeats the purpose of the validation."
If restore happened in parallel with rebuild, the rebuilt IRE would contain potentially-tainted data before validation has approved a clean recovery point. The MPA gate is precisely the "this data is now trusted enough to put in the rebuilt environment" checkpoint — restore comes after.
This means rebuild can be parallelised with validation (because rebuild doesn't depend on which RP is chosen) but restore cannot (because restore depends on the chosen RP and the validation-pass result).
Critical-path arithmetic¶
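Without parallel stages, every stage sits on the critical path:
T_sequential = timeline + validation + rebuild + approval + restore + cutover
With Stages 1, 2, and the Stage 4 rebuild leg in parallel, only the slowest of the three counts:
T_parallel = max(timeline, validation, rebuild) + approval + restore + cutover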
If timeline ≈ validation ≈ rebuild (typically hours each), the parallel design saves roughly 2× the per-stage time on the critical path — multi-hour MTTR reduction.
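The saving can be checked with a small calculation. The durations below are invented for illustration (hours), with the three parallel stages set equal to match the approximation above.

```python
from typing import Dict

# Stages that overlap in the parallel design.
PARALLEL = ("timeline", "validation", "rebuild")


def sequential_hours(durations: Dict[str, float]) -> float:
    """Every stage runs one after another."""
    return sum(durations.values())


def parallel_hours(durations: Dict[str, float]) -> float:
    """The slowest parallel stage plus all strictly sequential stages."""
    overlapped = max(durations[stage] for stage in PARALLEL)
    serial = sum(h for stage, h in durations.items() if stage not in PARALLEL)
    return overlapped + serial


durations = {
    "timeline": 4.0,
    "validation": 4.0,
    "rebuild": 4.0,
    "approval": 1.0,
    "restore": 2.0,
    "cutover": 1.0,
}
saving = sequential_hours(durations) - parallel_hours(durations)
print(sequential_hours(durations), parallel_hours(durations), saving)
```

With these numbers the sequential path is 16 hours, the parallel path is 8, and the saving of 8 hours is twice the 4-hour per-stage time. When the three stages are unequal, the saving is the sum of the two shorter ones.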
When to use this pattern¶
Use this pattern when:
- The workload's MTTR target is hours, not days.
- The team has the capacity to run three workstreams concurrently during incident time (different responders for timeline / validation / rebuild).
- The IaC source-of-truth is robust enough to support rebuild in parallel with investigation (no human review of IaC during recovery).
Weaker fit when:
- The team is too small to run streams in parallel (incident response is a single thread).
- The IaC source has its own integrity questions that need to be resolved before rebuild can proceed safely.
- Validation pipeline output requires manual interpretation that forces validation to run in sequence with timeline construction.
Composition with other patterns¶
- patterns/three-account-cyber-recovery-topology — the workflow runs across the three accounts.
- patterns/mpa-gated-restore-authorization — Stage 3 is this pattern's synchronisation gate.
- patterns/event-boundary-driven-recovery-point-selection — the algorithm Stage 1+2 jointly execute.
- patterns/iac-rebuild-from-separate-version-control — Stage 4 rebuild leg.
- patterns/dns-health-check-cutover — Stage 5 cutover primitive.
Failure modes¶
- Parallel streams interfere. Stage 2's candidate-restore activity in IRE shares quotas with Stage 4's rebuild activity. Mitigation: pre-provisioned IRE capacity; per-stream quota reservations.
- MPA gate stalls. Approvers unreachable; parallel work piles up with no path forward. Mitigation: multi-geo approver pool; escalation procedures.
- Stage 2 keeps stepping back. Validation fails for many candidates; recovery target is far older than expected. Mitigation: communicate longer-than-expected RPO to business stakeholders; design retention generously.
- Stage 5 cutover discovers cross-account refs. Update list wasn't pre-inventoried; cutover stalls. Mitigation: maintain Access-Analyzer-driven inventory in runbook.
- Coordination overhead. Three parallel streams require coordination that the team can't sustain. Mitigation: clear role assignments; documented runbook; periodic drill exercises.
Generalisation beyond AWS¶
The pattern applies wherever:
- Multiple independent recovery activities can run concurrently.
- A human approval gate exists in the middle.
- Post-approval activities depend on multiple parallel-stage outputs.
GCP / Azure / on-prem cyber-recovery workflows apply this shape with their respective primitives.
Seen in¶
- sources/2026-05-20-aws-cyber-resilience-on-aws-a-reference-approach-for-recovery-from-ransomware-and-destructive-events — canonical wiki reference; explicit five-stage workflow with Stages 1+2+4 rebuild in parallel; explicit reasoning for restore-after-approval; DNS-health-check cutover; cross-account-reference update at cutover.
Related¶
- concepts/cyber-resilience — the parent posture.
- concepts/parallel-recovery-stages — the concept this pattern canonicalises.
- concepts/multi-layer-restore-validation-pipeline — Stage 2 substance.
- concepts/compromise-boundary-recovery-point-selection — what Stage 1+2 jointly produce.
- concepts/rebuild-restore-rotate-framework — Stage 4 framework.
- patterns/three-account-cyber-recovery-topology — the topology this workflow runs across.
- patterns/mpa-gated-restore-authorization — the Stage 3 gate.
- patterns/dns-health-check-cutover — the Stage 5 primitive.
- systems/aws-cloudtrail, systems/amazon-vpc-flow-logs — Stage 1 substrates.