CONCEPT Cited by 1 source

Parallel recovery stages

Definition

Parallel recovery stages is the recovery-workflow design principle that independent stages run concurrently to compress the critical path of a cyber-event recovery. The principle's load-bearing observation:

"Three of them run at the same time because the slowest path through recovery is what determines how long the business is down. Investigation and validation run in parallel with infrastructure rebuild so the new environment is being built while the recovery point is being chosen. We wait to restore data because restoring untrusted data into a new environment defeats the purpose of the validation." (Source: sources/2026-05-20-aws-cyber-resilience-on-aws-a-reference-approach-for-recovery-from-ransomware-and-destructive-events)

The five-stage workflow

The canonicalising source's reference workflow has five named stages:

| Stage | Name | Parallel? | What happens |
|---|---|---|---|
| 1 | Establish the timeline | Yes (with 2, 4) | Build investigation timeline from CloudTrail / VPC Flow Logs / GuardDuty / Security Hub / workload logs; identify earliest indicator → event boundary |
| 2 | Validate candidates | Yes (with 1, 4) | Run validation pipeline against candidates in reverse chronological order |
| 3 | Approval | Sequential (gate) | MPA approvers authorize chosen recovery point; rationale documented |
| 4 | Rebuild and restore | Rebuild: yes (with 1, 2); restore: sequential (after Stage 3) | Rebuild infrastructure in IRE from IaC; restore validated data after approval |
| 5 | Cutover | Sequential (after 4) | Move production traffic via DNS health checks; update cross-account references |

Why three stages can run in parallel

The key observation is independence:

  • Stage 1 (timeline) depends on log data only — independent of rebuilding infrastructure.
  • Stage 2 (validation) depends on candidate restoration into the IRE — independent of timeline construction.
  • Stage 4 rebuild leg depends on IaC templates only — independent of which recovery point is chosen, and independent of the timeline.

These three streams have no data dependencies until Stage 3 (the approval gate) — so running them in parallel doesn't change the correctness of the workflow, only the wall-clock time.
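The independence argument above can be sketched with `asyncio`. This is a minimal illustration, not the source's implementation: `run_stage`, the stage names, and the sleep durations are hypothetical stand-ins for real investigation, validation, and rebuild work.

```python
import asyncio

async def run_stage(name: str, seconds: float, log: list[str]) -> str:
    """Hypothetical stand-in for a real stage; sleeps to simulate work."""
    log.append(f"start {name}")
    await asyncio.sleep(seconds)
    log.append(f"done {name}")
    return name

async def pre_approval(log: list[str]) -> list[str]:
    # Stage 1, Stage 2, and the rebuild leg of Stage 4 share no inputs,
    # so they can be awaited together instead of one after another.
    return await asyncio.gather(
        run_stage("timeline", 0.03, log),
        run_stage("validation", 0.02, log),
        run_stage("rebuild", 0.01, log),
    )

log: list[str] = []
finished = asyncio.run(pre_approval(log))
print(finished)  # results come back in the order the stages were passed in
print(log)       # all three "start" entries appear before any "done" entry
```

All three stages start before any of them finishes, so the elapsed time is set by the slowest stage rather than by the sum of the three.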

The data-restore wait: why it can't be parallelised earlier

The canonicalising source is explicit about why data restore waits until after Stage 3:

"We wait to restore data because restoring untrusted data into a new environment defeats the purpose of the validation."

If data restore happened in parallel with rebuild, the rebuilt IRE would contain potentially-tainted data before validation has approved a clean recovery point. The approval gate is precisely the "this data is now trusted enough to put in the rebuilt environment" checkpoint.

This means rebuild can run in parallel with validation (because rebuild doesn't depend on which recovery point (RP) is chosen) but restore cannot (because restore depends on the chosen RP and the validation-pass result).
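One way to see why rebuild can start early while restore cannot is to write the dependencies down and let a topological sort group the steps. The sketch below uses Python's standard `graphlib`; the step names and the exact edges in `DEPENDS_ON` are an assumed reading of the workflow described here, not something the source specifies.

```python
from graphlib import TopologicalSorter

# Each key lists the steps it must wait for (assumed edges, for illustration).
DEPENDS_ON = {
    "timeline": set(),
    "validation": set(),
    "rebuild": set(),                        # needs IaC templates only
    "approval": {"timeline", "validation"},  # gate: chooses the recovery point
    "restore": {"approval", "rebuild"},      # needs a trusted RP and somewhere to put it
    "cutover": {"restore"},
}

def waves(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into waves; steps in the same wave can run concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result

print(waves(DEPENDS_ON))
# [['rebuild', 'timeline', 'validation'], ['approval'], ['restore'], ['cutover']]
```

Under these assumed edges, rebuild lands in the first wave alongside timeline and validation, while restore cannot be scheduled until approval has completed.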

Stage 3 as the synchronisation point

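Stage 3 (Approval) is the only gate in the workflow. The three streams that precede it (timeline, validation, and the rebuild leg of Stage 4) run concurrently; everything after it depends on the approved recovery point that the gate produces, together with the infrastructure the rebuild leg has already built.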

This makes the approval the rate-limiting human-in-the-loop step — and is why MPA approvers are pre-defined, not selected at incident time. The workflow assumes approvers can be reached quickly; the MPA design via IAM Identity Center supports this.

Stage 5: DNS health checks as the traffic-cutover primitive

The cutover stage uses Route 53 health checks as the readiness gate for shifting traffic:

"Use DNS records with health checks so traffic only shifts when the new environment is ready to serve it."

Composes with patterns/dns-health-check-cutover — DNS-based traffic shifting with health checks as the readiness gate.
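The readiness-gate idea can be sketched independently of any DNS provider. The function below is a hypothetical illustration: `ready_to_shift` and the "three consecutive successes" threshold are assumptions made for this example, not values taken from the source or from Route 53 defaults.

```python
from typing import Iterable

def ready_to_shift(probes: Iterable[bool], required_successes: int = 3) -> bool:
    """Return True once `required_successes` consecutive probes have passed."""
    streak = 0
    for healthy in probes:
        # A failed probe resets the streak, so a flapping endpoint never qualifies.
        streak = streak + 1 if healthy else 0
        if streak >= required_successes:
            return True
    return False

print(ready_to_shift([True, True, False, True, True, True]))  # True
print(ready_to_shift([True, False, True, True]))              # False
```

Requiring a run of consecutive successes, rather than a single passing probe, is one common way to avoid shifting traffic to an environment that is only intermittently healthy.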

Stage 5: cross-account-reference update is part of cutover

A subtle architectural point in Stage 5: "Before cutover, identify and update cross-account references that point to the original Production Account, IAM role trust policies, resource-based policies, AWS KMS key grants, and service integrations."

Cross-account references (IAM trust policies in other accounts that trust the old Production Account ID, resource policies on shared resources, KMS key grants) won't automatically point to the new Production Account after rebuild. They have to be enumerated and updated as part of cutover, before traffic shifts.

The canonicalising source recommends IAM Access Analyzer + AWS Config for the inventory.
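As a rough illustration of what the inventory has to find, the sketch below scans a policy document that has already been exported as JSON for mentions of the old account ID. It does not replace IAM Access Analyzer or AWS Config; `find_account_references`, the account IDs, and the sample policy are all hypothetical.

```python
import json

def find_account_references(policy: dict, account_id: str) -> list[str]:
    """Return the Sid (or a positional label) of each statement mentioning account_id."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a policy may hold one bare statement object
        statements = [statements]
    hits = []
    for index, statement in enumerate(statements):
        # Serialising the statement catches the ID in principals,
        # resources, and conditions alike.
        if account_id in json.dumps(statement):
            hits.append(statement.get("Sid", f"statement[{index}]"))
    return hits

OLD_ACCOUNT = "111111111111"  # placeholder for the original Production Account ID
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "TrustOldProd", "Effect": "Allow",
         "Principal": {"AWS": "arn:aws:iam::111111111111:root"},
         "Action": "sts:AssumeRole"},
        {"Sid": "TrustCI", "Effect": "Allow",
         "Principal": {"AWS": "arn:aws:iam::222222222222:role/ci"},
         "Action": "sts:AssumeRole"},
    ],
}
print(find_account_references(trust_policy, OLD_ACCOUNT))  # ['TrustOldProd']
```

Each hit is a statement that would still trust or reference the old account after the rebuild, and so needs updating to the new Production Account.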

Critical-path arithmetic

Without parallel stages:

Total = T_timeline + T_validation + T_approval + T_rebuild + T_restore + T_cutover

With parallel stages 1+2+4:

Total = max(T_timeline, T_validation, T_rebuild) + T_approval + T_restore + T_cutover

If timeline, validation, and rebuild each take roughly the same time T (measured in hours), the sequential design spends 3T on them and the parallel design spends T. The parallel design therefore removes roughly 2T from the critical path: a multi-hour reduction in MTTR for a typical cyber-event recovery.
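The two totals can be checked with a few lines of arithmetic. The durations in `HOURS` are made-up numbers chosen only to make the comparison concrete; they are not measurements from the source.

```python
# Hypothetical stage durations in hours (illustrative only).
HOURS = {
    "timeline": 4, "validation": 4, "rebuild": 4,
    "approval": 1, "restore": 2, "cutover": 1,
}

def sequential_total(t: dict[str, float]) -> float:
    """Every stage runs one after another."""
    return sum(t.values())

def parallel_total(t: dict[str, float]) -> float:
    """Timeline, validation, and rebuild overlap; the rest stay sequential."""
    overlapped = max(t["timeline"], t["validation"], t["rebuild"])
    return overlapped + t["approval"] + t["restore"] + t["cutover"]

print(sequential_total(HOURS))                          # 16
print(parallel_total(HOURS))                            # 8
print(sequential_total(HOURS) - parallel_total(HOURS))  # 8
```

With three equal 4-hour stages, the saving is 8 hours, which is twice the per-stage time.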

Generalisation: the recovery-workflow shape

This pattern generalises beyond AWS to any recovery workflow where:

  1. Some stages are independent (don't share data dependencies).
  2. A human approval gate exists in the middle.
  3. The post-approval stages depend on multiple parallel-stage outputs (rebuild + chosen RP + validation result).

GCP / Azure / on-prem cyber-recovery workflows apply the same parallel-stages-around-a-gate shape with their respective primitives.
