PATTERN Cited by 1 source
Event-boundary-driven recovery point selection¶
Pattern¶
When recovering from a confirmed cyber event, don't pick the most recent backup. Pick the most recent backup before the event boundary that also passes the validation pipeline. Walk reverse-chronologically from the most recent pre-boundary candidate, validate each, and step back if validation fails.
The four-step procedure:
1. Build an investigation timeline from CloudTrail, VPC Flow Logs, GuardDuty, Security Hub, and workload logs to identify the earliest plausible indicator, which becomes the event boundary.
2. Evaluate candidates in reverse chronological order, starting from the most recent backup that predates the boundary.
3. Run the validation pipeline against each candidate. If validation fails, step back to the next candidate.
4. Approve the chosen recovery point with documentation of approver and rationale (typically via multi-party approval, MPA).
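The selection loop in the procedure above can be sketched in a few lines. This is a minimal illustration, not the source's implementation: `RecoveryPoint`, `select_recovery_point`, and the `validate` callback are hypothetical names, and the validation pipeline is reduced to a function that returns pass or fail.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class RecoveryPoint:
    # Hypothetical record; real backup services expose richer metadata.
    rp_id: str
    created_at: datetime


def select_recovery_point(
    candidates: Iterable[RecoveryPoint],
    event_boundary: datetime,
    validate: Callable[[RecoveryPoint], bool],
) -> Optional[RecoveryPoint]:
    """Return the newest pre-boundary candidate that passes validation."""
    pre_boundary = [c for c in candidates if c.created_at < event_boundary]
    # Newest first: this is the reverse-chronological walk.
    for candidate in sorted(pre_boundary, key=lambda c: c.created_at, reverse=True):
        if validate(candidate):
            return candidate
        # Validation failed: fall through to the next older candidate.
    return None  # Nothing retained is both pre-boundary and valid.


# Illustrative data only.
backups = [
    RecoveryPoint("rp-01", datetime(2026, 5, 1)),
    RecoveryPoint("rp-02", datetime(2026, 5, 2)),
    RecoveryPoint("rp-03", datetime(2026, 5, 3)),
    RecoveryPoint("rp-04", datetime(2026, 5, 4)),
]
boundary = datetime(2026, 5, 3, 12, 0)
# Pretend rp-03 fails validation (for example, a malware scan hit).
chosen = select_recovery_point(backups, boundary, lambda rp: rp.rp_id != "rp-03")
print(chosen.rp_id if chosen else "no valid candidate")  # rp-02
```

Here rp-04 is excluded because it postdates the boundary, rp-03 is rejected by validation, and rp-02 is returned. The approval step is deliberately left outside the function, since it is a human decision rather than part of the search.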
Verbatim from the canonicalising source:
"For most operational recoveries, the most recent backup is the right one. For cyber events and for data corruption more generally, the most recent working copy is often a better target. If an adversary was present in the environment before detection, backups taken during that window might carry the same issues." (Source: sources/2026-05-20-aws-cyber-resilience-on-aws-a-reference-approach-for-recovery-from-ransomware-and-destructive-events)
Why this differs from generic DR's selection¶
| Disaster type | Selection heuristic | Reason |
|---|---|---|
| Region failure / AZ failure / fault | Most recent backup | The backup is trusted; only the primary failed |
| Cyber event / data corruption | Most recent working backup before event boundary | The most recent backup may carry the same issues as production |
Generic DR picks the latest backup because the backup is trusted by assumption. Cyber-resilience selection treats backups as untrusted until validated, and walks backwards because validation may fail.
The event boundary as a lower bound on adversary presence¶
The boundary is "earliest plausible indicator", not "confirmed attack timestamp". The conservative posture is "assume adversary presence earlier than current evidence shows" — which is why:
- The investigation timeline draws from multiple sources to push the boundary as far back as evidence supports.
- The validation pipeline is the second guard — catching modifications the timeline didn't surface.
Both are needed.
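One way to express "earliest plausible indicator" is as the minimum timestamp across all normalised findings. The sketch below assumes findings have already been collected into a hypothetical `Indicator` record; the `safety_margin` parameter is an illustrative way to encode the conservative posture and is not something the source prescribes.

```python
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple, Optional


class Indicator(NamedTuple):
    # Hypothetical normalised finding; field names are illustrative.
    source: str          # e.g. "cloudtrail", "guardduty", "vpc-flow-logs"
    observed_at: datetime
    note: str


def event_boundary(
    indicators: Iterable[Indicator],
    safety_margin: timedelta = timedelta(0),
) -> Optional[datetime]:
    """Earliest indicator across all sources, minus an optional margin."""
    times = [i.observed_at for i in indicators]
    if not times:
        return None  # No evidence yet: the boundary is undetermined.
    return min(times) - safety_margin


findings = [
    Indicator("guardduty", datetime(2026, 5, 3, 12, 0), "finding raised"),
    Indicator("cloudtrail", datetime(2026, 5, 2, 22, 15), "unusual API call"),
    Indicator("vpc-flow-logs", datetime(2026, 5, 3, 1, 40), "unexpected egress"),
]
print(event_boundary(findings))                       # 2026-05-02 22:15:00
print(event_boundary(findings, timedelta(hours=24)))  # 2026-05-01 22:15:00
```

Note that the detection alert (the GuardDuty finding in this made-up data) is not the boundary; an earlier indicator from another source moves it back.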
Two-guards composition: timeline + validation¶
| Guard | What it catches | What it misses |
|---|---|---|
| Timeline | Logged activity before the boundary | Modifications that didn't generate audit events |
| Validation pipeline | Modifications visible in restored state (malware, integrity violations, config drift) | Modifications that appear normal to validation checks |
Together they catch:
- Logged + visible modifications (both guards catch).
- Logged + invisible modifications (timeline catches → boundary pushes back → validation isn't even run on those candidates).
- Unlogged + visible modifications (validation catches; timeline may not show this period as suspect, but validation rejects the candidate).
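The residual class is unlogged + invisible modifications. These should be uncommon on AWS where logging is comprehensive, because control-plane API calls are recorded as CloudTrail management events, but they remain possible in poorly instrumented environments or for changes made inside a workload that never touch an AWS API.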
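The four combinations can be enumerated mechanically. This is only a restatement of the table as code, with a hypothetical helper `guards_that_catch`; it treats "logged" and "visible" as clean booleans, which real incidents rarely are.

```python
from itertools import product
from typing import List


def guards_that_catch(logged: bool, visible: bool) -> List[str]:
    """Which guard catches a modification with the given properties."""
    guards = []
    if logged:
        guards.append("timeline")      # Shows up in audit or workload logs.
    if visible:
        guards.append("validation")    # Shows up in the restored state.
    return guards


for logged, visible in product([True, False], repeat=2):
    caught = guards_that_catch(logged, visible) or ["neither (residual class)"]
    print(f"logged={logged!s:5} visible={visible!s:5} -> {', '.join(caught)}")
```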
The retention-window implication¶
The pattern requires retention extending beyond detection latency. Verbatim:
"Backup retention should include recovery points that predate realistic detection windows in your organization. Detection timing varies widely by organization and by threat type, so this is a number to set based on your own investigation capabilities and to revisit as those mature."
Practical sizing:
- Detection latency — how long does it typically take to detect a breach? (Industry baselines vary widely; can be days to months.)
- Plus margin — for the "unknown unknown" case where adversary presence predates detected indicators.
- Plus operational reserve — for cases where the most recent working RP fails validation and you need to step further back.
Mature designs commonly set retention significantly longer than the routine RPO target.
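The sizing above is simple addition, shown here as a sketch. The function name and all three input values are placeholders chosen for illustration; the source gives no specific numbers and says to set them from your own investigation capabilities.

```python
from datetime import timedelta


def minimum_retention(
    detection_latency: timedelta,
    unknown_margin: timedelta,
    step_back_reserve: timedelta,
) -> timedelta:
    """Sum of the three sizing components described above."""
    return detection_latency + unknown_margin + step_back_reserve


# Placeholder inputs, not recommendations.
retention = minimum_retention(
    detection_latency=timedelta(days=60),
    unknown_margin=timedelta(days=30),
    step_back_reserve=timedelta(days=14),
)
print(retention.days)  # 104
```

With daily backups, a 14-day reserve in this example would leave roughly fourteen additional candidates to step back through if the first pre-boundary recovery point fails validation.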
Documentation as a structural step¶
Step 4's documentation requirement is part of the algorithm. The record should capture:
- Approver identity: who authorised this restore. When MPA is configured, the approval itself is recorded automatically in CloudTrail.
- Rationale: why this candidate and not an earlier or later one, covering investigation findings, validation results, and the basis for the decision. CloudTrail does not capture the full rationale, so document it in your incident management process.
The documentation isn't bureaucracy — it's evidence for post-incident review and (if required) regulatory reporting.
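A structured record makes it harder to approve a restore with the rationale left blank. The sketch below is one possible shape, assuming a hypothetical `RecoveryPointDecision` type; the field names are illustrative and do not correspond to any AWS or MPA schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class RecoveryPointDecision:
    # Hypothetical decision record for the incident-management system.
    recovery_point_id: str
    approver: str
    rationale: str
    investigation_findings: List[str] = field(default_factory=list)
    validation_results: Dict[str, str] = field(default_factory=dict)

    def missing_fields(self) -> List[str]:
        """Names of fields left empty; an empty list means the record is complete."""
        return [name for name, value in asdict(self).items() if not value]


record = RecoveryPointDecision(
    recovery_point_id="rp-02",
    approver="jane.doe",
    rationale="Newest pre-boundary candidate that passed validation; rp-03 failed.",
    investigation_findings=["Earliest indicator: unusual API call, 2026-05-02 22:15"],
    validation_results={"rp-03": "failed", "rp-02": "passed"},
)
assert not record.missing_fields()
print(json.dumps(asdict(record), indent=2))
```

Calling `missing_fields()` before requesting approval gives reviewers a quick completeness check; it says nothing about whether the rationale is sound.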
When to use this pattern¶
Use this pattern when:
- A cyber event is confirmed.
- Data corruption affecting both production and recent backups is a realistic possibility.
- A validation pipeline exists to test candidates independently.
Weaker fit when:
- The disaster is a clear non-adversary fault (region failure, AZ outage) — generic DR's most-recent-backup heuristic applies.
- Detection latency is so short that the most recent backup is before the event boundary anyway.
- No validation pipeline exists, so step 3 can't be executed.
Composition with other patterns¶
- patterns/parallel-investigation-validation-rebuild — Stage 1 (timeline) + Stage 2 (validation) jointly execute this selection algorithm.
- patterns/mpa-gated-restore-authorization — Stage 3 gates on the documented approval that step 4 produces.
- concepts/multi-layer-restore-validation-pipeline — the per-candidate filter step 3 invokes.
- concepts/compromise-boundary-recovery-point-selection — the concept this pattern canonicalises.
Failure modes¶
- Boundary set too late (too recent). Selected candidate is actually post-compromise; restore reintroduces the adversary. Mitigation: investigation timeline draws from multiple sources; conservative interpretation of indicators.
- Insufficient retention. All retained candidates are post-boundary; recovery is impossible. Mitigation: size retention to exceed plausible detection latency.
- Validation pipeline fails for many candidates. Recovery target keeps stepping back; data loss exceeds business tolerance. Mitigation: communicate longer-than-expected RPO; ensure validation pipeline is well-tuned (not overly noisy).
- Timeline construction takes too long. Stage 1 dominates Stages 2+4 in parallel runtime; MTTR is gated on timeline. Mitigation: pre-built timeline-construction queries; SIR partnership for coordinated triage.
- Approval rationale is insufficiently documented. Post-incident review can't reconstruct the decision basis. Mitigation: structured decision template; CloudTrail captures approval as management event but not full rationale (rationale needs to live in incident-management system).
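Some of these failure modes can be flagged mechanically before approval. The following sketch checks three of them (insufficient retention, a post-boundary selection, and data loss beyond tolerance). The function name, parameters, and warning strings are hypothetical, and the checks are only as good as the boundary they are given: a boundary set too late will pass silently.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Sequence


def selection_warnings(
    candidate_times: Sequence[datetime],
    event_boundary: datetime,
    chosen: Optional[datetime],
    incident_time: datetime,
    max_tolerable_loss: timedelta,
) -> List[str]:
    """Flag failure modes that are detectable from timestamps alone."""
    warnings = []
    if not any(t < event_boundary for t in candidate_times):
        warnings.append("insufficient retention: no candidate predates the boundary")
    if chosen is None:
        warnings.append("no candidate selected")
    else:
        if chosen >= event_boundary:
            warnings.append("chosen recovery point is not pre-boundary")
        if incident_time - chosen > max_tolerable_loss:
            warnings.append("data loss exceeds stated tolerance")
    return warnings


# Illustrative data only.
times = [datetime(2026, 5, d) for d in (1, 2, 3, 4)]
result = selection_warnings(
    candidate_times=times,
    event_boundary=datetime(2026, 5, 3, 12, 0),
    chosen=datetime(2026, 5, 2),
    incident_time=datetime(2026, 5, 5),
    max_tolerable_loss=timedelta(hours=48),
)
print(result)  # ['data loss exceeds stated tolerance']
```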
Generalisation beyond AWS¶
The pattern applies wherever:
- Multiple time-ordered backups exist.
- An investigation timeline can establish a lower-bound event boundary.
- Each candidate can be validated independently.
GCP / Azure / on-prem cyber-recovery workflows apply the same algorithm with their respective audit-log + backup primitives.
Seen in¶
- sources/2026-05-20-aws-cyber-resilience-on-aws-a-reference-approach-for-recovery-from-ransomware-and-destructive-events — canonical wiki reference; explicit four-step algorithm; most-recent-working-copy framing; multi-source investigation timeline; reverse-chronological candidate evaluation; validation-pipeline filter; documented approval requirement.
Related¶
- concepts/cyber-resilience — the parent posture.
- concepts/compromise-boundary-recovery-point-selection — the concept canonicalisation.
- concepts/multi-layer-restore-validation-pipeline — the per-candidate filter.
- patterns/parallel-investigation-validation-rebuild — the workflow this pattern lives inside.
- patterns/mpa-gated-restore-authorization — the approval gate.
- systems/aws-cloudtrail, systems/amazon-vpc-flow-logs, systems/aws-security-hub — investigation-timeline substrates.