PATTERN Cited by 1 source
Scream test before destructive delete¶
Problem¶
A cleanup pipeline or supertool is about to permanently delete a resource (AWS account, DNS zone, S3 bucket, database, VPC, …). The operation is irreversible — once deleted, the resource and its data are gone. But the team proposing the deletion may not know every consumer of the resource:
- A scheduled job that runs monthly may still depend on it.
- A third team may CNAME into its DNS zone without telling the owning team.
- A legacy internal tool may resolve through its hostname on a rare code path.
Naive default: announce the deletion, then delete on a timer unless someone objects. Consumers who missed the announcement (on vacation, newly joined, on a different team) discover the dependency only when the deletion fires, and by then nothing can be done.
Solution¶
Insert a reversible, simulated-deletion phase before the irreversible delete. The resource is made to behave as deleted from the outside (so consumers scream) while remaining trivially restorable (so a single scream reverses the simulation). Only after a defined window without complaints does the real deletion run.
Zalando's canonical instance¶
Zalando's 2024-01 metadpata postmortem names the pattern as
a direct response to destructive-automation incidents:
"We have introduced a new step in the account decommissioning process that simulates deletion using Network ACLs. We also remove the delegation for the DNS zone assigned to the account to ensure that related CNAMEs will not resolve anymore. The account is left in this state for one week before proceeding further with the real decommissioning." — sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools
Two simulation mechanisms compose:
- Network ACL blocks traffic to and from the account's network resources. Denied packets are dropped silently, so consumers see connections time out, the same observable effect as a hard delete.
- DNS delegation removal makes the account's hostnames stop resolving. CNAMEs targeting the account fail.
Window: 1 week. Both mechanisms are reversible: remove the ACL, re-add the delegation.
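A minimal sketch of why both mechanisms are reversible, assuming the simulation is driven through boto3. The functions only build the keyword arguments for the EC2 `create_network_acl_entry` / `delete_network_acl_entry` calls and the `ChangeBatch` for Route 53 `change_resource_record_sets`. The rule number, IDs, zone name, and name servers are hypothetical, and none of this is taken from Zalando's actual tooling.

```python
from typing import Any, Dict, List

# Hypothetical rule number, assumed to be unused in the target ACL.
# NACL rules are evaluated lowest number first, so a low number makes
# the deny win over any existing allow rules.
SCREAM_RULE_NUMBER = 1


def nacl_deny_all(nacl_id: str, egress: bool) -> Dict[str, Any]:
    """kwargs for ec2.create_network_acl_entry: deny all IPv4 traffic."""
    return {
        "NetworkAclId": nacl_id,
        "RuleNumber": SCREAM_RULE_NUMBER,
        "Protocol": "-1",          # all protocols
        "RuleAction": "deny",
        "Egress": egress,          # one entry each for ingress and egress
        "CidrBlock": "0.0.0.0/0",  # IPv6 needs a separate entry with Ipv6CidrBlock
    }


def nacl_revert(nacl_id: str, egress: bool) -> Dict[str, Any]:
    """kwargs for ec2.delete_network_acl_entry: remove the deny rule again."""
    return {
        "NetworkAclId": nacl_id,
        "RuleNumber": SCREAM_RULE_NUMBER,
        "Egress": egress,
    }


def delegation_change(action: str, zone_name: str,
                      name_servers: List[str], ttl: int = 300) -> Dict[str, Any]:
    """ChangeBatch for route53.change_resource_record_sets in the parent zone.

    action="DELETE" removes the NS delegation; action="CREATE" restores it.
    DELETE has to match the existing record set exactly, so the same
    values must be kept around for the revert.
    """
    return {
        "Changes": [{
            "Action": action,
            "ResourceRecordSet": {
                "Name": zone_name,
                "Type": "NS",
                "TTL": ttl,
                "ResourceRecords": [{"Value": ns} for ns in name_servers],
            },
        }]
    }
```

The point of the sketch is the symmetry: each apply payload has a revert payload built from the same stored values, so whoever handles a scream does not need to reconstruct anything.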
Mechanism template¶
for each resource R scheduled for deletion:
    1. Apply simulation (ACL isolate, revoke DNS delegation,
       disable credentials, deactivate access keys, etc.)
    2. Record timestamp and owner contact.
    3. Wait SCREAM_WINDOW (e.g. 7 days).
    4. If any scream arrived during the window:
         revert simulation; escalate to human; skip delete.
    5. Otherwise:
         proceed with irreversible delete (possibly via
         manual path per cost-weighted-deletion-deferral).
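The template can be sketched as a small state machine. This is an illustrative sketch, not a production pipeline: `ScreamTest`, `Outcome`, and the four callbacks are hypothetical names, and the clock is passed in explicitly so the logic can be exercised without waiting a week. One assumption beyond the template: a scream reverts the simulation as soon as it is evaluated rather than at the end of the window.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import Callable, List, Optional

SCREAM_WINDOW = timedelta(days=7)  # e.g. 7 days, as in the template


class Outcome(Enum):
    WAITING = "waiting"    # window still open, no scream yet
    REVERTED = "reverted"  # a scream arrived; simulation undone, human escalated
    DELETED = "deleted"    # window passed quietly; irreversible delete ran


@dataclass
class ScreamTest:
    resource_id: str
    owner_contact: str
    apply: Callable[[], None]         # step 1: reversible simulation
    revert: Callable[[], None]        # step 4: undo the simulation
    delete: Callable[[], None]        # step 5: irreversible delete
    escalate: Callable[[str], None]   # step 4: hand over to a human
    started_at: Optional[datetime] = None
    screams: List[str] = field(default_factory=list)
    outcome: Outcome = Outcome.WAITING

    def start(self, now: datetime) -> None:
        self.apply()
        self.started_at = now  # step 2: record timestamp

    def record_scream(self, reporter: str) -> None:
        self.screams.append(reporter)

    def evaluate(self, now: datetime) -> Outcome:
        if self.started_at is None:
            raise RuntimeError("scream test has not been started")
        if self.outcome is not Outcome.WAITING:
            return self.outcome  # terminal states are idempotent
        if self.screams:
            self.revert()
            self.escalate(self.resource_id)
            self.outcome = Outcome.REVERTED
        elif now - self.started_at >= SCREAM_WINDOW:
            self.delete()
            self.outcome = Outcome.DELETED
        return self.outcome
```

Calling `evaluate` repeatedly, for example from a scheduled job, is safe: once a test is reverted or deleted, later calls return the stored outcome without running any callback again.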
Simulation mechanism by resource class¶
| Resource | Reversible simulation |
|---|---|
| AWS account | Network ACL + DNS delegation removal (Zalando) |
| DNS hosted zone | Remove delegation in parent zone |
| Database | Revoke all grants; keep data present |
| S3 bucket | Apply "deny all" bucket policy; keep objects |
| IAM role | Detach all policies; keep role definition |
| VPC | Apply deny-all Network ACLs to its subnets (security groups cannot express deny rules); remove routes to peering connections |
| Load balancer | Remove DNS target; keep listeners |
The common shape: make the resource unreachable via its normal access path without deleting any state.
Trade-offs and prerequisites¶
- The scream window costs money. Idle resources still accrue cost during simulation. Pair with concepts/cost-weighted-deletion-deferral to cap the ongoing cost exposure.
- Window length is a guess. One week catches weekly-and-more-frequent consumers; monthly or quarterly batch jobs still won't scream. High-risk resources may need 30-day windows.
- Observability for the scream has to already exist. If a downstream job catches the DNS failure silently, no one knows the resource was load-bearing.
- The reversal path has to be rehearsed. A scream arriving at 3am has to be reversible by whoever's on call, not only the team that wrote the scream test.
- Simulation has to be indistinguishable from deletion for consumers. A "weak" simulation (slow response instead of no response) may fool callers into thinking there's a latency issue rather than a deletion.
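The window-length trade-off can be made concrete with a little arithmetic. The sketch below assumes a strictly periodic consumer whose schedule is unrelated to the start of the simulation (uniformly random phase) and whose failure reliably produces a scream; the function names are hypothetical.

```python
def guaranteed_to_scream(period_days: float, window_days: float) -> bool:
    """True if a consumer running every period_days must run at least once
    inside the window, whatever its phase relative to the window start."""
    return period_days <= window_days


def chance_of_scream(period_days: float, window_days: float) -> float:
    """Probability that at least one run falls inside the window,
    assuming the consumer's phase is uniformly random."""
    return min(1.0, window_days / period_days)
```

Under these assumptions a 7-day window is certain to catch daily and weekly consumers, catches a 30-day job with probability 7/30 (roughly one time in four), and catches a 90-day job with probability 7/90. Extending the window to 30 days makes the monthly job certain as well.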
When not to use it¶
- The resource state itself is load-bearing and can't be left idle. A database under live replication cannot be "simulated as deleted" without actually breaking replication.
- Regulatory deletion deadlines. GDPR / CCPA right-to-be-forgotten deletions cannot wait 7 days.
- Cost of the idle resource exceeds the cost of a wrong-delete. If the scream window costs more than rebuilding a mistakenly-deleted version, skip the window.
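The cost criterion in the last bullet can be written down as a simple comparison. This is a sketch with hypothetical names and made-up numbers. The `p_wrong_delete` parameter is an assumption added here, not part of the original criterion: leaving it at 1.0 reproduces the bullet as stated, while a lower value weights the rebuild cost by how likely a wrong delete is thought to be.

```python
def window_cost(idle_cost_per_day: float, window_days: int) -> float:
    """Cost of keeping the resource idle for the whole scream window."""
    return idle_cost_per_day * window_days


def skip_scream_window(idle_cost_per_day: float, window_days: int,
                       rebuild_cost: float, p_wrong_delete: float = 1.0) -> bool:
    """True when the window costs more than the (optionally
    probability-weighted) cost of rebuilding a wrongly deleted resource."""
    expected_loss = p_wrong_delete * rebuild_cost
    return window_cost(idle_cost_per_day, window_days) > expected_loss
```

For example, an idle resource costing 10 per day over a 7-day window costs 70. If rebuilding it would cost 50, the window is not worth running; if rebuilding would cost 500, it is.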
Composes with¶
- concepts/cost-weighted-deletion-deferral — caps the cost exposure of the scream window; also chooses which resources go through the automated scream-test path at all.
- patterns/pr-preview-of-cloudformation-changeset — the PR that schedules a delete shows "will schedule scream test" rather than "will delete" in the ChangeSet preview.
- patterns/phased-rollout-across-release-channels — changes to the scream-test mechanism itself graduate through channels.
Seen in¶
- sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools: canonical wiki instance. Post-metadpata remediation introducing the 1-week Network ACL + DNS-delegation-removal scream-test step in the account decommissioning pipeline.