
PATTERN Cited by 1 source

Scream test before destructive delete

Problem

A cleanup pipeline or supertool is about to permanently delete a resource (AWS account, DNS zone, S3 bucket, database, VPC, …). The operation is irreversible — once deleted, the resource and its data are gone. But the team proposing the deletion may not know every consumer of the resource:

  • A scheduled job that runs monthly may still depend on it.
  • A third team may CNAME into its DNS zone without telling the owning team.
  • A legacy internal tool may resolve through its hostname on a rare code path.

Naive default: announce the deletion, then delete on a timer unless someone objects. Consumers who never saw the announcement (on vacation, newly joined, on a different team) discover the dependency only when the deletion fires, and by then nothing can be undone.

Solution

Insert a reversible, simulated-deletion phase before the irreversible delete. The resource is made to behave as deleted from the outside (so consumers scream) while remaining trivially restorable (so a single scream reverses the simulation). Only after a defined window without complaints does the real deletion run.

Zalando's canonical instance

Zalando's 2024-01 metadpata postmortem names the pattern as a direct response to destructive-automation incidents:

"We have introduced a new step in the account decommissioning process that simulates deletion using Network ACLs. We also remove the delegation for the DNS zone assigned to the account to ensure that related CNAMEs will not resolve anymore. The account is left in this state for one week before proceeding further with the real decommissioning." (sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools)

Two simulation mechanisms compose:

  1. A Network ACL blocks traffic to and from the account's network resources. A network ACL drops denied packets silently, so consumers see connection timeouts, the same symptom as trying to reach a deleted endpoint.
  2. DNS delegation removal makes the account's hostnames stop resolving. CNAMEs targeting the account fail.

Window: 1 week. Both mechanisms are reversible: remove the ACL, re-add the delegation.
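The reversibility of the two mechanisms can be sketched as a pair of apply/revert operations. This is a minimal in-memory model, not Zalando's implementation: `Account`, `ScreamTest`, the `parent_zone` dictionary and the example hostnames are all hypothetical stand-ins for a real network ACL and a real parent DNS zone.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Account:
    """Hypothetical stand-in for a cloud account being decommissioned."""
    name: str
    zone: str                      # DNS zone delegated to this account
    network_blocked: bool = False  # True while the deny-all ACL is in place


@dataclass
class ScreamTest:
    """Applies both simulations and keeps what is needed to undo them."""
    parent_zone: Dict[str, Tuple[str, ...]]  # zone name -> NS records
    saved_ns: Dict[str, Tuple[str, ...]] = field(default_factory=dict)

    def simulate(self, account: Account) -> None:
        # 1. Network isolation (stands in for a deny-all network ACL).
        account.network_blocked = True
        # 2. Remove the delegation, keeping a copy so it can be re-added.
        self.saved_ns[account.zone] = self.parent_zone.pop(account.zone)

    def revert(self, account: Account) -> None:
        # Undo both steps: lift the block, restore the saved NS records.
        account.network_blocked = False
        self.parent_zone[account.zone] = self.saved_ns.pop(account.zone)

    def resolves(self, hostname: str) -> bool:
        # A name resolves only while a delegated zone still covers it.
        return any(
            hostname == zone or hostname.endswith("." + zone)
            for zone in self.parent_zone
        )
```

The point of the sketch is that `simulate` never discards anything: the removed NS records are stored, so `revert` restores the exact previous state rather than reconstructing it.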

Mechanism template

for each resource R scheduled for deletion:
    1. Apply simulation (ACL isolate, revoke DNS delegation,
       disable credentials, revoke access keys, etc.)
    2. Record timestamp and owner contact.
    3. Wait SCREAM_WINDOW (e.g. 7 days).
    4. If any scream arrived during the window:
         revert simulation; escalate to human; skip delete.
    5. Otherwise:
         proceed with irreversible delete (possibly via
         manual path per cost-weighted-deletion-deferral).
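The template above could be expressed in Python roughly as follows. This is a sketch under assumptions: `Candidate`, `Hooks`, `start` and `check` are hypothetical names, the simulation, revert, escalation and delete actions are injected as callables, and the caller supplies the current time and the list of screams, so nothing here talks to a real cloud API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional, Sequence

SCREAM_WINDOW = timedelta(days=7)  # the one-week example window


@dataclass
class Candidate:
    resource_id: str
    owner_contact: str
    simulated_at: Optional[datetime] = None


@dataclass
class Hooks:
    """Injected actions; all four are placeholders for real tooling."""
    apply_simulation: Callable[[str], None]
    revert_simulation: Callable[[str], None]
    escalate: Callable[[Candidate, Sequence[str]], None]
    delete: Callable[[str], None]


def start(c: Candidate, hooks: Hooks, now: datetime) -> None:
    """Steps 1-2: simulate deletion and record when it started."""
    hooks.apply_simulation(c.resource_id)
    c.simulated_at = now


def check(c: Candidate, screams: Sequence[str], hooks: Hooks,
          now: datetime) -> str:
    """Steps 3-5: revert on any scream, otherwise delete once the window ends."""
    if c.simulated_at is None:
        raise ValueError("scream test was never started")
    if screams:
        # Step 4: a single scream reverses the simulation and stops the delete.
        hooks.revert_simulation(c.resource_id)
        hooks.escalate(c, screams)
        return "reverted"
    if now - c.simulated_at < SCREAM_WINDOW:
        return "waiting"  # step 3: window still open
    hooks.delete(c.resource_id)  # step 5: irreversible from here on
    return "deleted"
```

Checking for screams before checking the window means a complaint that arrives late, but before `check` runs, still prevents the delete.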

Simulation mechanism by resource class

Resource          Reversible simulation
AWS account       Network ACL + DNS delegation removal (Zalando)
DNS hosted zone   Remove delegation in parent zone
Database          Revoke all grants; keep data present
S3 bucket         Apply "deny all" bucket policy; keep objects
IAM role          Detach all policies; keep role definition
VPC               Apply deny-all network ACL to its subnets (security groups can only allow, not block); revoke peering
Load balancer     Remove DNS target; keep listeners

The common shape: make the resource unreachable via its normal access path without deleting any state.
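As one concrete instance of this shape, the S3 row could be realised with a bucket policy like the one built below. This is a sketch, not a vetted policy: the function name, the `Sid`, the bucket name and the break-glass role ARN are hypothetical. The exemption exists because an unconditional deny on `s3:*` for every principal would also deny the call that removes the policy, which would make the simulation hard to revert.

```python
import json


def deny_all_bucket_policy(bucket: str, break_glass_role_arn: str) -> str:
    """Build a bucket policy denying every S3 action to all principals
    except one operator role that must stay able to remove the policy."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ScreamTestDenyAll",  # hypothetical statement id
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # the bucket itself
                    f"arn:aws:s3:::{bucket}/*",  # every object in it
                ],
                # Exempt the break-glass role from the deny.
                "Condition": {
                    "ArnNotEquals": {"aws:PrincipalArn": break_glass_role_arn}
                },
            }
        ],
    }
    return json.dumps(policy, indent=2)
```

A bucket holds a single policy document, so applying this one replaces whatever was there before; the previous policy would need to be saved first so that reverting restores it rather than leaving the bucket with no policy. The exemption only lifts the deny: the break-glass role still needs its own permission to edit the bucket policy.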

Trade-offs and prerequisites

  • The scream window costs money. Idle resources still accrue cost during simulation. Pair with concepts/cost-weighted-deletion-deferral to cap the ongoing cost exposure.
  • Window length is a guess. One week catches consumers that run at least weekly; monthly or quarterly batch jobs may never run inside the window, so they won't scream. High-risk resources may need 30-day windows.
  • Observability for the scream has to already exist. If a downstream job catches the DNS failure silently, no one knows the resource was load-bearing.
  • The reversal path has to be rehearsed. A scream arriving at 3am has to be reversible by whoever's on call, not only the team that wrote the scream test.
  • Simulation has to be indistinguishable from deletion for consumers. A "weak" simulation (slow response instead of no response) may fool callers into thinking there's a latency issue rather than a deletion.
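The window-length trade-off can be made concrete with a small calculation. The sketch below assumes each consumer touches the resource on a fixed period and that every failed run is actually noticed. The consumer names and periods are invented for illustration.

```python
from typing import Dict, List

# Hypothetical consumers and how often each one touches the resource (days).
PERIODS_DAYS: Dict[str, float] = {
    "nightly export": 1,
    "weekly report": 7,
    "monthly billing run": 31,   # calendar months are up to 31 days apart
    "quarterly audit": 92,
}


def missed_consumers(window_days: float,
                     periods_days: Dict[str, float]) -> List[str]:
    """Consumers not guaranteed to run inside the window.

    A consumer with period P is certain to run at least once only if
    the window is at least P days long.
    """
    return [name for name, period in periods_days.items()
            if period > window_days]
```

With these assumed periods, `missed_consumers(7, PERIODS_DAYS)` returns the monthly and quarterly jobs, and stretching the window to 31 days still leaves the quarterly audit uncovered. Under this model a job that runs once per calendar month can be 31 days apart, so a 30-day window may miss it by a day.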

When not to use it

  • The resource state itself is load-bearing and can't be left idle. A database under live replication cannot be "simulated as deleted" without actually breaking replication.
  • Regulatory deletion deadlines. GDPR / CCPA right-to-be-forgotten deletions cannot wait 7 days.
  • Cost of the idle resource exceeds the cost of a wrong delete. If keeping the resource idle through the scream window costs more than rebuilding it after a mistaken deletion, skip the window.
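The cost comparison in the last bullet is simple arithmetic. A minimal sketch, with hypothetical function names and made-up figures:

```python
def window_cost(daily_idle_cost: float, window_days: int) -> float:
    """Cost of keeping the resource idle for the whole scream window."""
    return daily_idle_cost * window_days


def skip_scream_window(daily_idle_cost: float, window_days: int,
                       rebuild_cost: float) -> bool:
    """True when the idle window costs more than rebuilding after a wrong delete."""
    return window_cost(daily_idle_cost, window_days) > rebuild_cost
```

For example, a resource that costs 200 per day idle accrues 1,400 over a 7-day window; if rebuilding it would cost 500, the window costs more than the mistake it guards against. The estimate only holds when the rebuild cost includes any data that cannot be recreated.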

Composes with

Seen in
