CONCEPT Cited by 1 source

Scream test for deletion

Definition

A scream test is a reversible pre-delete state that makes a resource behave (from the outside) as if it is already deleted, left in place long enough for any surviving dependency to scream — throw errors, page an owner, break a customer workflow. Only after the scream-test window elapses without complaints does the actual irreversible deletion proceed.

The critical property: the state is reversible. Whatever makes the resource "look deleted" (network isolation, name resolution removal, permissions revocation) can be undone trivially if a scream arrives. Only the next step — the actual delete — is irreversible.

Zalando's instantiation

Zalando's 2024-01 postmortem of the metadpata incident introduces this as the first infrastructure-change remediation:

"We have introduced a new step in the account decommissioning process that simulates deletion using Network ACLs. We also remove the delegation for the DNS zone assigned to the account to ensure that related CNAMEs will not resolve anymore. The account is left in this state for one week before proceeding further with the real decommissioning." (Source: sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools)

The two simulation mechanisms are complementary:

  1. Network ACL isolation. Inbound and outbound traffic to the account's resources is blocked at the network layer. Upstream callers see connection timeouts, because a Network ACL deny drops packets silently rather than actively refusing them. To the caller this looks like a hard-deleted resource, but it is reversible by removing the deny rules.
  2. DNS delegation removal. The account's DNS zone delegation is removed from the parent; CNAMEs and name lookups targeting the account start failing to resolve. Reversible by re-adding the delegation.

Scream-test window: 1 week.
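
The reversibility of both mechanisms can be illustrated with a minimal in-memory sketch. It does not call any cloud API: AccountNetworkState and ScreamTest are hypothetical names introduced here, and the rule strings are placeholders rather than real Network ACL syntax. The point it tries to show is that the simulation snapshots the previous state before changing anything, so a revert restores exactly what was there.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AccountNetworkState:
    """Hypothetical stand-in for what the outside world can see of an account."""
    acl_rules: List[str]        # placeholder rule strings, not real NACL syntax
    dns_delegated: bool = True  # parent zone still delegates to the account's zone


@dataclass
class ScreamTest:
    account: AccountNetworkState
    _saved: Optional[AccountNetworkState] = field(default=None, init=False)

    def simulate_deletion(self) -> None:
        """Make the account look deleted, keeping a snapshot for revert."""
        if self._saved is not None:
            return  # already simulated; do not overwrite the original snapshot
        self._saved = AccountNetworkState(
            list(self.account.acl_rules), self.account.dns_delegated
        )
        self.account.acl_rules = ["deny all inbound", "deny all outbound"]
        self.account.dns_delegated = False

    def revert(self) -> None:
        """Undo the simulation after a scream; a no-op if nothing is simulated."""
        if self._saved is None:
            return
        self.account.acl_rules = list(self._saved.acl_rules)
        self.account.dns_delegated = self._saved.dns_delegated
        self._saved = None

    def looks_deleted(self) -> bool:
        return self._saved is not None
```

Calling simulate_deletion() twice is deliberately harmless: the second call must not replace the snapshot with the already-isolated state, or the revert path would be lost.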

Why this matters for destructive supertools

The whole motivation is supertool-blast-radius containment. A supertool that deletes is maximally dangerous because deletion is the one operation you cannot apologize for. A scream test converts the supertool's operation into a two-phase process:

  • Phase 1 (reversible): simulate deletion. This is what the supertool automates.
  • Phase 2 (irreversible): real deletion. This runs only after the scream-test window and can be gated by a human review, a cost check, or manual invocation.

Even if Phase 1 is wrongly triggered (typo, collapsed set, etc.), Phase 2 is held off long enough for a human to notice and revert the simulation.
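
The two-phase split can be sketched as a small state machine. This is an illustration under stated assumptions, not Zalando's tooling: the Decommission class, its method names, and the boolean approval flag are hypothetical, and the only value taken from the postmortem is the one-week window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

SCREAM_WINDOW = timedelta(days=7)  # one week, as in the postmortem


class Phase(Enum):
    ACTIVE = "active"
    SIMULATED = "simulated"  # Phase 1: reversible
    DELETED = "deleted"      # Phase 2: irreversible


@dataclass
class Decommission:
    account_id: str
    phase: Phase = Phase.ACTIVE
    simulated_at: Optional[datetime] = None

    def start_simulation(self, now: datetime) -> None:
        """Phase 1: the only step the automation may run unattended."""
        if self.phase is not Phase.ACTIVE:
            raise RuntimeError("simulation can only start from ACTIVE")
        self.phase = Phase.SIMULATED
        self.simulated_at = now

    def revert(self) -> None:
        """A scream arrived: go back to ACTIVE and reset the clock."""
        if self.phase is not Phase.SIMULATED:
            raise RuntimeError("nothing to revert")
        self.phase = Phase.ACTIVE
        self.simulated_at = None

    def hard_delete(self, now: datetime, approved: bool) -> None:
        """Phase 2: refuse unless the window has elapsed and a human approved."""
        if self.phase is not Phase.SIMULATED or self.simulated_at is None:
            raise RuntimeError("hard delete requires a completed simulation")
        if now - self.simulated_at < SCREAM_WINDOW:
            raise RuntimeError("scream-test window has not elapsed")
        if not approved:
            raise RuntimeError("hard delete needs explicit approval")
        self.phase = Phase.DELETED
```

There is intentionally no transition from ACTIVE straight to DELETED, so a mistaken trigger can reach the reversible state at worst.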

Distinguishing from adjacent safety mechanisms

  • Soft delete (tombstone row, deleted_at timestamp) is a database-row-level scream test. This concept generalises to any deletable resource.
  • Deletion protection (CloudFormation DeletionPolicy: Retain, S3 MFA Delete, etc.) blocks deletion outright. A scream test is an orchestrated path to deletion, not a block.
  • Dry-run shows what would happen without executing. A scream test actually executes the simulation; real callers feel the effect.
  • Blue-green swaps traffic between versions, also using a reversible intermediate state. Scream test is the deletion-specific equivalent.

Prerequisites

  • The simulation mechanism has to be cheap to revert. If undoing the Network ACL or DNS delegation change requires a complex migration, the scream test itself becomes an incident.
  • The scream-test window has to be long enough to surface dependent traffic. Zalando chose one week, which covers a full weekly business cycle, but batch jobs that run monthly still won't scream in that time.
  • Observability for complaints — if a scream test happens but no one is on-call for the dependent system, the scream gets missed.
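
The window-length prerequisite can be checked mechanically when the call periods of known dependencies are available. The sketch below assumes strictly periodic callers and a hypothetical inventory of dependency names. A caller whose period is no longer than the window is certain to call at least once during it; anything slower may stay silent and needs direct contact with its owner.

```python
from datetime import timedelta
from typing import Dict, List


def may_stay_silent(call_periods: Dict[str, timedelta],
                    window: timedelta) -> List[str]:
    """Return dependencies that are not guaranteed to call within the window.

    Assumes each dependency calls on a fixed period. With period <= window,
    at least one call must land inside the window; with period > window,
    the window can fall entirely between two calls.
    """
    return sorted(
        name for name, period in call_periods.items() if period > window
    )


# Hypothetical inventory, for illustration only.
periods = {
    "checkout-api": timedelta(hours=1),
    "weekly-export": timedelta(days=7),
    "monthly-report": timedelta(days=30),
    "quarterly-audit": timedelta(days=90),
}
at_risk = may_stay_silent(periods, window=timedelta(days=7))
```

With a one-week window, only the monthly and quarterly jobs in this made-up inventory end up in at_risk.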

Caveats

  • The scream window costs money. AWS account resources keep accruing cost during the week (data storage, reserved capacity, idle compute). Zalando pairs this with a 7-day cost threshold check (see concepts/cost-weighted-deletion-deferral) to bound the tolerance.
  • Low-traffic dependencies may not scream in time. A dependency that runs quarterly won't raise an alarm in a one-week window. High-blast-radius deletions may need longer windows or targeted outreach to owners.
  • The scream is only as loud as the dependent system's observability. If a downstream job catches the DNS failure and retries silently, no one knows the resource was load-bearing until the real delete.
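
The silent-retry caveat can be made concrete with a small sketch of a retry wrapper that reports every failed attempt instead of swallowing it. All names here are hypothetical: resolve stands in for whatever lookup the dependent job performs, and on_failure is whatever hook feeds its alerting. LookupError is used only as a generic stand-in for a resolution failure.

```python
from typing import Callable, Optional


def resolve_with_retry(
    resolve: Callable[[str], str],
    name: str,
    attempts: int,
    on_failure: Callable[[str, int, Exception], None],
) -> Optional[str]:
    """Retry a lookup, surfacing each failure so a scream test is noticed.

    Returns the resolved value, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return resolve(name)
        except LookupError as exc:
            # Report before retrying: a silent retry hides the scream.
            on_failure(name, attempt, exc)
    return None
```

A job that retries without any such hook would simply return None during the scream-test window, and nobody would learn the resource was load-bearing.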

Seen in

Last updated · 501 distilled / 1,218 read