CONCEPT Cited by 1 source

Cost-weighted deletion deferral

Definition

Cost-weighted deletion deferral is the policy of excluding from automated cleanup any resource whose cost-savings potential is low relative to the impact of a wrong deletion. Deletion of those resources is deferred to a manual (operator-initiated) path, on the judgement that a small ongoing cost increase is cheaper than the expected cost of a wrong-delete incident.

It explicitly inverts the default trade-off, in which cleanup automation optimises cost at the expense of safety.

Zalando's instantiation

From the 2024-01 metadpata postmortem:

"Having assessed the trade-offs and risks for deletion of resources, we have additionally decided to be more careful with deletion of resources that have low cost savings potential compared to the impact a wrong deletion could have. These changes are now done manually and take a longer time to complete, an acceptable trade-off we're willing to take to reduce the risk. To mitigate the potential cost increase, we are monitoring the account costs for the previous 7 days. In case it is over a certain threshold, we look at deleting the resources manually."sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools

Three mechanisms compose into the policy:

  1. Trade-off assessment per resource class. Cost-savings potential is evaluated against the blast radius of a wrong delete. High-savings / low-risk → automated deletion is fine. Low-savings / high-risk → manual.
  2. Manual deletion path. The automation does not delete the flagged classes; a human invokes the delete, with whatever extra checks the operator applies.
  3. 7-day cost-threshold gate. To avoid paying indefinitely for zero-value resources, the previous 7 days of account cost is monitored. Above a threshold, manual deletion is scheduled. Below, the resource stays.
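
The three mechanisms above can be sketched as a small routing function. This is a minimal illustration, not Zalando's implementation: the 1-5 scoring scale, the rule "automate only when savings outweigh impact", and every name here (`ResourceClass`, `deletion_path`, `schedule_manual_deletion`, the example classes and numbers) are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class DeletionPath(Enum):
    AUTOMATED = "automated"
    MANUAL = "manual"


@dataclass(frozen=True)
class ResourceClass:
    name: str
    savings_score: int  # hypothetical 1-5 estimate of cost saved by deleting
    impact_score: int   # hypothetical 1-5 estimate of wrong-delete blast radius


def deletion_path(rc: ResourceClass) -> DeletionPath:
    """Mechanisms 1 and 2: automate only when savings outweigh impact."""
    if rc.savings_score > rc.impact_score:
        return DeletionPath.AUTOMATED
    return DeletionPath.MANUAL


def schedule_manual_deletion(
    path: DeletionPath, cost_last_7_days: float, threshold: float
) -> bool:
    """Mechanism 3: a deferred class is queued for a human only above the threshold."""
    return path is DeletionPath.MANUAL and cost_last_7_days > threshold


# Hypothetical resource classes, for illustration only.
test_vm = ResourceClass("ephemeral-test-vm", savings_score=4, impact_score=1)
hosted_zone = ResourceClass("dns-hosted-zone", savings_score=1, impact_score=5)
```

Under these assumed scores, `deletion_path(test_vm)` stays on the automated path, while `deletion_path(hosted_zone)` is deferred and only reaches an operator once the 7-day account cost exceeds the chosen threshold.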

Why this is a reliability move

The incident's direct cause was a typo; the amplifier was "As part of cost-saving measures, the pacing of executing deletion operations was sped up." Accelerated automated cleanup increases the destructive-automation blast radius per unit time. Cost-weighted deletion deferral removes entire classes of resources from the automation's reach, capping how much damage the automation can do in the worst case.

The cost increase is treated as an insurance premium: the ongoing idle-resource cost is a small, predictable expense, while a wrong mass-deletion is a low-probability, catastrophic expense (data loss, customer outage, hours of manual recovery). Paying a known premium to avoid a larger expected catastrophic loss is a familiar risk-management trade.
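
The insurance framing can be made concrete with a rough expected-cost comparison. The sketch below is illustrative only: the function name, the simple "premium versus probability times incident cost" model, and all of the numbers are assumptions, and the source publishes no such figures.

```python
def deferral_is_cheaper(
    idle_cost_per_month: float,
    months: int,
    p_wrong_delete: float,
    incident_cost: float,
) -> bool:
    """Compare the idle-cost 'premium' with the expected loss it avoids."""
    premium = idle_cost_per_month * months          # small, predictable
    expected_loss = p_wrong_delete * incident_cost  # rare, catastrophic
    return premium < expected_loss


# Hypothetical figures: 50/month of idle cost over a year versus a 1% chance
# of a 200,000 incident over the same period.
cheap_to_keep = deferral_is_cheaper(50.0, 12, 0.01, 200_000.0)

# Same risk, but the idle resources cost 500/month: the premium now exceeds
# the expected loss, so this class is a poorer candidate for deferral.
expensive_to_keep = deferral_is_cheaper(500.0, 12, 0.01, 200_000.0)
```

In practice both the probability and the incident cost are hard to estimate, which is one reason the policy is applied per resource class rather than per resource.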

Prerequisites

  • Classification rule for which resources are low-savings / high-risk. Hosted zones, persistent storage, production databases, IAM roles, and DNS records typically sit here. Ephemeral compute and test-account resources typically sit in the high-savings bucket, where automation stays.
  • Cost observability per account or per resource class with at least a rolling 7-day window.
  • An operator rota for the manual deletion path. If no one owns the manual deletes, the deferral becomes indefinite accumulation and the cost gate does nothing.
  • A threshold that reflects the organisation's acceptable ongoing idle cost. Zalando does not publish theirs.
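
The cost-observability and threshold prerequisites can be sketched as a trailing-window check over daily account costs. This is a minimal example under stated assumptions: the function names, the in-memory `dict` of daily costs, the choice to treat missing days as zero, and the sample figures are all hypothetical; a real setup would read from whatever billing export the organisation uses.

```python
from datetime import date, timedelta


def trailing_cost(
    daily_costs: dict[date, float], today: date, window_days: int = 7
) -> float:
    """Total cost over the `window_days` full days before `today`.

    Days with no recorded cost are counted as 0.0 (an assumption of this sketch).
    """
    return sum(
        daily_costs.get(today - timedelta(days=offset), 0.0)
        for offset in range(1, window_days + 1)
    )


def over_threshold(
    daily_costs: dict[date, float],
    today: date,
    threshold: float,
    window_days: int = 7,
) -> bool:
    """True when the trailing-window cost calls for a manual deletion review."""
    return trailing_cost(daily_costs, today, window_days) > threshold


# Hypothetical account: a flat 10.0 per day for the ten days before `today`.
today = date(2024, 1, 22)
costs = {today - timedelta(days=n): 10.0 for n in range(1, 11)}
```

With these sample figures the trailing 7-day cost is 70.0, so a threshold of 60.0 would trigger a manual review and a threshold of 80.0 would not.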

Distinguishing from adjacent concepts

  • Deletion protection (DeletionPolicy: Retain, MFA delete, etc.) blocks deletion structurally. Cost-weighted deletion deferral is a policy decision, not a resource-level technical block.
  • Pause the daemon. The temporary deactivation Zalando did immediately post-incident is a binary stop; cost-weighted deletion deferral is the permanent, class-scoped version of the same idea.
  • Scream test for deletion happens after the decision to delete; cost-weighted deletion deferral happens before. The two compose: classes eligible for automation go through scream-test; classes ineligible skip the automation entirely.

Caveats

  • Manual deletion has its own blast radius. A human running a bulk delete can also typo the target. The policy only narrows what's in scope of automation — it doesn't eliminate destructive risk.
  • The 7-day window and threshold are sensitive. Too short, and the cost gate fires on normal weekly fluctuations; too long, and high-cost leaks sit for weeks. Zalando doesn't publish their values.
  • Low-value flag drift. What counts as "low cost savings potential" changes as pricing, usage patterns, and workload shapes evolve; the classification needs periodic revisit.
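
The window-sensitivity caveat can be illustrated with a toy account whose cost follows a weekly rhythm. Everything in this sketch is assumed for the example: the `gate_fires` helper, the per-day threshold scaled to the window length, and the weekday/weekend cost figures.

```python
def gate_fires(
    daily_costs: list[float], window_days: int, per_day_threshold: float
) -> bool:
    """Fire when the trailing window's cost exceeds the threshold scaled to its length."""
    window = daily_costs[-window_days:]
    return sum(window) > per_day_threshold * len(window)


# Hypothetical account: 40/day at the weekend, 100/day on weekdays,
# listed Saturday through Friday.
week = [40.0, 40.0, 100.0, 100.0, 100.0, 100.0, 100.0]

# A 2-day window sees only Thursday and Friday and fires on ordinary weekday load.
short_window_fires = gate_fires(week, window_days=2, per_day_threshold=90.0)

# A 7-day window averages over the weekly cycle and stays quiet.
full_week_fires = gate_fires(week, window_days=7, per_day_threshold=90.0)
```

The same reasoning runs the other way for very long windows: a genuine leak that starts mid-window is diluted by the cheaper days before it and takes longer to cross the threshold.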

Seen in
