CONCEPT Cited by 1 source

Supertool

Definition

A supertool is Zalando's term for "applications or scripts that have the ability to execute large-scale changes across the infrastructure" — including destructive changes. A supertool is usually a daemon or background job that ingests a small-looking configuration input and emits fleet-wide operations: creation, lifecycle transitions, and cleanup / decommissioning of resources. The term was coined in Zalando's 2024-01 postmortem for a November 2022 DNS incident.

(Source: sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools)

Why the term is load-bearing

Review cultures systematically under-weight supertool PRs. The PR looks like a small config edit — a YAML field change, a JSON map entry, a line in a manifest — so reviewers apply the review rigor appropriate to a small config edit. But the blast radius is not on the diff surface; it's in the supertool's interpretation of the config. Zalando names this directly:

"Supertools never sleep (unless you program them otherwise!). They're powerful yet often misjudged in review processes as they're expected to only trigger action in the scope of expected changes. As our story shows, this is highly dependent on the implementation." (Source: sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools)

The canonical failure mode — collapse-to-all

Zalando's metadpata incident is the textbook instance:

  1. Supertool: an account-lifecycle background job that sets up and decommissions AWS accounts.
  2. Config: YAML with a metadata object used to compute the set of accounts-in-scope for each operation.
  3. Change: a PR inserted a stray "p", turning the key metadata into metadpata.
  4. Interpretation: the supertool read metadata as missing; no accounts in scope.
  5. Semantic collapse: the decommission code path read "no accounts specified" as "all accounts are decommissioned."
  6. Result: Route 53 hosted zones started getting deleted fleet-wide until a deletion error halted the run.

The bug is the semantic collapse, not the typo — a correct YAML key with an empty value would have done the same.
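The collapse can be modelled in a few lines. The sketch below is illustrative only and is not Zalando's code: the function names, the `accounts` field inside `metadata`, and the account IDs are all assumptions. It shows how a "decommission everything not listed" rule turns a missing or empty selector into a fleet-wide deletion set, and a fail-closed alternative that refuses to infer scope.

```python
class ScopeError(ValueError):
    """Raised when the config does not explicitly name a non-empty scope."""


def to_decommission_unsafe(config: dict, existing: list[str]) -> list[str]:
    # Hypothetical buggy logic: anything not listed as desired is decommissioned.
    # A missing key, a typo'd key, or an empty value all yield an empty desired
    # set, so every existing account lands in the deletion set.
    desired = set((config.get("metadata") or {}).get("accounts") or [])
    return sorted(set(existing) - desired)


def to_decommission_safe(config: dict, existing: list[str]) -> list[str]:
    # Fail closed: an absent or empty selector is an error, never "all".
    metadata = config.get("metadata")
    if not isinstance(metadata, dict):
        raise ScopeError("'metadata' is missing or empty; refusing to infer scope")
    desired = set(metadata.get("accounts") or [])
    if not desired:
        raise ScopeError("no accounts listed; refusing to decommission everything")
    return sorted(set(existing) - desired)


existing_accounts = ["acct-a", "acct-b", "acct-c"]  # hypothetical fleet
typo_config = {"metadpata": {"accounts": ["acct-a", "acct-b", "acct-c"]}}
empty_config = {"metadata": None}  # correct key, empty value
```

With either `typo_config` or `empty_config`, the unsafe function returns all three accounts, while the safe one raises `ScopeError`. The typo and the empty value are indistinguishable to the unsafe code path, which is why the fix belongs in the interpretation rather than in spell-checking the key.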

Supertool-specific risk amplifiers

Beyond the interpretation flaw, Zalando names two reliability-regressing decisions that made the blast radius larger:

  • Cost-optimization on destructive pacing. "As part of cost-saving measures, the pacing of executing deletion operations was sped up." Faster deletion means a larger blast radius before a human notices or an error halts the run.
  • Unbounded operation scope per run. The supertool operated on every account matching the computed set, with no per-run cap or staged rollout.
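Both amplifiers have a straightforward counter-measure: cap how many destructive operations a single run may perform, and pace them so there is time to notice a problem. The following is a minimal sketch under assumed names (`run_deletions`, `max_per_run`, `pause_seconds` are hypothetical, not taken from Zalando's tooling); the `delete` callback stands in for whatever actually removes a resource.

```python
import time
from typing import Callable, Iterable


class BlastRadiusExceeded(RuntimeError):
    """Raised when a run would touch more targets than the configured cap."""


def run_deletions(
    targets: Iterable[str],
    delete: Callable[[str], None],
    *,
    max_per_run: int = 5,
    pause_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    targets = list(targets)
    # Refuse the whole run instead of truncating it: an oversized deletion
    # set is treated as evidence that the scope computation went wrong.
    if len(targets) > max_per_run:
        raise BlastRadiusExceeded(
            f"{len(targets)} targets exceeds cap of {max_per_run}; nothing deleted"
        )
    deleted: list[str] = []
    for index, target in enumerate(targets):
        if index > 0:
            sleep(pause_seconds)  # deliberate pacing between destructive calls
        delete(target)
        deleted.append(target)
    return deleted
```

Rejecting the run outright, rather than deleting the first `max_per_run` targets, is a design choice in this sketch: a partial deletion of a wrongly computed set is still a wrong deletion.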

Remediations Zalando applied

After the incident, Zalando:

  1. Reviewed all supertools in place (inventory exercise).
  2. Temporarily deactivated the relevant deletion processes.
  3. Paused changes to the configuration until safety nets were in place.
  4. Shipped the safety-net stack (each catalogued on the wiki):
       • patterns/scream-test-before-destructive-delete: Network ACL + DNS delegation removal for 1 week before real deletion.
       • concepts/cost-weighted-deletion-deferral: resources with a low cost-savings-to-impact ratio are deleted manually, behind a 7-day cost-threshold gate.
       • patterns/pr-preview-of-cloudformation-changeset: PR comment showing per-account CloudFormation ChangeSet previews.
       • patterns/jsonschema-validated-config-both-local-and-ci: jsonschema validation at pre-commit + CI + IDE.
       • patterns/phased-rollout-across-release-channels: changes must graduate playground → test → infra → production.
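Of these safety nets, schema validation is the one that would have rejected the metadpata typo before the supertool ever read it. The sketch below is a standard-library stand-in for the kind of rule a JSON Schema expresses with `required` and `additionalProperties: false`; it is not Zalando's schema, and the single `metadata` top-level key is an assumption based on the incident description.

```python
ALLOWED_TOP_LEVEL = {"metadata"}   # assumed config shape, for illustration
REQUIRED_TOP_LEVEL = {"metadata"}


def validate_config(config: object) -> list[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    if not isinstance(config, dict):
        return ["config must be a mapping"]
    errors: list[str] = []
    # Unknown keys are rejected, so a typo cannot silently pass as "absent".
    for key in sorted(set(config) - ALLOWED_TOP_LEVEL):
        errors.append(f"unknown key: {key!r}")
    for key in sorted(REQUIRED_TOP_LEVEL - set(config)):
        errors.append(f"missing required key: {key!r}")
    # A present-but-empty value is rejected too, closing the second route
    # to an empty scope.
    if "metadata" in config:
        metadata = config["metadata"]
        if not isinstance(metadata, dict) or not metadata:
            errors.append("'metadata' must be a non-empty mapping")
    return errors
```

Run as a pre-commit hook and again in CI, a check of this shape turns the typo into a failed build instead of an empty scope: `validate_config({"metadpata": {...}})` reports both the unknown key and the missing required one.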

Seen in
