ZALANDO 2024-01-22

Zalando: Tale of 'metadpata': the revenge of the supertools

Summary

Zalando's 2024-01-22 post (author Adrian Chifor, Principal Engineer on the cloud infrastructure team) is a postmortem of a DNS outage in November 2022, which struck in the thick of Cyber Week preparation. A pull request that edited YAML configuration for one test account contained a stray p character, turning the field name metadata into metadpata. The config was consumed by an account-lifecycle background job, an example of what Zalando's postmortem calls a supertool: an application with fleet-wide destructive authority. Because the job used the metadata object to compute the set of accounts to operate on, the typo collapsed that set to "no accounts specified", which the job's logic interpreted as "all accounts are decommissioned". The decommission path began deleting Route 53 hosted zones across AWS accounts, taking the Zalando shop offline for customers and locking almost everyone except the cloud infrastructure team out of internal tools, which depend on those DNS entries. Only a deletion error partway through limited the blast radius. Recovery was a manual restoration of DNS entries from cached copies, in the order essential tooling → core infrastructure → on-site services.
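The collapse from "no accounts specified" to "all accounts" can be illustrated with a minimal sketch of a scope computation that fails closed instead. Everything here is hypothetical: the post does not publish the tool's code or config layout, so the `metadata.accounts` structure, the function name, and the `max_fraction` cap are assumptions made purely for illustration.

```python
from collections.abc import Iterable, Mapping


class UnsafeScopeError(ValueError):
    """Raised when a destructive job cannot trust its computed scope."""


def accounts_to_decommission(
    config: Mapping[str, object],
    live_accounts: Iterable[str],
    max_fraction: float = 0.1,  # assumed safety cap, not from the post
) -> set[str]:
    # Assumed layout: config["metadata"]["accounts"] lists accounts to keep.
    metadata = config.get("metadata")
    if not isinstance(metadata, Mapping):
        # A misspelled key such as "metadpata" lands here.
        raise UnsafeScopeError("'metadata' missing or malformed; refusing to act")

    desired = set(metadata.get("accounts") or [])
    if not desired:
        # A correctly spelled key with an empty value lands here.
        raise UnsafeScopeError("no accounts declared; not treating this as 'delete all'")

    live = set(live_accounts)
    doomed = live - desired
    if live and len(doomed) / len(live) > max_fraction:
        raise UnsafeScopeError(
            f"{len(doomed)} of {len(live)} accounts would be decommissioned"
        )
    return doomed
```

The design choice is that an empty or missing scope is an error rather than a valid input, and that an unexpectedly large deletion set stops the run before any destructive call is made.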

The post is explicitly not a toolchain rant; it's an infrastructure-change-safety retrospective. Six remediations are catalogued, grouped under "account lifecycle management", "change validation", "change previews", and "phased rollout":

Scream test via Network ACLs + DNS delegation removal, left for one week before real decommissioning — a reversible pre-delete state that surfaces surviving dependencies.
Cost-benefit-weighted deletion deferral: resources with low cost-savings potential relative to the impact of a wrong deletion are now deleted manually, with a 7-day cost-threshold check gating whether manual deletion happens at all.
Mandatory-key presence checks + cfn-lint preview of every stack template before deploy.
jsonschema validation of all YAML configuration files, run both locally via pre-commit hooks and in CI/CD pipelines; IDE autocompletion and schema validation via the # yaml-language-server: $schema=… comment directive (Red Hat's YAML Language Server idiom).
Pull-request change previews via CloudFormation ChangeSets — read the ChangeSet from each account in the AWS Organization, merge into a human-readable PR comment, then drop the ChangeSet. The author can execute or reject from the PR.
Phased rollout across release channels: the phased-rollout idea that already existed for Zalando's Kubernetes cluster upgrades is extended to AWS infrastructure changes. Release channels map to AWS account categories (playground, test, infra, …, production); every change must graduate through all channels before hitting production. The acknowledged trade-off: rollouts take longer.
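The two change-validation items in the list above (mandatory-key presence and schema validation) amount to rejecting a document whose keys are missing or undeclared. The sketch below is a stdlib-only stand-in for what a JSON Schema with `required` and `additionalProperties: false` would enforce; it is not Zalando's schema and does not use the jsonschema package the post names. The key names in `SCHEMA` are assumptions.

```python
import difflib

# Hypothetical, heavily simplified schema: only top-level keys are checked.
SCHEMA = {
    "required": {"metadata"},
    "allowed": {"metadata", "resources"},
}


def validate_keys(document: dict, schema: dict = SCHEMA) -> list[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    keys = set(document)
    for key in sorted(schema["required"] - keys):
        errors.append(f"missing required key: {key!r}")
    for key in sorted(keys - schema["allowed"]):
        # Suggest the closest declared key so a typo is easy to spot.
        hint = difflib.get_close_matches(key, sorted(schema["allowed"]), n=1)
        suffix = f" (did you mean {hint[0]!r}?)" if hint else ""
        errors.append(f"unknown key: {key!r}{suffix}")
    return errors


if __name__ == "__main__":
    for message in validate_keys({"metadpata": {"accounts": ["a"]}}):
        print(message)
```

Running the same check in a pre-commit hook and in CI means the typo is reported twice before any job consumes the file: once on the author's machine and once server-side.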

The post's own summary: "Supertools never sleep (unless you program them otherwise!). They're powerful yet often misjudged in review processes as they're expected to only trigger action in the scope of expected changes. As our story shows, this is highly dependent on the implementation and it's highly important to implement additional safety nets in the processes and tooling."

Key takeaways

A supertool is an application with fleet-wide destructive authority that runs on a normal-looking config change. The class name Zalando coins is load-bearing: review cultures systematically under-weight supertool PRs because the PR looks like a small config edit — the delta in impact is not on the diff surface. "They're expected to only trigger action in the scope of expected changes." (Source: sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools)
The collapse-to-all trap is a concrete supertool failure mode: a set-computation that converts "no targets specified" into "every target" via a logic shortcut. Zalando's account-lifecycle tool used the metadata object to compute the accounts-in-scope set; the metadpata typo turned that set empty; the decommission path read empty as "decommission everything." The bug is the semantic collapse, not the typo — a correct YAML key with an empty value would have had the same effect. The pacing of destructive operations had also been sped up as a cost-saving measure, amplifying how much damage the collapse could do before any human noticed.
Cost optimization on destructive automation is a reliability risk. Zalando explicitly names the accelerated deletion pacing ("As part of cost-saving measures, the pacing of executing deletion operations was sped up") as a contributor: faster deletions mean a larger blast radius before any error surfaces. The remediation inverts the trade-off — low-savings resources are deferred to manual deletion with a 7-day cost-threshold gate; the potential cost increase is accepted as cheaper than the expected cost of a wrong-delete incident.
DNS entries are a load-bearing infrastructure control plane. Deleted hosted zones didn't just take the customer shop offline — they locked most of the organisation out of internal tools and AWS account consoles, because the tools' own access paths resolve through the same DNS. "Apart from changing configuration for a test account … all of us except for the cloud infrastructure team were locked out of accessing AWS accounts and internal tools due to missing DNS entries, rendering the incident response difficult." See concepts/dns-outage-recovery and patterns/recovery-priority-essential-tooling-first.
Scream test via Network ACLs + delegation removal is a reversible pre-delete state. "We have introduced a new step in the account decommissioning process that simulates deletion using Network ACLs. We also remove the delegation for the DNS zone assigned to the account to ensure that related CNAMEs will not resolve anymore. The account is left in this state for one week before proceeding further with the real decommissioning." The trick: the account is still restorable (ACLs can be removed, DNS delegation can be re-added), but from the outside the account looks dead. A surviving dependency screams within the 7-day window. See patterns/scream-test-before-destructive-delete and concepts/scream-test-for-deletion.
Change previews via CloudFormation ChangeSets belong in the PR. Zalando's change-preview workflow: for every pull request that edits a CloudFormation stack, call the CreateChangeSet API in each account in the AWS Organization, merge the resulting JSON previews into a human-readable summary, post it as a PR comment, and drop the ChangeSet. Approvers see the per-account actual delta, not just the template diff. (Source: sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools)
jsonschema + yaml-language-server gives a single validation definition across commit-time, CI, and IDE. Zalando's config-validation strategy uses one JSON Schema as the source of truth, run (a) in pre-commit hooks locally, (b) in the CI/CD pipeline server-side, (c) live in developer IDEs via the # yaml-language-server: $schema=schema/config_schema.json comment the Red Hat YAML Language Server reads. The failure the original typo slipped through was exactly what this catches: a field that doesn't match any declared property. See patterns/jsonschema-validated-config-both-local-and-ci.
Phased rollout across release channels extends the Kubernetes-cluster-rollout idea to AWS accounts. The existing Kubernetes cluster-rollout shape — gradual advancement through groups of clusters, every change hitting every group in order — is re-used at the AWS infrastructure layer by treating AWS account categories as the channels: playground → test → infra → production. Trade-off: every rollout now takes longer, accepted in exchange for limited-blast-radius early detection. (Source: sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools)
Incident Commander specialization reduces context-switching cost in long incidents. "Throughout the incident response we switched the Incident Commanders according to their areas of expertise which kept the incident response focused and efficient." Rotating ICs by expertise area (not just by time) matches the phase of the response to the skillset required.
Google Docs concurrent-editor limit is a postmortem failure mode. A real operational lesson buried in the text: high-attention postmortems attract more concurrent editors than Google Docs supports, blocking editing. Zalando's workaround — "we've changed all links to Post Mortem documents shared with big audiences use the /preview URL by default" — is a tiny process change with a real productivity impact.
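The merge step of the ChangeSet preview described in the takeaways above can be sketched as a pure function. It assumes each per-account input is shaped like a CloudFormation DescribeChangeSet response (a `Changes` list whose entries carry a `ResourceChange` with `Action`, `ResourceType`, `LogicalResourceId`, and optionally `Replacement`). Creating, fetching, and dropping the ChangeSets, and posting the comment, are omitted; the output layout and the `[destructive]` marker are illustrative assumptions, not Zalando's format.

```python
def render_preview(changesets_by_account: dict[str, dict]) -> str:
    """Merge per-account ChangeSet descriptions into one readable summary."""
    lines = ["CloudFormation change preview", ""]
    for account_id in sorted(changesets_by_account):
        changes = changesets_by_account[account_id].get("Changes", [])
        if not changes:
            lines.append(f"{account_id}: no changes")
            continue
        lines.append(f"{account_id}: {len(changes)} change(s)")
        for change in changes:
            rc = change.get("ResourceChange", {})
            action = rc.get("Action", "?")
            # Removals and forced replacements are what a reviewer must not miss.
            destructive = action == "Remove" or rc.get("Replacement") == "True"
            marker = "  [destructive]" if destructive else ""
            lines.append(
                f"  - {action} {rc.get('ResourceType', '?')} "
                f"{rc.get('LogicalResourceId', '?')}{marker}"
            )
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {
        "111111111111": {"Changes": [{"Type": "Resource", "ResourceChange": {
            "Action": "Remove",
            "ResourceType": "AWS::Route53::HostedZone",
            "LogicalResourceId": "ShopZone",
        }}]},
        "222222222222": {"Changes": []},
    }
    print(render_preview(sample))
```

Listing every account, including those with no changes, lets an approver confirm that a test-account edit really touches only the test account.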

Systems and concepts extracted

Systems: systems/cloudformation (ChangeSet API is load-bearing for the PR-preview pattern), systems/cfn-lint, systems/yaml-language-server, systems/pre-commit, systems/aws-route-53 (hosted zones deleted during the incident), systems/kubernetes (the prior-art phased-rollout substrate), systems/github-actions (CI/CD pipeline host).
Concepts: supertool (new), scream test for deletion (new), destructive-automation blast radius (new), cost-weighted deletion deferral (new), CloudFormation ChangeSet (new), jsonschema config validation (new), release-channel rollout (new), DNS outage recovery (new), blast radius (extended).
Patterns: PR preview of CloudFormation ChangeSet (new), phased rollout across release channels (new), jsonschema-validated config at commit and CI (new), scream test before destructive delete (new).
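Of the patterns listed above, phased rollout across release channels reduces to a small ordering rule: a change reaches a channel only after every earlier channel has accepted it. The sketch below assumes a boolean health signal per channel; the post does not disclose its actual graduation criteria, and its channel list is elided between infra and production, so the four names here are only the ones it gives as examples.

```python
from typing import Callable

# Example categories named in the post; the real list has more entries.
CHANNELS = ["playground", "test", "infra", "production"]


def phased_rollout(
    apply: Callable[[str], bool],
    channels: list[str] = CHANNELS,
) -> list[str]:
    """Apply a change channel by channel, stopping at the first failure.

    `apply` is a hypothetical callback that deploys to one channel and
    returns True if the channel is healthy afterwards.
    """
    promoted = []
    for channel in channels:
        if not apply(channel):
            break  # later channels, including production, are never touched
        promoted.append(channel)
    return promoted


if __name__ == "__main__":
    # A change that breaks in "infra" never reaches production.
    print(phased_rollout(lambda channel: channel != "infra"))
```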

Operational numbers

Incident scope: 60+ colleagues on the response call when the author joined.
Trigger: single p character in one YAML field name (metadata → metadpata) in a PR that also touched test-account config.
Scream-test window: 1 week — account left in ACL-isolated, delegation-removed state before real decommissioning.
Cost-threshold gate: previous 7 days of account cost observed before deciding whether to delete a resource or leave it.
Release channels named: playground, test, infra, production (as example categories).
Recovery time: "we were back online in a few hours" (no precise MTTR given).
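The one-week scream-test window listed above behaves like a small state decision: once an account is isolated, any reported breakage restores it, and real deletion is only allowed after the window has passed quietly. This is a sketch under assumptions; the function name, the return values, and the idea of collecting "screams" as a list of reports are hypothetical, and the Network ACL and DNS delegation changes themselves are not shown.

```python
from datetime import datetime, timedelta

SCREAM_WINDOW = timedelta(days=7)  # the one-week window from the post


def decommission_step(
    isolated_at: datetime,
    now: datetime,
    screams: list[str],
) -> str:
    """Decide what to do with an account in the isolated pre-delete state.

    Returns "restore", "wait", or "delete".
    """
    if screams:
        # Someone still depends on the account: undo the ACLs and delegation removal.
        return "restore"
    if now - isolated_at < SCREAM_WINDOW:
        return "wait"
    return "delete"


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    print(decommission_step(start, start + timedelta(days=3), []))
```

Checking for screams before checking the clock means a late report still wins over an expired window, which keeps the reversible state reversible for as long as possible.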

Caveats / gaps

No MTTR, no revenue impact, no customer-facing downtime figure disclosed. "The shop to effectively go offline for our customers" is the entire impact disclosure.
Scope of DNS-zone deletion not quantified. How many hosted zones, how many accounts, how many CNAMEs — not published.
Supertool inventory not revealed. The post says "a review of all supertools in place" happened but doesn't publish the count or the taxonomy.
The account-lifecycle tool is not named. Almost certainly an in-house Zalando tool (cloud-team-operated AWS Organization manager), but the post keeps it unnamed.
Change-preview implementation depth not disclosed — does the bot invoke CreateChangeSet serially or in parallel across accounts; what's the permission shape; are Service Control Policies updated similarly; what's the latency from PR open to preview? All silent.
Release-channel graduation criteria not disclosed — does a change advance automatically on a timer, only after manual approval, or only after health signals clear? The trade-off "takes a longer time" implies some timer, but the post doesn't specify.
The 2023-10 follow-up referenced indirectly — the post is a 2024-01 writeup of a November 2022 incident. Other Zalando posts published in the interim don't reference this incident by name (confirmed on the wiki), suggesting the organisation treats it as an isolated lesson rather than a named axis.