
Undocumented production change

An undocumented production change is any mutation of production state — code deploy, config push, schema migration, infra tweak, credential rotation, runtime toggle — that did not go through the organization's declared change process. The change exists; the process record of it does not.

These are the changes that traditional change-management architectures are structurally blind to. A CAB can only approve changes that are submitted to it; a change-request ticket only documents what someone typed into it. Anything that bypasses the gate is invisible to the gate — and that invisibility is the risk.

The Swedbank shape

From the 2023-08-16 post: Swedbank's April 2022 outage was caused by "an unapproved change to [their] IT systems" that temporarily corrupted ~1M customer balances. The regulator's finding: "none of the bank's control mechanisms were able to capture the deviation and ensure that the process was followed." In other words: the bank had a process, the process would (notionally) have caught the risk — but the change didn't go through the process, so the process didn't matter.

Why they happen

  • Emergency hotfixes under pressure, with the promise to "document it later."
  • Misaligned tooling boundaries: console changes, CLI changes, shell access, database direct-writes, feature-flag flips — each a separate surface, not all of which are wired into the change system.
  • Shadow IT / outsourcing: third-party vendors pushing changes on the bank's behalf outside the bank's own change system.
  • Well-intentioned operator shortcuts: a sysadmin who edits a config file on a single host to "test something" and forgets to revert.
  • Adversarial actions: a malicious insider deliberately bypassing the gate.

The stream-into-lake metaphor

The source post's analogy, quoted verbatim:

"We can think of software changes as streams, feeding into our environments which are lakes. Change management puts a gate in the stream to control what flows into the lake, but doesn't monitor the lake. If it is possible to make a change to production without detection, then change management only protects one source of risk. The only way to be sure you don't have undocumented production changes is with runtime monitoring."

The gate-only posture leaves the lake unmonitored. The runtime change detection pattern closes the gap by continuously diffing production state against an authoritative change record — any observed change that doesn't match a documented change is a detected drift event, to be triaged.
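The compare-and-flag step can be sketched in a few lines. This is an illustrative sketch, not the source's implementation; the `Change` fields and the fingerprint-matching rule are assumptions chosen for clarity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Change:
    surface: str      # e.g. "config", "schema", "container-image"
    target: str       # the production resource that was mutated
    fingerprint: str  # hash of the observed new state

def detect_drift(observed: list[Change], documented: set[str]) -> list[Change]:
    """Return every observed change whose fingerprint has no matching
    entry in the authoritative change record -- each unmatched change
    is a drift event to be triaged."""
    return [c for c in observed if c.fingerprint not in documented]
```

Everything hinges on having both feeds: a continuous stream of observed changes (the lake) and an authoritative record of documented ones (the gate's log); drift is simply the set difference.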

How to detect them

| Surface | Detection method |
| --- | --- |
| Binary / container image version | Periodic poll of running versions vs. declared release manifest |
| Config files | File-integrity monitoring (e.g. Tripwire-class tools, systems/aws-devops-agent runtime config diffs) |
| Kubernetes manifests | GitOps reconciliation drift alerts |
| Feature flags | Audit-trail diff of flag state vs. declared ruleset — see concepts/audit-trail |
| Database schema | pg_dump / mongodump diff vs. migration history |
| Cloud infra | AWS Config / GCP Asset Inventory drift detection |
| Network rules | Periodic firewall-rule inventory diff |

The unifying property: each detection path compares observed state against a declared state and flags the delta. This is the same inversion that GitOps formalizes for Kubernetes — make the desired state the version-controlled artefact, so that anything else is, by construction, drift.

Relationship to Knight Capital

The source post cross-references the Knight Capital 2012 incident (see systems/knight-capital-smars), in which a partial deployment left old and new code paths running simultaneously on different servers. Knight didn't technically have an undocumented change — but it had an undocumented deployment state: nobody could enumerate, in real time, which servers were running which code version. The operational failure mode is the same: the declared state and the observed state diverged, and nobody was monitoring the observed state.
