
HIGHSCALABILITY 2023-08-16 Tier 1


High Scalability — The Swedbank Outage shows that Change Controls don't work

Summary

A 2023-08-16 opinion-analysis piece on High Scalability (authored by a Kosli engineer, republished on Hoff's blog) using the April 2022 Swedbank outage and its SEK 850M (~USD 85M) regulatory fine from Finansinspektionen (the Swedish Financial Supervisory Authority) as a case study for why traditional change management processes — Change Advisory Boards, manual approvals, service windows — do not actually reduce risk in modern IT organizations. The post triangulates three data points: (1) Swedbank's own regulator finding that "none of the bank's control mechanisms were able to capture the deviation" despite documented change controls; (2) the UK FCA's empirical study of ~1M production changes where CABs approved >90% of major changes and some firms rejected zero changes during 2019; and (3) the DORA research from the Accelerate book (Forsgren, Humble, Kim, 2018) finding that external approvals are negatively correlated with lead time, deployment frequency, and restore time, and uncorrelated with change-fail rate ("worse than having no change approval process at all"). The author's thesis: the problem is unaddressed risk, not change itself; risk reduction comes from smaller, more frequent releases + runtime monitoring + fast rollback, not paperwork.

Key takeaways

  1. A change that bypassed the process collapsed one of Europe's largest retail banks for a day. "Swedbank had a major outage in April 2022 that was caused by an unapproved change to their IT systems. It temporarily left nearly a million customers with incorrect balances, many of whom were unable to meet payments." The Finansinspektionen finding: "The deficiencies that were present in Swedbank's internal control made it possible to make changes to one of the bank's most central IT systems without following the process in place at the bank to ensure continuity and reliable operations." The regulator's full toolkit included revoking the banking licence; the sanction was "limited to a remark and an administrative fine" of SEK 850M (~USD 85M).

  2. The regulator's framing is self-referential — and unhelpful. The bank committed to a change-management process → the process wasn't followed → the bank was fined for the breach and for "insufficient controls". "The position of the regulator constitutes self-referential logic. You said you'd do something to manage risk, it wasn't done, therefore you are in violation." This creates a compliance theatre dynamic: the bank's rational next move is to add more change controls to hedge against future fines — whether or not those controls reduce real risk.

  3. CABs approve virtually everything — they are not a filter. Citing the UK FCA's multi-firm review of implementing technology change: "CABs approved over 90% of the major changes they reviewed, and in some firms the CAB had not rejected a single change during 2019. This raises questions over the effectiveness of CABs as an assurance mechanism." A 100% approval rate is definitionally not a control. See patterns/cab-approval-gate.

  4. External approvals fail the DORA correlation test, on all four axes. The Accelerate / State of DevOps research (concepts/dora-metrics) finding, quoted verbatim: "We found that external approvals were negatively correlated with lead time, deployment frequency, and restore time, and had no correlation with change fail rate. In short, approval by an external body (such as a change manager or CAB) simply doesn't work to increase the stability of production systems, measured by the time to restore service and change fail rate. However, it certainly slows things down. It is, in fact, worse than having no change approval process at all." (Emphasis added.) This is the load-bearing empirical claim of the entire post.

  5. Change management reduces undocumented change, not change risk. The subtle but critical distinction: "Change management gathers documentation of process conformance, but it doesn't reduce risk in the way that you'd think. It reduces the risk of undocumented changes, but risks in changes that are fully documented can sail through the approval process unnoticed." Most CAB-reviewed changes are rubber-stamped because the reviewers lack the technical depth to evaluate specific production risk in a 30-minute meeting — so the documentation exists but the risk review is performative.

  6. Frequent, small releases correlate with fewer incidents. UK FCA follow-on finding: "firms that deployed smaller, more frequent releases had higher change success rates than those with longer release cycles. Firms that made effective use of agile delivery methodologies were also less likely to experience a change incident." This is the positive half of the DORA story — reducing per-change blast radius is the actual risk-reduction lever. See patterns/small-frequent-releases-for-risk-reduction.
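The blast-radius claim can be made concrete with a toy model (my illustration, not from the post or the FCA data): assume each changed unit independently fails in production with a small probability, and a failed release drags its entire batch into diagnosis and rollback. The function name and parameters below are hypothetical.

```python
# Toy model of release batch size vs incident blast radius.
# Assumption (mine, for illustration): each changed unit fails
# independently with probability p; a failed release couples its whole
# batch into the rollback/diagnosis surface.
def expected_incident_surface(n_units: int, batch_size: int, p: float) -> float:
    """Expected number of units caught up in failed releases."""
    batches = n_units / batch_size
    p_batch_fails = 1 - (1 - p) ** batch_size
    return batches * p_batch_fails * batch_size

# 100 units of change, 1% per-unit failure rate:
big = expected_incident_surface(100, 100, 0.01)   # one big-bang release
small = expected_incident_surface(100, 1, 0.01)   # 100 tiny releases
# The big-bang release entangles ~63 units in the expected incident;
# the tiny releases entangle ~1.
```

The model deliberately ignores fixed per-release cost; its only point is the one the FCA data gestures at: holding total change constant, smaller batches shrink what any single failure can touch.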

  7. The stream-into-lake metaphor: gates on inflow don't replace monitoring of the stored state. "We can think of software changes as streams, feeding into our environments which are lakes. Change management puts a gate in the stream to control what flows into the lake, but doesn't monitor the lake. If it is possible to make a change to production without detection, then change management only protects one source of risk. The only way to be sure you don't have undocumented production changes is with runtime monitoring." See patterns/runtime-change-detection: an undocumented production change (concepts/undocumented-production-change) can only be caught by continuously diffing production state against an authoritative change record.
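The "monitor the lake" prescription reduces to a drift check. A minimal sketch, with a hypothetical data model (not Kosli's product or API): hash what is actually running and diff it against the set of approved change records.

```python
from hashlib import sha256

def digest(artifact: bytes) -> str:
    """Content digest standing in for an image/binary fingerprint."""
    return sha256(artifact).hexdigest()

# Authoritative change record: digests of every approved deployment.
# (Hypothetical data; real systems would use image digests, infra state.)
approved = {digest(b"service-a v1.4.2"), digest(b"service-b v2.0.1")}

def detect_drift(running: list[bytes]) -> list[str]:
    """Digests observed in production that have no approved record."""
    return [digest(a) for a in running if digest(a) not in approved]

# An unapproved change surfaces as drift even though no gate ever saw it.
observed = [b"service-a v1.4.2", b"service-b v2.0.2"]  # v2.0.2 undocumented
drift = detect_drift(observed)
```

The design point is that the comparison runs continuously against the lake (production state), not once at the gate: the approved-change set is the paperwork, and drift is everything the paperwork cannot explain.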

  8. Knight Capital is the cross-reference — same failure shape, pre-DevOps era. The Knight Capital 2012 incident (SEC filing) is invoked as the canonical prior art: incomplete understanding of which code was actually running in production, caused by a partial deployment that left old + new code paths coexisting on different SMARS servers, drove a $460M trading loss in 45 minutes. "In both cases, an incomplete understanding of how changes have been applied to production systems due to insufficient observability and traceability prolonged and amplified the scale of the outages." See systems/knight-capital-smars.

  9. Financial services legacy is a systemic risk, not an excuse. The author's uncomfortable point: the reason banks can't adopt continuous delivery / DevSecOps isn't lack of will — it's "legacy systems and outsourcing where implementing these practices is technically challenging and uneconomic." And that, the author argues, is a regulator problem: "Maybe it is time we acknowledge legacy software, risk management, and outsourcing are a major systemic risk in the financial sector?" See concepts/legacy-system-risk.

  10. Paperwork ≠ risk reduction; smaller changes + monitoring = risk reduction. The closing synthesis: "paperwork doesn't reduce risk. Less risky changes reduce risk." The operational prescription: smaller, more frequent changes + automated change-control and documentation + runtime monitoring and alerting to detect unauthorized changes — i.e., a DevSecOps approach that harmonises speed of delivery with audit/compliance demands. This is the same pattern surfaced by more recent incidents in this corpus: see e.g. GitHub's "When protections outlive their purpose" for the observation that ossified protection layers can themselves become incident-causing latent misconfigurations.

Numbers disclosed

  • Swedbank outage (2022-04): ~1M customers with incorrect balances; outage lasted ~1 business day ("temporarily left nearly a million customers… many of whom were unable to meet payments").
  • Fine: SEK 850M ≈ USD 85M. Sanction scale context: full toolkit included revoking the banking licence.
  • UK FCA study scope: ~1M production changes analysed across a multi-firm review.
  • CAB approval rate: >90% of major changes approved; some firms had 0% rejection across all of 2019.
  • Accelerate / DORA: external approvals negatively correlated with 3 of 4 DORA metrics (lead time, deployment frequency, restore time), uncorrelated with change-fail rate. (concepts/dora-metrics.)
  • Knight Capital (2012, referenced): $460M loss in 45 minutes (from the SEC filing, not restated in the post verbatim but implied by the cross-reference).

Numbers NOT disclosed

  • No technical details of the Swedbank change: what system, what kind of change (schema migration, config push, code deploy), what the pre-change vs post-change behavior was, what the balance-corruption mechanism was. The regulator's judgment "doesn't describe the technical details behind the incident."
  • No MTTR breakdown: how long detection took, how long root-cause analysis took, how long rollback took. The regulator notes only that non-compliance "probably… resulted in a slower analysis of the incident and a greater impact on the operations."
  • No "how many similar changes have been made that didn't cause an outage" — the author explicitly names this as the open question ("without monitoring it is really hard to know").
  • No comparative data on change-fail rate for firms using agile delivery vs firms using CABs — the FCA paraphrase is qualitative.
  • No specific Accelerate effect-size numbers (β coefficients, sample sizes) — only the direction of the correlations.

Caveats

  • The author has a commercial interest. The post is authored by a Kosli engineer; Kosli sells a change-tracking and runtime-monitoring compliance product. The conclusion — "you need runtime monitoring of production state" — is also a sales pitch. The underlying FCA and Accelerate data is not Kosli's, but the framing is self-interested. Read alongside the primary sources: Finansinspektionen's ruling, the UK FCA review, and the Accelerate book directly.
  • "External approvals don't work" is a DORA finding, not a universal law. The DORA sample is skewed toward tech-forward organizations; the result may not generalize to highly regulated settings where a CAB is one of multiple overlapping controls (separation of duties, SOX, PCI-DSS). The FCA's own "smaller, more frequent releases had higher change success rates" finding is in regulated UK financial services, which does support the generalization — but the post doesn't grapple with the counter-case.
  • The Knight Capital analogy is imperfect. Knight was a deployment consistency failure (old + new code paths running simultaneously across 8 SMARS servers); Swedbank was a process bypass failure (change not documented or approved). Both point at insufficient observability of the actual production state, but the root causes are different shapes.
