
Postmortem as data goldmine

Definition

Postmortem as data goldmine is the strategic reframing, canonicalised in Zalando's 2025-09-24 datastore-team post on its postmortem-analysis pipeline, that an organization's archived postmortem corpus is an under-exploited decision-support asset rather than a mere historical record. Postmortems are evidence-grade narrative records of what actually failed in production, at an altitude that a structured incident-database row cannot capture (Source: sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis).

The Zalando framing, verbatim:

"We adopted LLMs as an intelligent SRE assistant to analyze thousands of postmortems, transforming them from 'dead ends' into 'data goldmines.'"

"Your incidents hold the blueprint to your most strategic infrastructure wins — if you are listening correctly."

Why postmortems are under-exploited

A Google-SRE-book-shaped postmortem culture produces a corpus with unusual properties:

  • Human-authored narrative, not just structured fields. The why of the failure is in the text.
  • Cross-team review. Postmortems pass through multiple teams (responsible team + stakeholders + adjacent teams + engineering leadership sign-off).
  • Unusually honest. Blameless culture, shared-learning framing — the incentive is to document real contributing factors, not defend against attribution.

But the same properties that make postmortems valuable make them hard to mine:

  • Varying depth and clarity. "Postmortems vary widely in depth and clarity. Comparing them and extracting patterns is often perceived as apples-to-oranges."
  • Team-local assumptions. "Root cause analyses reflect team assumptions, subtle contributing factors often go unspoken."
  • Cognitive-load wall at scale. "It takes about 15-20 minutes to thoughtfully read a single postmortem (a dedicated reviewer can process maybe four postmortems per hour)." Multiplied by thousands of postmortems, strategic cross-incident questions become "impossible to answer quickly, or without excessive cognitive load."

Zalando's framing of the cost:

"When your learning about site reliability depends exclusively on human effort, scale becomes the enemy."
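The arithmetic behind that quote can be made concrete. In the sketch below, the 2,000-document corpus and 8-hour reading day are illustrative assumptions (not Zalando figures); 17.5 min is the midpoint of the quoted 15–20 min, and ~30 s/document is the LLM summarization rate Zalando reports:

```python
# Back-of-envelope reading-load arithmetic. Assumptions: a 2,000-postmortem
# corpus and an 8-hour reading day (illustrative, not Zalando's numbers).
corpus_size = 2_000
read_minutes_per_postmortem = 17.5   # midpoint of the quoted 15-20 min
human_hours = corpus_size * read_minutes_per_postmortem / 60
human_days = human_hours / 8
llm_hours = corpus_size * 30 / 3600  # ~30 s per document in the map stage
print(f"human: ~{human_days:.0f} working days, LLM map stage: ~{llm_hours:.1f} hours")
# → human: ~73 working days, LLM map stage: ~16.7 hours
```

Roughly a quarter of a reviewer-year versus an overnight batch run, which is the sense in which "scale becomes the enemy" only for the human-effort path.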

The consequences of not mining the corpus are structural:

  • Missed systemic signals that could inform infrastructure investment.
  • Reacting to symptoms instead of addressing root causes.
  • Delayed decisions due to insufficient insights across domains.

Why LLMs change the economics

Text at corpus scale is an LLM-native workload:

  • Read velocity. A per-document LLM summarization stage operating at ~30 s per postmortem lets an annual corpus be processed in hours instead of weeks.
  • Cross-document aggregation. The map-fold shape lets recurring patterns surface without any human having to hold the entire corpus in their head.
  • Pattern → investment. The Patterns + Opportunity stages of Zalando's pipeline turn the raw corpus into a prioritised list of preventable-incident-clusters the organization can invest to remove.
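A minimal sketch of that map-fold shape, with a keyword matcher standing in for the per-document LLM summarization call (the function names and factor tags are hypothetical, not Zalando's taxonomy):

```python
from collections import Counter

# Map-fold sketch. extract_factors is a keyword stand-in for the real
# per-document LLM call; KNOWN_FACTORS is an invented illustrative taxonomy.
KNOWN_FACTORS = ["connection-pool-exhaustion", "cache-cpu-saturation", "config-drift"]

def extract_factors(postmortem_text: str) -> list[str]:
    """Map stage: one pass per postmortem (an LLM call in the real pipeline)."""
    return [f for f in KNOWN_FACTORS if f in postmortem_text]

def fold_patterns(corpus: list[str]) -> Counter:
    """Fold stage: aggregate per-document factors into corpus-wide counts."""
    counts = Counter()
    for doc in corpus:
        counts.update(extract_factors(doc))
    return counts

corpus = [
    "Traffic spike led to cache-cpu-saturation on the promo cluster.",
    "Root cause: config-drift between staging and production.",
    "Repeat cache-cpu-saturation event near the CPU ceiling.",
]
print(fold_patterns(corpus).most_common(1))
# → [('cache-cpu-saturation', 2)]
```

The fold's `most_common` output is the raw material for the Patterns stage; a real pipeline would rank clusters by incident cost, not just count.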

What this is not

  • Not real-time on-call guidance. Postmortem mining is strategic-altitude analysis: it answers "what should we invest in next year?", not "what's happening right now?". The output feeds capacity-planning and architecture-review conversations, not alert-response runbooks.
  • Not a replacement for the per-incident postmortem. The pipeline consumes postmortems; it does not write them. Blameless per-incident RCA remains the primary reliability practice. The goldmine only exists because the per-incident discipline exists.
  • Not a numerical-metrics substitute. Zalando disclose they could not reliably extract GMV, EBIT, customer-count, or repair-time numbers from postmortems with LLMs; these are routed back to a structured incident database. The goldmine is narrative / causal, not quantitative.
  • Not sufficient for strategic decisions without human curation. The Opportunity stage's outputs read as human-authored narratives built on top of LLM-generated patterns. Mining + humans, not mining alone.

Prerequisites

For the goldmine reframing to pay off, an organization needs:

  • An archived postmortem corpus. Measured in at least hundreds of postmortems; ideally thousands. Under ~200 postmortems, the investment in a pipeline may not be justified vs. human-led reviews.
  • Postmortem discipline applied consistently. Blameless culture + consistent field structure + leadership sign-off. Without discipline the corpus is too noisy.
  • LLM platform clearance. Postmortems contain PII of on-call responders, business metrics (GMV, EBIT), partner impact — legal / compliance review is a hard prerequisite before uploading to cloud LLMs. Zalando disclose this as the driver of their on-prem Gen-1 → cloud Gen-2 migration timeline.
  • Human curation bandwidth. Even at maturity, the Patterns-stage output is proofread by domain experts. Without the human capacity to curate, the pipeline's output quality degrades because residual surface-attribution errors propagate unchecked.
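On the LLM-platform-clearance point, a minimal sketch of what a pre-upload scrubbing pass might look like. The regex patterns and placeholder tokens are illustrative assumptions, not Zalando's compliance tooling, and a real pass would need a far richer ruleset (responder names, ticket IDs, hostnames):

```python
import re

# Sketch of a pre-upload scrubbing pass: strip obvious PII and business
# figures before any postmortem text reaches a cloud LLM. Patterns and
# placeholder tokens are assumptions for illustration only.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.\w+"), "<EMAIL>"),             # responder emails
    (re.compile(r"(?i)\b(GMV|EBIT)\b[^\n]*"), r"\1 <REDACTED>"),  # business figures
]

def scrub(postmortem_text: str) -> str:
    """Apply each redaction pattern in turn; returns the scrubbed text."""
    for pattern, replacement in REDACTIONS:
        postmortem_text = pattern.sub(replacement, postmortem_text)
    return postmortem_text

print(scrub("Paged: alice@example.com. GMV impact: 1.2M EUR."))
# → Paged: <EMAIL>. GMV <REDACTED>
```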

Seen in

  • sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis (canonical wiki instance). Zalando's datastore team names the reframing in the post title ("Dead Ends or Data Goldmines?"), quantifies the cognitive-load wall (15–20 min / postmortem, 4 postmortems / hour upper bound), catalogues the three strategic costs of not mining the corpus, and demonstrates the pay-off (25% prevent-rate on follow-up S3 incidents, 80% ElastiCache CPU ceiling hotspot surfaced) on a two-year operating horizon.