Zalando Postmortem Analysis Pipeline¶
What it is¶
An internal LLM-powered analytics pipeline Zalando's datastore team built to mine "thousands of archived postmortem documents" for recurring failure patterns and strategic investment opportunities across their five core datastore technologies: Postgres, AWS DynamoDB, AWS S3, AWS ElastiCache, and Elasticsearch (Source: sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis).
The pipeline is the canonical wiki instance of patterns/multi-stage-llm-pipeline-over-large-context applied to an SRE corpus. It is also Zalando's first publicly disclosed production LLM system targeting operational decision support rather than customer-facing content generation, distinct from the Content Creation Copilot (catalog attribute extraction).
Architecture¶
Five stages arranged in a map-fold shape — map (per-document extraction) × fold (cross-document aggregation); a control-structure sketch follows the list:
- Summarization. Per-postmortem condensation into five core fields — Issue Summary / Root Causes / Impact / Resolution / Preventive Actions. Uses TELeR (Turn, Expression, Level of Details, Role) prompt engineering with strict constraints: "no guessing, no assumptions, and no speculative content." When the postmortem is ambiguous, "the summary explicitly states that" — structural refusal-when-uncertain over confabulation.
- Classification. Per-summary bucketing against the Zalando Tech Radar technology list. Prompt returns "only the name of technologies with a confirmed direct connection or 'None' if there is no such link." Hardened with negative examples specifically targeting the surface attribution error failure mode — the classifier must not tag a technology just because it appears in the document's surface text.
- Analyzer. Per-summary 3–5-sentence digest — "(a) the root cause or fault condition involving the technology; (b) the role it played in the overall failure scenario; (c) any contributing factors or interactions that amplify the issue." The digest is the interpretive stage — where the pipeline produces the reusable artefact, not just a compressed reformulation. Designed for 30–60-second human read.
- Patterns. The fold stage: "feeding the entire set of incident digests into LLM within a single prompt" and prompting for a one-pager list of recurring themes. "Explicitly prohibiting inference, redundancy, or the inclusion of any information not grounded in the source data." Uses the full set of digests because their bounded size (≤ 5 sentences each) lets them fit in context even at "thousands" scale.
- Opportunity. Cross-references patterns against the original postmortem corpus to produce investment proposals. Least-described stage in the post — inferred to be a human-authored strategic narrative layer rather than a pure LLM output, given that the disclosed outputs read as executive-facing ("25% subsequent datastore incidents shielded", "80% CPU utilization causing elevated latency at peak traffic").
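A minimal sketch of that control structure, assuming a generic chat-completion client. Every function name, prompt wording, and field below is illustrative, not Zalando's disclosed implementation:

```python
# Minimal map-fold skeleton (illustrative; prompts and names are assumptions).
from dataclasses import dataclass


def complete(prompt: str) -> str:
    """Stub for any chat-completion client (LM Studio, AWS Bedrock, ...)."""
    raise NotImplementedError


@dataclass
class Digest:
    postmortem_id: str
    technology: str
    text: str  # the 3-5 sentence Analyzer output


def summarize(postmortem: str) -> str:
    # Map stage 1: condense into the five core fields; refusal over speculation.
    return complete(
        "Summarize into: Issue Summary, Root Causes, Impact, Resolution, "
        "Preventive Actions. No guessing, no assumptions, no speculative "
        "content; if the document is ambiguous, state that explicitly.\n\n"
        + postmortem
    )


def classify(summary: str) -> list[str]:
    # Map stage 2: bucket against the Tech Radar list; 'None' when there is
    # no confirmed direct connection.
    answer = complete(
        "Return only the names of technologies with a confirmed direct "
        "connection to this failure (Postgres, DynamoDB, S3, ElastiCache, "
        "Elasticsearch), or 'None'.\n\n" + summary
    ).strip()
    return [] if answer == "None" else [t.strip() for t in answer.split(",")]


def analyze(summary: str, tech: str) -> str:
    # Map stage 3: the interpretive 3-5 sentence digest per technology.
    return complete(
        f"In 3-5 sentences, describe (a) the root cause involving {tech}, "
        "(b) its role in the failure scenario, (c) contributing factors "
        "that amplified the issue.\n\n" + summary
    )


def run(postmortems: dict[str, str]) -> str:
    digests: list[Digest] = []
    for pm_id, doc in postmortems.items():  # map over documents
        summary = summarize(doc)
        for tech in classify(summary):
            digests.append(Digest(pm_id, tech, analyze(summary, tech)))
    # Fold: the bounded digest size (<= 5 sentences each) is what lets the
    # entire set fit into a single prompt even at "thousands" scale.
    return complete(
        "List recurring failure themes as a one-pager. No inference, no "
        "redundancy, nothing not grounded in the source data.\n\n"
        + "\n\n".join(d.text for d in digests)
    )
```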
Per Zalando's framing:
"the pipeline sifts through high-entropy information and distill[s] it into concise reasons for failure." (Source: sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis)
Model evolution¶
Three-generation journey, explicitly documented:
- Generation 0: NotebookLM wrapper. Google NotebookLM run over the corpus, producing document summaries. "Boosted productivity three times" — per-document reading time from 15–20 minutes to ~5 minutes — but "sifting through summaries takes weeks for a dedicated team of experts, still not allowing us to answer questions quickly. We have also observed severe hallucinations and loss of the incident context by LLM while producing summaries." The lost-in-the-middle diagnosis comes from here.
- Generation 1: Small models, on-prem in LM Studio. Open-source models 3B → 12B → 27B parameters, hosted in LM Studio. Hallucination rate started at "up to 40% probability" — "anecdotally, small models fabricated a plausible summary regarding a non-existent DynamoDB incident, solely because DynamoDB was mentioned in the title of a playbook linked to the postmortem." Prompt hardening + 100% human curation reduced hallucination to < 15%. Per-document latency: 27B alone consumed 90–120 s (blocking further stages); the multi-model chain (3B + 12B + 27B) dropped to ~20 s classify + 60 s analyse, enabling "processing of annual data analysis in under 24 hours" (see the back-of-envelope note after this list).
- Generation 2: Claude Sonnet 4 on AWS Bedrock. Current production iteration. Hallucinations "negligible." Per-postmortem processing: ~30 s. What drove the Gen-0 → Gen-2 transition was compliance, not capability: "Postmortem document[s] contain PII data of on-call responders, companies business metrics, GMV losses, etc. The legal alignment was a pre-condition before using cloud hosted LLMs (e.g. AWS Bedrock)." The canonical datum that cloud LLM adoption for regulated text is a legal-review milestone, not a latency / quality one.
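Back-of-envelope on the Gen-1 throughput claim, an inference from the disclosed per-stage latencies rather than a published calculation:

```python
# Gen-1 multi-model chain: ~20 s classify + ~60 s analyze per postmortem.
seconds_per_doc = 20 + 60                       # ~80 s end-to-end
docs_per_day = 24 * 60 * 60 // seconds_per_doc
print(docs_per_day)                             # -> 1080 sequential docs/day
```

That makes the "under 24 hours" annual-run claim plausible for a corpus on the order of a thousand postmortems per year, or for "thousands" with modest parallelism across documents.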
One residual failure mode persists across all generations: surface attribution error at "approximately 10% attribution, even with advanced models such as Claude Sonnet 4."
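What the Classification stage's negative-example hardening against this failure mode might look like, modeled on the disclosed DynamoDB anecdote; the wording below is an assumption, not Zalando's actual prompt:

```python
# Hypothetical negative-example block appended to the classification
# prompt (illustrative only; not a verbatim Zalando prompt).
NEGATIVE_EXAMPLES = """\
Do NOT attribute a technology merely because its name appears somewhere
in the document:
- WRONG: tagging DynamoDB because "DynamoDB" appears in the title of a
  playbook linked from the postmortem, when the incident itself never
  involved DynamoDB.
Attribute a technology only when the summary establishes a confirmed
direct causal role for it in the incident.
"""
```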
Human curation glide path¶
Explicit rate-over-time schedule:
- Development phase: 100% human curation. Every output batch labelled upvote / downvote. Feedback loop refined prompts and selected per-stage models.
- Maturity phase: 10–20% random-sample curation. "As the system matured, we relaxed human curation to 10-20% of randomly sampled summaries from each output batch. We are still using human expertise to proofread the final report applying editorial changes to summary and incident patterns."
This is the canonical wiki instance of patterns/human-in-the-loop-quality-sampling with explicit curation-rate decay as the system's trust accumulates — not a static 100% nor a static low-rate sampling discipline. Human proofreading of the Patterns-stage output (the one-pager report) remains a non-negotiable gate even at maturity, because the curator still catches residual hallucinations and "novel failure modes that AI may have overlooked."
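A minimal sketch of the maturity-phase gate, assuming batch outputs are reviewable strings and a binary upvote/downvote verdict; the function names and the 15% midpoint are assumptions:

```python
import random

CURATION_RATE = 0.15  # assumed midpoint of the disclosed 10-20% band


def human_upvote(item: str) -> bool:
    """Stub for the expert reviewer's upvote/downvote."""
    raise NotImplementedError


def curate_batch(batch: list[str], rate: float = CURATION_RATE) -> bool:
    # Randomly sample 10-20% of each output batch for human review.
    k = max(1, round(len(batch) * rate))
    sampled = random.sample(batch, k)
    # A downvote sends the batch back into prompt refinement rather than
    # forward to the fold stage; the final one-pager is always proofread
    # regardless of the sampling outcome.
    return all(human_upvote(item) for item in sampled)
```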
Outcomes¶
Two-year operating disclosure, verbatim:
"AWS S3 incidents: consistently tied to misconfigurations in the deployment artifacts preventing applications from accessing S3 buckets, often due to manual errors or untested changes. This insight directly led to the solution for automated change validation for infrastructure as code which is able to shield us from 25% subsequent datastore incidents, demonstrating a clear return on investment."
"AWS ElastiCache incidents: a consistent trend of 80% CPU utilization causing elevated latency at peak traffic. This AI-driven insight led us developing a strategic direction about capacity planning, instance type selection and traffic management for AWS ElastiCache." (Source: sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis)
Plus a cross-datastore taxonomy of recurring incident clusters (configuration & deployment; capacity & scaling; manual change management; missing progressive delivery; connection-pool / circuit-breaker "hidden hotspots") and two specific technological failure disclosures that previously lived only inside postmortems:
- Postgres 12 AUTOVACUUM LAUNCHER race condition terminating all pool connections (see systems/pgjdbc-postgres-jdbc-driver sibling incident class).
- Postgres 16→17 upgrade triggering a logical replication memory-leak bug when DDL runs in parallel with many transactions.
What it's not¶
- Not a real-time on-call tool. Pipeline is offline / batch — processes annual corpora. No low-latency API.
- Not a replacement for per-incident postmortems. The pipeline consumes postmortems, it doesn't write them.
- Not an agentic system. "The initial concept of a no-code agentic solution was quickly deemed unfeasible due to performance limitations, inaccuracies, and hallucinations encountered during prototype development." Each stage is a deterministic prompt with a single objective — the pipeline is the control structure, not an agent loop.
- Not a replacement for numerical incident metrics. "Reliable accuracy in extracting numerical data, such as GMV or EBIT loss, affected customers, and repair time, from postmortems was not achieved." Zalando falls back to its internal incident dataset (a structured database) for numerics — a hybrid narrative-plus-structured-database discipline that avoids stretching LLMs into numeric-extraction reliability territory (see the sketch after this list).
- Not fine-tuned. All stages use base models via prompting + curation. Fine-tuning is named as future work specifically to handle Zalando-internal technologies (named example: Skipper) where the base model's public-corpus training produces "unacceptable analysis."
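A sketch of that hybrid discipline under stated assumptions: narrative fields come from the LLM digest, while numeric fields are joined in from the structured incident database rather than extracted by the model. The schema and column names are hypothetical:

```python
import sqlite3


# Hypothetical join of an LLM narrative digest with the structured
# incident database for numerics; the schema is illustrative, not Zalando's.
def enrich_digest(digest: dict, db: sqlite3.Connection) -> dict:
    row = db.execute(
        "SELECT gmv_loss_eur, repair_time_min, affected_customers "
        "FROM incidents WHERE incident_id = ?",
        (digest["incident_id"],),
    ).fetchone()
    if row is not None:
        # Numerics come from the system of record, never from the LLM.
        digest["metrics"] = dict(
            zip(("gmv_loss_eur", "repair_time_min", "affected_customers"), row)
        )
    return digest
```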
Seen in¶
- sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis — canonical disclosure.
Related¶
- systems/amazon-bedrock — the hosting platform chosen for the current Claude Sonnet 4 generation on compliance grounds.
- systems/notebooklm — the Gen-0 tool demonstrating the lost-in-the-middle limit.
- systems/lm-studio — the Gen-1 on-prem hosting.
- systems/postgresql, systems/dynamodb, systems/aws-s3, systems/elasticsearch — four of the five datastore technologies under analysis.
- concepts/map-fold-llm-pipeline — the functional architecture primitive.
- concepts/lost-in-the-middle-effect — the failure mode that motivated the multi-stage design.
- concepts/surface-attribution-error — the residual ~10% failure mode at Claude Sonnet 4 tier.
- concepts/teler-prompt-framework — the prompt-engineering technique at Summarization.
- concepts/llm-hallucination — the cross-stage failure mode with disclosed rate evolution.
- concepts/postmortem-as-data-goldmine — the corpus-reframing motivation.
- patterns/multi-stage-llm-pipeline-over-large-context — canonical wiki pattern.
- patterns/negative-example-prompting — the per-stage prompt-hardening technique.
- patterns/human-in-the-loop-quality-sampling — the 100% → 10–20% curation-rate glide path.
- companies/zalando — the originating company.