ZALANDO 2025-09-24

Zalando — Dead Ends or Data Goldmines? Investment Insights from Two Years of AI-Powered Postmortem Analysis

Summary

Zalando's SRE-adjacent datastore team built a multi-stage LLM pipeline to mine thousands of archived postmortems for recurring failure patterns across their five Postgres / DynamoDB / S3 / AWS ElastiCache / Elasticsearch datastore tiers. The canonical contribution is a named architectural trade-off: instead of stuffing a large-context frontier model with thousands of documents, they chain narrow single-objective LLM stages in a map-fold shape — Summarization → Classification → Analyzer → Patterns → Opportunity — to dodge the "lost in the middle" effect, cap per-document latency, and keep every stage's input and output human-readable for curation. Two years of running the pipeline surfaced a cross-datastore finding: configuration & deployment plus capacity & scaling dominate datastore incidents; that finding drove a concrete investment, automated change validation for infrastructure-as-code that "shield[ed] us from 25% subsequent datastore incidents."

Key takeaways

  • Postmortems as an under-exploited corpus. Zalando has "thousands of archived postmortem documents" accumulated from their Google-SRE-book-shaped (systems/google-sre-book) postmortem culture. At ~15–20 minutes per human read, this corpus is effectively unreadable by any individual — "strategic questions like 'Why datastores fail most frequently at scale?' become impossible to answer quickly." The canonical reframing is postmortems as a data goldmine: text-at-scale is an LLM-native workload.
  • NotebookLM gave a 3× speedup but not enough. Their first iteration wrapped Google's NotebookLM around the corpus and "boosted productivity three times" — but "sifting through summaries takes weeks for a dedicated team of experts, still not allowing us to answer questions quickly", and the large-context shape produced "severe hallucinations and loss of the incident context." This failure mode caused the pivot to a multi-stage pipeline architecture.
  • Multi-stage pipeline over single large-context prompt. Core architectural trade-off, stated verbatim: "we designed a multi-stage LLM pipeline instead of using high-end LLMs with large context windows. It is a deliberate design trade-off aimed at simplicity and reliability." Motivation: the "lost in the middle" effect ("details in the middle of long inputs are often overlooked or distorted"), plus latency, memory, and cost pressure. Canonicalised as patterns/multi-stage-llm-pipeline-over-large-context (Source: sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis).
  • Five stages, each strictly single-objective. The pipeline is: (1) Summarization — condenses each postmortem into five core fields (Issue Summary / Root Causes / Impact / Resolution / Preventive Actions). Uses TELeR prompt engineering with strict constraints: "no guessing, no assumptions, and no speculative content." (2) Classification — sorts summaries into technology-specific buckets. Returns "only the name of technologies with a confirmed direct connection or 'None' if there is no such link." (3) Analyzer — produces a 3–5-sentence digest "that highlights (a) the root cause or fault condition involving the technology; (b) the role it played in the overall failure scenario; (c) any contributing factors or interactions that amplify the issue." (4) Patterns — feeds "the entire set of incident digests into LLM within a single prompt" to surface recurring failure themes. (5) Opportunity — cross-references patterns against the postmortem corpus to produce investment proposals. The pipeline is described as a "map-fold" functional pattern: "a key building block for the pipeline" — map over documents for extraction, fold via LLM or deterministic function for aggregation. See concepts/map-fold-llm-pipeline; a minimal code sketch of this shape follows the list below.
  • Three dominant obstacles: hallucination, surface attribution error, latency. "Within each of these environments, Hallucination, Surface Attribution Error and Latency are three key obstacles." Quantified:
  • Small models (3B–12B params) exhibited up to 40% hallucination at summary and analysis phases — "fabricated a plausible summary regarding a non-existent DynamoDB incident, solely because DynamoDB was mentioned in the title of a playbook linked to the postmortem."
  • After prompt hardening + human curation, the small-model hallucination rate dropped to <15%; transitioning to Claude Sonnet 4 on AWS Bedrock made hallucinations "negligible."
  • Surface Attribution Error persists even at frontier scale: "approximately 10% attribution, even with advanced models such as Claude Sonnet 4." Canonical example: the model confidently attributes a failure to S3 when "'S3' is merely mentioned without being causally linked."
  • Latency budget: "the overall document processing time should not exceed 120 seconds; otherwise, the processing of annual data becomes impractically long." The initial 27B-param open-source model alone consumed 90–120 s, leaving no budget for the remaining pipeline stages. The multi-stage (3B + 12B + 27B) design cut per-document classification to 20 s and analysis to 60 s, enabling "processing of annual data analysis in under 24 hours." Claude Sonnet 4 processes each postmortem in ~30 s.
  • Model evolution driven by compliance, not capability. Initial prototypes used open-source models in LM Studio (on-prem). Transition to Claude Sonnet 4 on AWS Bedrock was "primarily driven by compliance topics rather than technical necessity" — postmortems contain PII of on-call responders, GMV loss numbers, etc.; legal alignment was "a pre-condition before using cloud hosted LLMs." Canonical datum that LLM platform choice for regulated text is a legal-review gate, not a performance optimisation.
  • 100% human curation during development, 10–20% at maturity. "During the pipeline development, we conducted 100% human curation of output batches." Curation was upvote / downvote labelling; feedback refined prompts and selected models per stage. "As the system matured, we relaxed human curation to 10-20% of randomly sampled summaries from each output batch. We are still using human expertise to proofread the final report." Instance of periodic-sample HITL with an explicit curation-rate-over-time glide path.
  • Concrete investment outcome quantified. Cross-datastore analysis identified two recurring pattern clusters: AWS S3 incidents "consistently tied to misconfigurations in the deployment artifacts preventing applications from accessing S3 buckets" → automated change validation for infrastructure-as-code that "is able to shield us from 25% subsequent datastore incidents." AWS ElastiCache "consistent trend of 80% CPU utilization causing elevated latency at peak traffic" → strategic capacity-planning, instance-type selection, and traffic-management direction. These are load-bearing: the LLM pipeline's ROI is measured in prevented-incident percentage, not model quality score.
  • Two PostgreSQL-bug-driven incidents over 5 years. The post surfaces the actual failure tail of Zalando's datastore fleet: "incidents very rarely directly attributed to technological flaws" — but they disclose two: (a) AUTOVACUUM LAUNCHER race condition on Postgres 12 terminating all connections in the pool (a "known bug"); (b) Postgres 16→17 major-version upgrade triggered a logical replication bug ("memory leak" when DDL runs in parallel with a large number of transactions). Extends the wiki's existing axis 10 narrative on JDBC patching with the more-recent upstream-logical-replication bug class.
  • "Read between the lines" is an explicit goal. The closing framing — "Your incidents hold the blueprint to your most strategic infrastructure wins - if you are listening correctly" — positions the LLM pipeline as a strategic decision-support tool for engineering leadership, not a productivity optimiser for on-call.

Pipeline architecture (from the post's own table)

| Stage | Goal | Input | Output |
| --- | --- | --- | --- |
| Summarization | Reduce reviewer load by condensing postmortem narratives into few data points. | Postmortem corpus | Summary corpus |
| Classification | Enable technology-specific clustering across incidents. | Identity of technology buckets; summary corpus | N buckets, each containing the postmortem summaries relevant to a technology |
| Analyzer | Convert summaries into thematic failure fingerprints. | The bucket of summaries | The bucket of digests, each ≤ 5 sentences describing the tech's role |
| Patterns | Detect systemic issues over time. | The bucket of digests | One-pager: role of technology in all incidents + patterns |
| Opportunity | Convert patterns into investment recommendations. | Patterns of technology incidents; postmortem corpus | Investment opportunity |

(Source: sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis)
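
The post credits the Summarization stage's reliability to TELeR-style prompt engineering but does not reproduce the prompt. A plausible high-detail reconstruction, using only the five output fields and the no-speculation constraint quoted above (role, task framing, and all other wording are assumed):

```python
# Illustrative TELeR-style high-detail prompt for the Summarization stage.
# Only the five output fields and the no-speculation constraint are quoted
# from the post; everything else is assumed wording.
SUMMARIZATION_PROMPT = """\
Role: You are an SRE analyst condensing incident postmortems.

Task: Summarize the postmortem below into exactly five fields:
1. Issue Summary
2. Root Causes
3. Impact
4. Resolution
5. Preventive Actions

Constraints: Use only facts stated in the document. No guessing, no
assumptions, and no speculative content. If a field is not covered,
write "Not stated".

Postmortem:
{postmortem}
"""
```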

Recurring failure patterns surfaced (from 2 years of data)

  • Absence of automated change validation for config and infrastructure-as-code; poor visibility into changes and their effects (an illustrative validation gate is sketched after this list).
  • Inconsistent / ad-hoc change management practices including manual intervention.
  • Absence of progressive delivery for datastores (canary / blue-green).
  • Underestimating traffic patterns; failing to scale ahead of demand or delayed auto-scale responses.
  • Bottlenecks due to memory, CPU, or IOPS constraints.
  • "Hidden hotspots" surfaced by the pipeline that were previously considered stable: improper connection pool configuration, missing circuit breakers leading to cascading failures.

Canonical example digest (LLM output, censored in the post):

"DynamoDB contributed to this incident as the affected data store, but was not the root cause of the failure. The root cause was a version incompatibility between an upgraded AWS SDK (2.30.20) and an older DynamoDB support module (2.17.279) that still depended on a class removed in the newer SDK version. This dependency mismatch caused all DynamoDB write operations to fail with a NoClassDefFoundError, which cascaded to affect multiple [SERVICES] that relied on DynamoDB for storing [DATA]. DynamoDB itself functioned normally — the issue was entirely due to the application's inability to properly connect to and interact with DynamoDB after the SDK upgrade."

The digest demonstrates the Analyzer stage's value: the real failure is application-side SDK-dependency drift, not a DynamoDB fault — precisely the class of attribution error the surface-attribution filter exists to get right. A hedged sketch of such a check follows.
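
The post's only named mitigations for surface attribution error are negative prompting and human curation, with no prompt shown. One way such a check could work is as a separate verification pass before a summary enters a technology bucket (prompt wording and the one-word verdict scheme are assumptions):

```python
# Illustrative attribution-verification pass: ask whether the technology is
# causally linked or merely mentioned. Prompt wording and the verdict scheme
# are assumptions, not Zalando's.

def call_llm(prompt: str) -> str:
    """Same hypothetical helper as in the pipeline sketch above."""
    raise NotImplementedError

ATTRIBUTION_CHECK = """\
Technology: {tech}

Incident summary:
{summary}

Was {tech} causally involved in this failure, or merely mentioned
(e.g. in a title, a linked playbook, or as a healthy dependency)?
Answer with exactly one word: CAUSAL, CONTRIBUTING, or MENTION_ONLY.
"""

def causally_linked(summary: str, tech: str) -> bool:
    verdict = call_llm(ATTRIBUTION_CHECK.format(tech=tech, summary=summary))
    return verdict.strip().upper() in {"CAUSAL", "CONTRIBUTING"}
```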

Caveats / gaps

  • No disclosed corpus size. "Thousands" of postmortems is the only quantity stated; exact N, growth rate, and per-year ingestion not disclosed.
  • No disclosed accuracy metric for the Patterns stage. Hallucination rates are disclosed for Summarization and Analyzer stages; the cross-incident Patterns stage's failure rate is not quantified — human proofreading is named as the final gate but the disagreement rate between LLM-generated and human-approved patterns isn't shown.
  • No disclosed per-stage model assignment. The post names "multiple models 3B, 12B and 27B" for the per-document stages historically, then "Claude Sonnet 4" overall for the current iteration, but the current per-stage split (which stage uses which model) is not documented.
  • Numerical-data extraction remains unsolved. "Reliable accuracy in extracting numerical data, such as GMV or EBIT loss, affected customers, and repair time, from postmortems was not achieved." Zalando routes around this by falling back to their internal incident dataset — "a trustworthy source of truth for opportunity analysis." Implies a hybrid pipeline: LLMs for narrative + structured incident database for numerics, not one-model-does-everything (see the sketch after this list).
  • Surface attribution error is measured but not fully solved. "approximately 10% attribution" persists at Claude Sonnet 4 — the 10% tail is disclosed but no mitigation roadmap beyond negative prompting is given.
  • Zalando-proprietary tech is systematically mis-analysed. "unacceptable analysis of incidents concerning Zalando internal technologies (e.g. Skipper)" — the pipeline only performs well on public technologies. Mitigation roadmap is "model fine-tuning", listed as future work rather than shipped.
  • No-code agentic solution was deemed unfeasible. "The initial concept of a no-code agentic solution was quickly deemed unfeasible due to performance limitations, inaccuracies, and hallucinations encountered during prototype development." Explicit rejection of the single-agent-does-everything frame in favour of the multi-stage human-inspectable pipeline.
  • Scope is datastores only. Pipeline ran against five datastore technologies on the Zalando [[concepts/tech-radar-language-governance|Tech Radar]]; whether it scales to application-layer postmortems, network incidents, or organizational postmortems isn't addressed.
  • Pipeline ownership model unspecified. "A group of colleagues looking after the datastores" built this — but whether it's now operated by a central SRE team, the datastore team, or a platform/AI team is not stated. Sustainability under ownership handoff is a known LLM-pipeline risk not addressed in the post.
  • The Opportunity stage is least described. The stage's input/output is listed in the table, but the post does not walk through an Opportunity-stage output the way it walks through Summarization, Classification, and Analyzer outputs. It is inferred to be a human-curated layer on top of the Patterns output, given that the ROI claims (25% prevented-incident rate, 80% CPU-ceiling hotspot) read as human-authored strategic narratives.
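
The numerics caveat above implies a two-source join: narrative fields from the LLM pipeline, impact numbers from the structured incident dataset. A minimal sketch of that join (the record schema and field names are assumptions; the post only calls the incident dataset "a trustworthy source of truth"):

```python
from dataclasses import dataclass

# Hypothetical record from the internal incident dataset, the "trustworthy
# source of truth" for numbers the LLM could not extract reliably.
# All field names are assumptions for illustration.
@dataclass
class IncidentRecord:
    incident_id: str
    gmv_loss_eur: float
    customers_affected: int
    repair_time_min: int

def opportunity_row(incident_id: str, llm_digest: str,
                    db: dict[str, IncidentRecord]) -> dict:
    # Narrative comes from the LLM pipeline; numerics come only from the
    # incident dataset, never from LLM extraction.
    rec = db[incident_id]
    return {
        "incident": incident_id,
        "digest": llm_digest,
        "gmv_loss_eur": rec.gmv_loss_eur,
        "customers_affected": rec.customers_affected,
        "repair_time_min": rec.repair_time_min,
    }
```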

Source

sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis