Zalando¶

What they do¶

Zalando SE is Europe's largest online fashion and lifestyle platform, headquartered in Berlin. Beyond commerce, Zalando is known in the systems community for heavy open-source contribution around Postgres-on-Kubernetes and for in-house experimentation platform work:

Postgres Operator — open-source Kubernetes operator for managing Postgres clusters with high availability, connection pooling, and backups.
Skipper — HTTP router and reverse-proxy written in Go, used as the default Kubernetes Ingress proxy across 140+ Zalando Kubernetes clusters (Source: ).
Kubernetes Ingress Controller for AWS — Zalando-incubator controller that provisions an AWS ALB + ACM cert for each Ingress.
Patroni — HA template for PostgreSQL (via Python + DCS), used industry-wide.
Spilo — Docker image for running Postgres with Patroni.
Octopus — in-house A/B testing / experimentation platform, released 2015; analysis system rebuilt on systems/apache-spark over ~2 years. Source of Zalando's crawl/walk/run experimentation- platform retrospective ().

Zalando Engineering (engineering.zalando.com) is a Tier-2 source on the sysdesign-wiki: consistent output on distributed systems, Kubernetes, Postgres internals, cloud platform engineering, and experimentation / A/B-testing infrastructure, though the blog mixes in recruiting and product-announcement posts.

Wiki anchor axes¶

Zalando has twenty-eight canonical axes on the wiki. Axis 6 is a four-post series (2021-03 UBFF + 2021-04 errors + 2022-02 persisted-queries/schema-stability + 2023-10 directive-taxonomy). Axis 7 is now a four-post series (2021-03 micro-frontends + 2023-06 context-based Experience + 2023-07 concurrent-React + 2025-10 Rendering Engine on React Native) — the 2023-06 Experience installment adds request- scoped presentation context as an orthogonal IF primitive: a named Experience bundle of policies + selection rules resolved once at root-entity time and propagated via the Rendering Engine request state to every child-renderer FSA call (patterns/request-state-propagated-experience). Axis 5 is a six-part narrative (2020-10 Cyber Week + 2021-03 load-test automation + 2021-09 SRE tracing I + 2021-09 SRE tracing II + 2021-10 SRE tracing III + 2022-04 operation-based SLOs). Axis 14 opens with the 2024-01 metadpata postmortem — canonicalising supertools and the 5-layer destructive-automation blast-radius containment stack. Axis 15 (new, 2024-07) opens the end-to-end test probe tier as a direct downstream chapter of axis 5 — browser-altitude synthetic monitoring of CBOs via Playwright on a 30-minute cron. Axis 16 (new, 2024-09) opens the AI-assisted content onboarding / catalog-attribute copilot tier — Zalando's Content Creation Copilot behind a four-service aggregator contract (Content Creation Tool + Article Masterdata + Prompt Generator + OpenAI VLM backend), migrating launch-phase GPT-4 Turbo to GPT-4o with no downstream contract changes. Axis 17 (new, 2025-02) opens the ingress control-plane scaling / coalescing proxy tier downstream of axis 1 and the 2020-06-30 ingress-stack launch — Zalando fronts ~300 Skippers per cluster with a single Route Server that polls the Kubernetes API every 3 s and serves all Skippers via HTTP + ETag / 304, turning an N×-fan-out overload on etcd into a 1× poll + N× cheap-304 delta channel. Canonical wiki instance of concepts/control-plane-fan-out-to-kubernetes-api and patterns/control-plane-proxy-with-etag-cache; rolled out via three-mode off/shadow/exec. Axis 26 (new, 2026-04) opens the Skipper admission- time validation tier as the second Skipper-specific control-plane tier after axis 17 — Zalando plugs Skipper's own filter registry, predicate specifications, route parser, and backend parser into a Kubernetes validating admission webhook (ingress-admitter.teapot.zalan.do) so that kubectl apply on an Ingress or RouteGroup with an unknown predicate, bad filter arguments, or unparseable backend is rejected with the same error Skipper would give at request time. The architectural move is to treat Skipper's validator as a library and call it from both the admission path and the request path, so the two answers cannot drift (patterns/reuse-runtime-logic-on-admission-path). Scale framing: 250+ clusters, 15k+ ingresses, ~200k routes, 500k– 2M RPS — at that scale even 1% invalid routes is ~2,000 broken routes and real production risk. Rolled out tier-by-tier behind -enable-advanced-validation (concepts/feature-flag-rollback-for-validator) guided by the skipper_route_invalid{route_id, reason} metric (concepts/invalid-route-observability-metric), such that teams writing valid manifests observed no change whatsoever — canonical invisible rollout. Zavodskikh's test for the rollout shape passed when, asked at an internal conference how to enable it, the answer was "you don't need to — it's already on." Blast-radius class is control-plane-on-writes, not data-plane-on-traffic — a bad webhook would freeze CI/CD fleet-wide while live customer requests kept flowing on the old routing tables (concepts/control-plane-change-blast-radius). Upstreamed to open-source Skipper v0.24.18+. Axis 18 (new, 2025-09) opens the AI-powered postmortem analysis / fleet-insight-mining tier as a lateral companion to axis 5's SRE discipline — Zalando's datastore team runs a five-stage LLM map-fold pipeline ( Postmortem Analysis Pipeline) over "thousands of archived postmortem documents" covering Postgres / DynamoDB / S3 / ElastiCache / Elasticsearch, producing cross-incident failure- pattern reports and "investment opportunity" proposals for engineering leadership. Canonical wiki instance of patterns/multi-stage-llm-pipeline-over-large-context and of postmortems as data goldmine; also canonicalises concepts/lost-in-the-middle-effect and concepts/surface-attribution-error as distinct LLM failure modes the multi-stage architecture was designed around. Quantified 2-year outcome disclosed: automated change validation for infrastructure-as-code "shielded" 25% of subsequent datastore incidents; AWS ElastiCache 80% CPU ceiling surfaced as a strategic capacity-planning hotspot. Current generation runs on Claude Sonnet 4 on AWS Bedrock — the Gen-0 (NotebookLM) → Gen-1 (LM Studio on-prem with 3B/12B/27B open-source models) → Gen-2 (Bedrock frontier tier) transition was driven by compliance / legal review, not capability. Human curation glide path: 100% of outputs during development → 10–20% random sampling at maturity, with proofreading of the final Patterns-stage one-pager remaining a non-negotiable gate.

Postgres-on-Kubernetes / kernel-level latency (opened 2020-06-23) — empirical kernel-level measurement (perf, softirq tracepoints, network namespaces) combined with operator-level deployment pragmatism.
Frontend platform evolution / micro-frontends → entity-based composition → concurrent React (opened 2021-03-10, extended 2023-07-10) — Part 1 of a series on Zalando's second-generation frontend platform Interface Framework (IF; designed 2018, ~90% traffic by March 2021), which replaces the 2015-era Project Mosaic Fragment-based micro-frontend architecture with an entity- based page-composition model: pages are request-time trees of typed Entities (Product, Collection, Outfit) chosen by personalisation, and [[patterns/entity-to- renderer-mapping|Renderers]] (one-per-Entity-type React components) are the contribution unit. The Rendering Engine (Node.js + browser runtime) walks the tree, applies declarative rendering rules, and composes the output. Cross-cutting concerns (monitoring, consent, A/B testing via Octopus, design system, bundle-size optimisation) move into the platform, and every PR is gated by Lighthouse CI + Bundle Size Limits + Web Vitals. The axis pairs tightly with axis 6 (the GraphQL BFF is IF's data aggregation layer). The 2023-07-10 Rendering Engine Tales post (sources/2023-07-10-zalando-rendering-engine-tales-road-to-concurrent-react) extends the axis with a React 18 / concurrent-rendering migration chapter: each Renderer becomes a <Suspense> boundary; renderToPipeableStream + hydrateRoot replace renderToNodeStream + hydrate (A/B-measured Fashion Store impact: INP −5.69 % / FID −8.81 % / LCP −2.43 % / FCP −0.23 % / bounce −0.24 %, Catalog page biggest wins at FID −17.11 %); Zalando deliberately rejects hook-based render-as-you-fetch in favour of an Application- State layer outside React where resolveEntity writes data into a central store and a Redux-useSelector-shaped Connector hook returns data or throws a Promise (four upstream blockers drove this: SuspenseList experimental, useTransition not nested-Suspense-aware, hook-initiated fetch timing coupled to render order, React data-streaming cache not final). The stricter React-18 hydration surfaced a production hydration-mismatch taxonomy across "hundreds of Renderers": timers (fix: suppressHydrationWarning), timezone-localised dates (fix: explicit timeZone or backend-localize), a Safari-de-AT-Intl.NumberFormat runtime divergence where the thousand-separator differs between Safari's JavaScriptCore and Node.js's V8 (application-unfixable; only backend-localize works), and invalid HTML nesting (<div> inside <p>, <button> inside <button> — React-18 mismatch). General escape hatch: patterns/mount-gated-client-only-rendering. Observability pattern: only forward the first onRecoverableError per session to Sentry (patterns/first-error-only-hydration-error-reporting) because post-first-error the fiber-vs-DOM list alignment cascades. Deferred to future posts: ordered-streaming/ hydration technical details, final-architecture Fashion Store impact, React Server Components.
Upstream JDBC-driver contribution for Postgres logical replication (opened 2023-11-08) — Zalando's 2023-11 post (sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver) pairs the SBOM-driven dependency-governance axis with the complementary ecosystem-contribution altitude: when Zalando's fleet-wide operational pain surfaces a runaway-WAL-growth failure mode in the Postgres logical-replication protocol as implemented by pgjdbc, Zalando diagnoses the root cause (pgjdbc not responding to Postgres KeepAlive messages that carry the server's current LSN), builds the pure fix (KeepAlive- message LSN advancement — ack the KeepAlive-reported LSN when all Replication messages are flushed), upstreams it as pgjdbc PR #2941 (merged 2023-08-31, shipped in pgjdbc 42.7.0), and rolls out a locally-built 42.6.1-patched backport to their fleet via a parallel prod-vs-test Docker image split (the patched image goes to test first, verified via a multi-day flat- WAL-size graph, then promoted to production). Canonical instance of patterns/client-driver-fix-over-application-workaround at the JDBC-driver altitude — complements the wiki's existing language-runtime upstream-the-fix instances (Meta jemalloc / WebRTC, Cloudflare V8 / Go, Netflix Java 21 virtual threads) with a database-driver-layer example. Load-bearing platform context: Zalando's low-code Postgres-sourced event-streaming platform runs "hundreds" of per-stream micro-applications embedding Debezium Engine (the embedded-library variant of Debezium, distinct from Kafka-Connect-hosted Debezium); the fleet-scale shape (many independent replication slots against shared-WAL Postgres primaries) is exactly the shape that exposes the WAL-pinning asymmetric-tables bug. First wiki canonical instances of pgjdbc (the JDBC-driver load-bearing every JVM Postgres client), of Debezium Engine (the embedded-library CDC shape distinct from Kafka Connect), of Zalando's event-streaming platform, and of the dummy-write heartbeat kludge (the industry-standard application- layer workaround the driver-level fix replaces). Two-year sequel (2025-12-18, sources/2025-12-18-zalando-contributing-to-debezium-fixing-logical-replication-at-scale): Debezium subsequently hard-disabled the pgjdbc keepalive flush via withAutomaticFlush(false) in PR #6472 because the feature conflicted with Debezium's own LSN management and broke the offset-store contract for most Debezium users — blocking Zalando's upgrade path. Zalando upstreamed two remediation contributions to Debezium 3.4.0.Final (released 2025-12-16): (a) lsn.flush.mode (DBZ-9641 / PR #6881) — three-mode enum (manual / connector default / connector_and_driver) making the pgjdbc flush opt-in; canonical instance of patterns/opt-in-driver-level-lsn-flush. (b) offset.mismatch.strategy (DBZ-9688 / PR #6948) — four-strategy enum (no_validation default / trust_offset / trust_slot / trust_greater_lsn) letting operators pick which position source is authoritative on startup mismatch; canonical instance of patterns/authoritative-slot-over-authoritative-offset. Load-bearing architectural insight the post canonicalises: logical-replication position is tracked in two independent locations (Debezium offset store + Postgres replication slot), and the right reconciliation strategy depends on operator-side invariants Debezium cannot know. Zalando trusts the slot because they run Patroni-managed Postgres with slot-survives-failover discipline since the mid-2010s plus MemoryOffsetBackingStore since 2018 — structurally opposite to the Kafka-Connect-offset-topic posture of most Debezium users. Both contributions ship with boolean → enum auto-mapping (flush.lsn.source → lsn.flush.mode; internal.slot.seek.to.known.offset.on.start → offset.mismatch.strategy). Updated platform scale at publication: "hundreds of event streams" processing "hundreds of thousands of events per second" across 100+ Kubernetes clusters at peak; pre-disable production run on Debezium 2.7.4 + pgjdbc 42.7.2 was "nearly two years, processing billions of events, with zero detected data loss from this mechanism." First wiki canonical instances of systems/patroni (as a system), concepts/lsn-flush-mode, concepts/offset-mismatch-strategy, concepts/slot-vs-offset-position-tracking, concepts/memory-offset-backing-store, patterns/opt-in-driver-level-lsn-flush, patterns/authoritative-slot-over-authoritative-offset, and patterns/backward-compatible-config-migration.
Destructive-automation blast radius / supertool safety net (opened 2024-01-22) — Adrian Chifor's 2024-01 postmortem of Zalando's November 2022 DNS outage (sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools) coins the term supertool for applications and scripts that wield fleet-wide destructive authority and names the canonical failure mode: a single p-typo turning YAML field metadata into metadpata collapsed an account-lifecycle job's accounts-in-scope set to empty, which its decommission code path interpreted as "all accounts decommissioned", triggering fleet-wide Route 53 hosted-zone deletion across the AWS Organization. "All of us except for the cloud infrastructure team were locked out of accessing AWS accounts and internal tools due to missing DNS entries." Recovery: DNS outage recovery via cached-entries-before-TTL- expiry and tiered essential-tooling → core-infrastructure → on-site restoration, with rotating Incident Commanders by expertise area keeping the response focused across phases. The post catalogues a 5-layer containment stack that becomes the canonical wiki recipe for destructive-automation blast- radius reduction: (a) scream test — 1 week of Network ACL isolation + DNS delegation removal before real decommissioning (concepts/scream-test-for-deletion); (b) cost-weighted deletion deferral — low-savings resources excluded from automation with a 7-day cost-threshold gate; (c) triple-redundant jsonschema validation — IDE autocomplete via systems/yaml-language-server + pre-commit hook + CI pipeline, all against one schema, plus cfn-lint for CloudFormation templates; (d) PR preview of CloudFormation ChangeSet — bot reads CreateChangeSet from every AWS account in the organisation, merges into a human-readable PR comment, drops the ChangeSet (concepts/cloudformation-changeset); (e) phased rollout across release channels — extends the existing Kubernetes cluster-rollout shape to AWS accounts via named categories (playground → test → infra → production, concepts/release-channel-rollout). The post also names accelerated deletion pacing for cost savings as an amplifier: "As part of cost-saving measures, the pacing of executing deletion operations was sped up" — cost-optimization on destructive automation is itself a reliability risk. Axis 14 is the canonical wiki anchor for infrastructure-change safety nets at the AWS-account-lifecycle altitude, pairing with axis 5 (SRE evolution) at the operational-reliability altitude and axis 7 (frontend-platform) at the change-management-tooling altitude.
AI-assisted content onboarding / catalog-attribute copilot (opened 2024-09-17) — the 2024-09 post (sources/2024-09-17-zalando-content-creation-copilot-ai-assisted-product-onboarding) opens a new LLM/VLM-in-production axis for Zalando downstream of the existing ML-platform axis (2022-04) and the MDM axis (2021-07). The Content Creation Copilot is a four-service decomposition — Content Creation Tool (copywriter UI with purple-dot AI-provenance marker and pre-selected suggestions), Article Masterdata (system-of-record for opaque attribute codes + attribute sets per article type), Prompt Generator (the load-bearing orchestrator — materialises the prompt from the schema, runs the bidirectional code↔English translation layer, filters attributes via a category-attribute relevance mapping, selects input images per an empirical product-only- front > model-worn > other ranking), and the OpenAI LLM backend (GPT-4 Turbo at launch, migrated to GPT-4o). The system is explicitly framed as a model-agnostic aggregator: "we created an aggregator service - to integrate multiple AI services, leveraging a wider variety of data sources, such as brand data dumps, partner contributions, and images" — validated by the GPT-4 Turbo → GPT-4o swap with no downstream contract changes. The UX layer is the pre-select-with-visual-disclosure pattern: shift the copywriter's workload from enrich-then-QA to QA-only, displacing ~25% of the content-production timeline. Production numbers at publication: ~75% accuracy, ~50,000 attributes enriched per week across 25 markets. Weak-spot disclosure: long-tail fashion vocabulary (e.g. deep scoop neck) underperforms on balanced eval sets; unbalanced (production- representative) eval sets give higher headline numbers — an eval-set-design caveat that sits independent of model choice. First wiki canonical instances of systems/zalando-content-creation-copilot, systems/zalando-content-creation-tool, systems/zalando-article-masterdata, systems/zalando-prompt-generator, systems/gpt-4o, concepts/opaque-attribute-code-translation-layer, concepts/ai-provenance-ui-indicator, concepts/category-attribute-relevance-mapping, concepts/input-image-selection-tradeoff, patterns/model-agnostic-suggestion-aggregator, and patterns/pre-select-ai-suggestions-with-visual-disclosure. Sibling to Instacart's PARSE (2025-08-01) — same pattern and same concept in a different catalog domain (fashion vs. grocery); Zalando's copilot is the thinner, less- platformised, human-in-the-copywriting-loop production instance.
Ingress control-plane scaling / coalescing proxy pattern (opened 2025-02-16) — the 2025-02 post (sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster) opens a new control-plane-under-fan-out axis downstream of the infrastructure / platform-engineering line that runs through axis 1 (Postgres-on-K8s), the 2020-06-30 Skipper+ingress-stack launch, and axis 15 (es-operator partial failures). At ~180 Skipper pods per cluster × 200 clusters, each Skipper independently polled the Kubernetes API for Ingress + RouteGroup; etcd was overwhelmed and the API-server CPU throttled, threatening pod scheduling — canonical instance of concepts/control-plane-fan-out-to-kubernetes-api. Remediation: insert Route Server (routesrv) as a single coalescing proxy between Skipper and the Kubernetes API, polling every 3 s and serving Skippers over HTTP with ETag / 304 (concepts/etag-conditional-polling); Skipper keeps the last routing table if routesrv goes away (concepts/last-known-good-routing-table). The rollout itself is the second load-bearing artifact — a three-mode False / Pre / Exec config flag (patterns/three-mode-rollout-off-shadow-exec) where Pre is an explicit shadow mode in which operators git diff Skipper-computed vs routesrv- computed Eskip before any cluster flips to Exec. Kubernetes Informers explicitly rejected because they preserve the N× fan-out shape at change events. Result: Skipper HPA raised from ~180 to 300 pods/cluster, zero downtime, zero GMV loss. First wiki canonical instances of systems/zalando-route-server, concepts/control-plane-fan-out-to-kubernetes-api, concepts/etag-conditional-polling, concepts/last-known-good-routing-table, concepts/polling-interval-as-freshness-budget, patterns/control-plane-proxy-with-etag-cache, and patterns/three-mode-rollout-off-shadow-exec.
LLM-powered code migration / frontend UI-library migration (opened 2025-02-19) — the 2025-02 post (sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries) opens a new axis distinct from but downstream of Axis 16's AI-assisted-content-onboarding copilot. Where Axis 16 canonicalises LLM-in-production at the catalog-enrichment altitude (four-service aggregator generating attribute suggestions for humans to QA), Axis 19 canonicalises LLM-in-engineering at the code-transformation altitude — a Python-based Component Migration Toolkit built by Partner Tech to migrate 15 B2B applications between two in-house UI component libraries. The load-bearing artefact is a frozen prompt with a three-layer Interface + Mapping + Examples composition — discovered through five offline iterative experiments (concepts/iterative-prompt-methodology) during an internal hackathon, not a runtime judge-LLM refinement loop. The mapping layer ("convert size='medium' to size='large' due to visual equivalence") is manually verified by pair programmers + designers — canonical wiki anchor for concepts/visual-equivalence-mapping, the information class source code alone cannot reveal. Production discipline: temperature=0 for reproducibility (concepts/temperature-zero-for-deterministic-codegen); <updatedContent> opaque output fencing (concepts/opaque-output-format-fencing) — same industry-convergent shape as Slack's <code></code> in the Enzyme→RTL codemod; 4K-output-token recovery via the conversation API and a literal "continue" prompt (concepts/continue-prompt-for-truncated-output) — "a simple 'continue' prompt proved more reliable than more complex prompts"; static/dynamic prompt partitioning for prompt-cache hits (concepts/static-dynamic-prompt-partitioning, patterns/prompt-cache-aware-static-dynamic-ordering) — static prefix first, <file>{file_content}</file> last; logical component grouping to keep context tokens in the 40–50K accuracy sweet spot (concepts/logical-component-grouping-for-context-budget, patterns/grouped-component-batched-migration); LLM-generated examples riding in the prompt and replayed in CI as prompt-regression tests (concepts/llm-generated-prompt-regression-test). Production numbers: ~90% accuracy, < $40 per code repository via GPT-4o pricing, 30–200 seconds per file. Axis 19 is the wiki's canonical instance of the LLM-only code-migration pipeline pattern — a structural contrast-pair with Slack's AST+LLM hybrid (same problem class, different application-layer choice: the human effort that would go into AST-rule authoring goes into mapping-verification instead) and a scale-sibling with Cloudflare's vinext framework rewrite (both LLM-authored code at production scale, different task shape — whole- rewrite vs bulk transformation). Tool-surface substrates: continue.dev for prompt authoring; llm library + OpenAI API
- GPT-4o as the runtime.
ZEOS probabilistic-forecast + black-box inventory optimisation on zFlow / hybrid online+offline serving (opened 2025-06-29) — the 2025-06 post (sources/2025-06-29-zalando-building-a-dynamic-inventory-optimisation-system-a-deep-dive) opens a new B2B partner-facing ML product axis downstream of axis 10 (Zalando Payments real-time inference on zFlow, 2021-02-15) and axis 11 (ML Platform / zFlow overview, 2022-04-18). Axis 20 is the third publicly-named zFlow workload and the first one explicitly combining batch + real-time delivery against the same feature store — the ZEOS Inventory Optimisation System runs a Demand Forecaster weekly (5M SKUs, 3-year sliding window, 12-week probabilistic horizon via LightGBM + Nixtla MLForecast, full pipeline < 2 hours) and a Replenishment Recommender daily batch + online interactive via Monte Carlo simulation + gradient-free black-box optimiser. Load-bearing canonicalisations: (a) the two-tier feature-engineering split (concepts/data-preprocessing-vs-data-transformation-split) — PySpark+Databricks horizontal tier + SageMaker Processing Job vertical tier; (b) SageMaker Feature Store in both online and offline modes with an explicit parity invariant (patterns/online-plus-offline-feature-store-parity) — first wiki instance of dual-mode feature-store architecture; (c) single SageMaker Training Job train-and-infer — lightweight LightGBM model bypasses the separate inference-hosting tier entirely, no checkpointing; (d) async SQS → Lambda → multi-threaded optimiser (patterns/async-sqs-lambda-for-interactive-optimisation) for interactive what-if from the partner portal, with side-effect write-back to the offline feature store so future batch runs stay consistent; (e) proactive cache of daily batch predictions — precompute for all merchants × articles so the dashboard read path is a KV lookup; (f) decoupled cadences — weekly forecast + daily optimise + online what-if (patterns/weekly-batch-forecast-daily-batch-optimise-cadence); (g) drift monitoring as a pipeline stage via SageMaker Processing Job + CloudWatch alarms + Lambda; (h) model choice rationale over deep-learning (TFT etc.) — lightweight footprint + ecosystem + rapid prototyping + conformal inference for probabilistic output. First wiki canonical instances of systems/zeos-inventory-optimisation-system, systems/zeos-demand-forecaster, systems/zeos-replenishment-recommender, systems/aws-sagemaker-feature-store, systems/sagemaker-processing-job, systems/sagemaker-training-job, systems/sagemaker-batch-transform-job, systems/mlforecast-nixtla, systems/lightgbm, systems/numba, and all 14 new concept + 8 new pattern pages listed above.
AI-powered postmortem analysis / fleet-insight mining (opened 2025-09-24) — the 2025-09 post (sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis) opens a lateral AI-assisted companion to axis 5's SRE discipline: a five-stage LLM map-fold pipeline ( Postmortem Analysis Pipeline) mines "thousands of archived postmortem documents" across the datastore Tech Radar — Postgres / AWS DynamoDB / AWS S3 / AWS ElastiCache / Elasticsearch — to surface recurring failure patterns and "investment opportunities" for engineering leadership. Five-stage architecture: Summarization (TELeR-shaped + refuse-on-ambiguity) → Classification ([[patterns/ negative-example-prompting|negative-example-shaped]] against surface-attribution) → Analyzer (3–5-sentence causal digest, human-curation pivot point) → Patterns (LLM fold over all digests into a one-pager) → Opportunity (human-authored investment proposal). Canonical first wiki instance of patterns/multi-stage-llm-pipeline-over-large-context as an alternative to single large-context prompting motivated by lost in the middle. Canonical first-wiki datums: (a) surface attribution error quantified at ~10% on Claude Sonnet 4, structural-not-scale; (b) hallucination rate evolution: small open-source 3B–12B at up to 40% → prompt-hardened + curated to < 15% → Claude Sonnet 4 "negligible"; (c) transition driver NotebookLM → LM Studio on-prem → AWS Bedrock is compliance / legal review, not capability; (d) human-curation glide path 100% during development → 10–20% random sampling at maturity with non-negotiable proofreading of the final Patterns-stage one-pager; (e) postmortems as data goldmine reframing of the archived corpus. Quantified 2-year outcome: automated change validation for infrastructure-as-code "shielded us from 25% subsequent datastore incidents"; AWS ElastiCache 80% CPU ceiling at peak surfaced as a strategic capacity-planning hotspot. Two new Postgres failure disclosures also surfaced by the pipeline (extending axis 13): a Postgres 12 AUTOVACUUM LAUNCHER race condition crashing connection pools and a Postgres 16→17 upgrade triggering a logical-replication memory-leak bug under parallel DDL + heavy transactions. Pipeline scope limited to public technologies only — Zalando-internal systems like Skipper produce "unacceptable analysis" and are flagged as the fine-tuning roadmap. Explicitly not an agentic solution: "the initial concept of a no-code agentic solution was quickly deemed unfeasible due to performance limitations, inaccuracies, and hallucinations encountered during prototype development" — the pipeline is the control structure, not an agent loop.
Rendering Engine + React Native mobile migration / brownfield RN at consumer-app scale (opened 2025-10-02) — the 2025-10 post (sources/2025-10-02-zalando-accelerating-mobile-app-development-with-rendering-engine-and-react-native) extends axis 7 (Rendering Engine) from web-only to cross-platform (web + iOS + Android) and opens a new architectural axis: brownfield React Native integration at consumer-app scale (52M+ customers, 90+ screens across two native codebases). Three load-bearing canonicalisations: (a) React Native as a package — RN root + init logic packaged as an npm Entry Point consumed by both a greenfield Developer App (standard RN toolchain, bundle-switching dev menu, mock interop contracts — unlocks web engineers for RN contribution) and a native Framework SDK (iOS + Android library exposing a simple ReactNativeViewFactory surface that the legacy app links like any other framework); generalised by Callstack's react-native-brownfield package. (b) Turbo Module + DI contract — three-language API contract (TypeScript + Swift + Kotlin) with a DI injection slot for the legacy app's implementation, so wishlist-badge style RN↔native interop doesn't couple the SDK to legacy source; lesson: "first properly define these API contracts … otherwise, you run into challenges where the API design might not be feasible on all platforms" (patterns/api-contract-first-across-three-languages). (c) Cross-platform UI via HTML-subset + tokens — react-strict-dom (HTML subset → RN primitives on mobile, plain HTML on web, zero runtime cost on web via build-time stripping) + StyleX + ZDS tokens for cross-platform theming, with Metro file-resolution (Foo.native.ts / Foo.ios.ts / Foo.android.ts) as the per-platform escape hatch and react-strict-dom's compat.native as the in-component escape hatch. Chosen over react-native-web on substrate-longevity + zero-runtime-on-web grounds. Validated at scale by the Discovery Feed migration — the new media-heavy front screen featured in Zalando's Q2 2025 financial results. Progressive adoption discipline: screen-by-screen with first screen low-traffic and simple for pipeline exercise before betting on a flagship. Structural contrast-pair with Shopify's greenfield team-by-team RN adoption (same "100% RN is not the goal" architectural posture at mixed-stack altitude, different starting-point inversion) and with Shopify's RN-new-architecture migration (both brownfield but at orthogonal altitudes: RN-architecture upgrade vs adding-RN- itself). First wiki canonical instances of systems/react-strict-dom, systems/stylex, systems/metro-bundler, systems/callstack-react-native-brownfield, systems/zalando-mobile-framework-sdk, systems/zalando-mobile-developer-app, systems/zalando-discovery-feed, systems/zalando-design-system-tokens, systems/react-navigation, systems/react-native-video, and all 9 new concept
- 10 new pattern pages.
Catalog-search self-inflicted DoS / per-caller observability & app-side admission control (opened 2025-12-16) — the 2025-12 post (sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search) opens a standalone Search & Browse altitude on the wiki: the multi-layer Zalando Catalog Search substrate (Catalog API → NER query builder → Search API wrapping Base Search Elasticsearch; enrichment by Algorithm Gateway and Promotions Bidding; downstream by Zalando Assistant). Canonical first-wiki datums: (a) self-inflicted DoS — an internal app issuing 20–100 req/s (vs cluster baseline "thousands of req/s") of high- cardinality terms aggregations on the SKU field pinned coordinator CPU and starved the search thread pool, producing multi-market customer-visible "search is slow" / "filters are broken" for hours despite low volume by any standard metric. (b) The 5 reasons the incident hid enumerated verbatim as a diagnostic-gap template for zebra write-ups. (c) Zebra-not-horse heuristic — the investigator's bias-checker when the first-line horse-hypotheses have been eliminated and the symptom persists. (d) Adaptive Replica Selection named as ES's in-cluster analogue of power-of-two-choices, and specifically noted as unable to save the cluster because routing doesn't help when every shard copy is saturated. Three-piece follow-up program: (i) X-Opaque-Id client attribution → per-client slow-query dashboard (observability); (ii) app-side query limiting with dynamic thresholds (first-gate admission control); (iii) search.max_buckets cluster-wide guardrail (last-line defence). Incident-time mitigation via market split through node.attr.market allocation filters — the incident-time sibling of axis 18's steady-state market group isolation, applied at the ES storage tier rather than the PRAPI serving-API tier. Root cause identified via a Lightstep trace-exploration notebook that spotted the offending caller at 50× baseline fan-out — a trace-altitude rescue of a metric-altitude investigation. First wiki canonical instances of systems/zalando-catalog-search, systems/zalando-base-search, systems/zalando-catalog-api, systems/zalando-search-api, systems/zalando-ner-query-builder, systems/zalando-algorithm-gateway, systems/zalando-promotions-bidding, systems/zalando-assistant, concepts/self-inflicted-dos, concepts/high-cardinality-aggregation-overload, concepts/adaptive-replica-selection-elasticsearch, concepts/x-opaque-id-client-attribution, concepts/zebra-not-horse-heuristic, patterns/split-cluster-by-market-for-load-isolation, patterns/application-side-query-limit-with-dynamic-threshold, patterns/per-client-slow-query-dashboard, and patterns/cluster-wide-aggregation-guardrail.
Search quality assurance with AI as a judge / pre-launch market validation (opened 2026-03-16) — the 2026-03 post (sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge) opens the evaluation-side companion to axis 23's serving-side self-inflicted-DoS axis: Zalando's Search & Browse team shipped an offline LLM-as-a-judge evaluation framework (Search Quality Framework) built to validate catalog-search quality before launching into a new country with no user-data signal. Applied to Zalando's 2025 Luxembourg / Portugal / Greece launches. Load-bearing canonicalisations: (a) concepts/pre-launch-market-validation as a named problem class — "For an entirely new country, these signals are by definition not there yet"; the LLM-judge-shaped substitute for click-based QA. (b) NER-clustered query sampling via the Search Query Clustering pipeline — the NER engine's tag sets that cluster production queries for sampling also segment the output for diagnostic aggregation. (c) Visual-text relevance judgment — GPT-4o scores (query, product) pairs on a 0–4 rubric using product data + images as evaluation context with generalised reasoning (no per-attribute prompts). (d) Translated-query parity operationalised as cross-language NER-tag diffs — a diagnostic sidecar isolating NER-vocabulary issues from ranker / catalogue issues. Four disclosed PT violation shapes: lemmatisation drift ("desporto" / "desportivo" / "desportiva"), ambiguous collision ("tenis" / "ténis" vs tennis the sport), missing vocabulary ("menina", "meninas"), and multi-word term unrecognised ("fato de treino" tracksuit). (e) Segment-level root-cause diagnosis — three named failure classes surface as different segment-level patterns (incorrect product attributes; unrecognised NER terms; undiscoverable brand categories). Brand- wide failure example disclosed: BRAND=foo across 5 category segments all scoring 1.5–1.9 / 4.0. (f) (query, product) evaluation cache on ElastiCache, scoped only to evaluation tasks — collapses naive 5000 × 25 product-API + LLM calls to |unique products|; makes re-runs near-free and enables regression-detection on live markets. (g) Airflow TaskGroup parallelism — one TaskGroup per market, fan-out / fan-in with a consolidation task; each stage shipped as a Docker image via KubernetesPodOperator. Operational numbers disclosed: ~$250 per full run (GPT-4o completions dominate), 3–5 hours runtime, 1,500 segments × 25 results per market, 3 markets in parallel. Sibling to axis 16 ( Content Creation Copilot) on the AI-for-search/ AI-for-catalog axis — this axis targets validation rather than generation, and invokes GPT-4o as a judge rather than an attribute- extractor. Sibling to axis 18 ( Postmortem Analysis Pipeline) on the LLM-for-operations axis — this is Zalando's third publicly-disclosed production LLM system, after the copilot and the postmortem analyser. First wiki canonical instances of systems/zalando-search-quality-framework, systems/zalando-search-query-clustering, concepts/pre-launch-market-validation, concepts/ner-clustered-query-sampling, concepts/translated-query-parity, concepts/visual-text-relevance-judgment, concepts/segment-level-root-cause-diagnosis, concepts/query-product-evaluation-cache, concepts/ner-tag-parity-across-languages, concepts/automated-test-generation-from-production-traffic, concepts/airflow-taskgroup-parallelism, patterns/llm-as-judge-for-search-quality, patterns/ner-clustered-query-sampling-from-production, patterns/segment-level-relevance-dashboard, patterns/query-product-evaluation-cache, patterns/per-market-parallel-taskgroup-dag, patterns/translated-query-ner-parity-check, and patterns/podoperator-encapsulated-evaluation-job.
Flink state discipline / Table-API → DataStream-API multi-way-join rewrite on AWS Managed Flink (opened 2026-03-03) — the 2026-03 post (sources/2026-03-03-zalando-why-we-ditched-flink-table-api-joins-cutting-state-by-75-with-datastream-unions) opens the stream-processing operational-cost axis, structurally adjacent to axis 23's Catalog Search Elasticsearch blast-radius axis but at a different subsystem: the Search & Browse team's Product Offer Enrichment Flink job on AWS Managed Flink 1.20. The original 4-way Table-API JOIN chain (offer + boost + sponsored + product events, keyed on SKU) hit state amplification — each join operator in Flink 1.x keeps its own independent RocksDB copy of both inputs for late-arrival correctness — and grew application state to 235–245 GB. Hourly savepoints became the dominant workload: "keep the cluster's CPU at 100% for nearly 12 minutes", producing backpressure, crash-restart loops, missed snapshots, and a permanent 10–20 % overscale margin on KPUs. The rewrite collapsed the chain into a single custom DataStream-API KeyedProcessFunction — union(...) → keyBy(SKU) → MultiStreamJoinProcessor — with [[patterns/single-valuestate-over-chained-joins|one ValueState[EnrichmentState] per SKU]] and event-time / content filtering before state.update to avoid redundant RocksDB writes. Impact: state 235 GB → 56 GB (−76 %), snapshot duration 11 min → 2.5 min (−77 %), CPU 100 %-spike → ~30 %-stable, restart time 12–20 min → 4–5 min, AWS cost −13 % (sub-proportional because KPUs bundle vCPU + RAM + storage 1+4+50 — the savings came from dropping the overscale margin, not from proportional KPU reduction). The closing note is that Flink 2.1's experimental MultiJoin operator (FLIP-516, 2025-05-19) implements the same idea natively ("2x to over 100x+ increase in processed records; 3x to over 1000x+ smaller state") but managed-runtime version lag forced the DataStream rewrite: "we're covered by our home-baked solution." The canonical framing — Flink SQL is perfect for 90 % of use cases; a software engineer's value is in recognising the remaining 10 % — takes its place as this wiki's declarative-vs-imperative stream-API stanza. First wiki canonical instances of systems/flink-table-api, systems/flink-datastream-api, systems/aws-managed-flink, systems/flink-multijoin-operator, systems/zalando-product-offer-enrichment, concepts/flink-stateful-join-state-amplification, concepts/flink-snapshot-savepoint, concepts/flink-keyed-stream-union, concepts/kpu-aws-managed-flink, concepts/multi-way-join-operator-flink, concepts/declarative-vs-imperative-stream-api, patterns/stream-union-plus-keyed-process-function, patterns/single-valuestate-over-chained-joins, and patterns/event-time-filter-for-state-write-reduction.
Skipper admission-time route validation / shift-left ingress correctness (opened 2026-04-08) — the 2026-04 post (sources/2026-04-08-zalando-rejecting-invalid-ingress-routes-at-apply-time) opens the admission-path control-plane enforcement tier as the second Skipper-specific control-plane axis after axis 17. Zalando's Skipper extends Kubernetes's Ingress / RouteGroup with its own predicates, filters, and backend DSL that the API server's standard admission pipeline cannot validate — a typo like Headr("X-Canary", "true") instead of Header(...) is valid YAML but broken Skipper. Rather than build a second validator, Zalando deployed a Kubernetes validating admission webhook (ingress-admitter.teapot.zalan.do) that reuses Skipper's own filter registry, predicate specs, route parser, and backend parser to validate objects at admission time (patterns/reuse-runtime-logic-on-admission-path). The webhook propagates Skipper's literal error message through the Kubernetes deny response — "predicate 'NonExistingPredicate' not found" — so engineers fix the manifest in place instead of tracking down a broken route later from a support channel (concepts/shift-left-validation). Load-bearing canonicalisations: (1) concepts/validating-admission-webhook as a named primitive with the admission-pipeline diagram and the failurePolicy: Fail vs Ignore trade-off. (2) concepts/shift-left-validation as the general engineering stance of moving a correctness check from runtime to write-time. (3) concepts/control-plane-change-blast-radius as the specialised concepts/blast-radius framing for control-plane enforcement: a bad webhook freezes CI/CD fleet-wide without affecting live customer traffic — different risk class than data-plane changes. (4) concepts/feature-flag-rollback-for-validator as the control-plane-enforcement specialisation of concepts/feature-flag: the -enable-advanced- validation flag gates the new Skipper-specific validation layer on top of the pre-existing basic validation, so a false-positive rollback is a flag-flip, not a binary redeploy. (5) concepts/invalid-route-observability-metric canonicalised by name — skipper_route_invalid{route_id, reason} — as the per-tier rollout gate that distinguishes real user mistakes (which the webhook should reject) from validator bugs (which mean the webhook is wrong). (6) patterns/reuse-runtime-logic-on-admission-path as the architectural pattern, contrasted against writing a second validator (drifts) or using a policy engine (correct for cluster-wide rules, wrong for domain-DSL validation where the runtime already has a validator). (7) patterns/invisible-rollout-via-default-on-validation as the rollout shape — tier-by-tier default-on enablement such that teams writing valid manifests experience zero change, with the flag + metric + error message discipline making that invisibility safe. Also adds Seen-in to existing patterns/feature-flagged-dual-implementation (applied to validator rollout instead of RN architecture migration). Scale framing: 250+ clusters, 15k+ ingresses, ~200k routes, 500k–2M RPS with 1% invalid routes = ~2,000 broken routes = real production risk; upstreamed to open-source Skipper v0.24.18+.

Key systems¶

systems/zalando-logistics-portal — internal-facing portal built by the Transport teams inside the Logistics department (separate from the customer-facing Interface Framework stack). Composition primitive is Webpack Module Federation: 11 apps / 4 teams load as remotes into a single host shell with React / React-DOM as shared singletons. Four architectural seams — a centralised backend proxy as sole auth+authz gatekeeper; a manifest-driven /applications + manifest.json discovery chain with permission-scoped menus; a single prop passed from host to every remote for session / settings / logout / navigation / error hand-off; and a shared UI-kit as internal npm package. Canonical wiki instance of Module-Federation-based micro-frontends; second Zalando micro-frontend architecture on the wiki alongside IF.
systems/zalando-landing-pages-stack — Zalando's editorial-content platform serving campaign, sustainability, and category-inspiration pages on zalando.com. Built on Contentful (headless CMS) wrapped by the internal Contentful proxy, delivered via Interface Framework / Rendering Engine, aggregated via FSA GraphQL. Replaced legacy Project Mosaic-era editorial tooling in 2022. Canonical wiki instance of patterns/headless-cms-for-editorial-content + patterns/proxy-layer-for-external-saas + patterns/cross-surface-enrichment-via-internal-service; non-technical stakeholders own the end-to-end create/upload/publish flow; +82% YoY published pages, time-to-go-live 2 days → 4 hours.
systems/zalando-fashion-store-api — FSA, Zalando's production GraphQL aggregator (the UBFF's Fashion-Store deployment). The Rendering Engine issues GraphQL per-renderer to FSA; FSA fans out across Zalando-operated services only (external SaaS goes through internal proxies like the Contentful proxy). First dedicated wiki page promoted out of implicit references on earlier Zalando sources.
systems/zalando-content-proxy — the Contentful proxy: mapping + caching + resilience between FSA and Contentful. Canonical wiki instance of patterns/proxy-layer-for-external-saas; also the enrichment-join point where CMS-referenced IDs (certificate IDs) get resolved against Zalando's internal system of record.
systems/contentful — external headless CMS SaaS behind the Landing Pages stack. Selected on build-vs-buy grounds (scope risk of rebuilding CMS authoring UX) with custom-app extensibility + multi-language support + collaboration features as the decisive properties.
systems/zalando-marketing-roi-pipeline — the Performance Marketing org's batch data + ML ROI pipeline (Databricks Spark
Airflow, built partly on zFlow). The canonical motivating system for Zalando's per-PR pipeline environment via DAG versioning pattern (axis-new, 2022-06). Kubernetes operator; release 1.5 (2020) introduced the built-in PgBouncer connection-pooling feature that motivates the first ingest.
systems/zeos-inventory-optimisation-system — ZEOS (zeos.eu) B2B partner-facing AI-driven replenishment recommender. Two-pipeline architecture: systems/zeos-demand-forecaster (weekly probabilistic 12-week forecast for 5M SKUs via LightGBM + Nixtla MLForecast, < 2 h end-to-end on zFlow) + systems/zeos-replenishment-recommender (daily SageMaker Batch Transform + online SQS/Lambda interactive path sharing the same Monte Carlo + gradient-free black-box optimiser, fronted by the Zalando partner portal). Canonical first wiki instance of dual-mode SageMaker Feature Store (online + offline) with a parity invariant between the two delivery paths.
systems/zalando-route-server — Go proxy (github.com/zalando/skipper/routesrv) inserted between Skipper and the Kubernetes API as a coalescing control-plane proxy. Polls the API every 3 seconds for Ingress + RouteGroup, parses to Eskip, serves all Skippers in the cluster via HTTP + ETag / 304 (concepts/etag-conditional-polling). Canonical remediation for concepts/control-plane-fan-out-to-kubernetes-api at Zalando's 200-cluster, ~300-Skipper-per-cluster scale. Canonical instance of patterns/control-plane-proxy-with-etag-cache; rolled out via the three-mode False / Pre / Exec flag (patterns/three-mode-rollout-off-shadow-exec) with zero GMV loss.
systems/zalando-postmortem-analysis-pipeline — five-stage LLM map-fold pipeline (Summarization → Classification → Analyzer → Patterns → Opportunity) that mines "thousands of archived postmortem documents" across Postgres / DynamoDB / S3 / ElastiCache / Elasticsearch for recurring failure patterns. Current generation runs Claude Sonnet 4 on AWS Bedrock at ~30 s per postmortem, producing annual-scale cross-incident reports used for strategic infrastructure-investment decisions. Disclosed 2-year outcome: automated change validation shielded 25% of subsequent datastore incidents; AWS ElastiCache 80% CPU ceiling surfaced as a hotspot. Canonical wiki instance of patterns/multi-stage-llm-pipeline-over-large-context motivated by lost in the middle; residual surface-attribution error rate disclosed at ~10% even on Claude Sonnet 4. Contrasts with Content Creation Copilot (catalog-attribute extraction) as Zalando's second publicly-disclosed production LLM system and the first targeting operational decision support rather than customer-facing content.
systems/es-operator — Zalando-incubator Kubernetes operator for Elasticsearch; defines the ElasticsearchDataSet CRD (concepts/elasticsearchdataset-eds) and reconciles a StatefulSet underneath. Canonical instance of two partial- failure bug classes uncovered in the 2024-06-20 Lounge incident: concepts/context-cancellation-ignored-in-retry-loop (PR #405) and concepts/zombie-exclusion-list-state (WIP PR #423). Contrasts with zalando postgres operator (which uses StatefulSets similarly) and with PlanetScale Vitess Operator (which replaces StatefulSets entirely via patterns/custom-operator-over-statefulset).
systems/skipper-proxy — Go HTTP router / reverse proxy, default Kubernetes Ingress across 140+ clusters; reused to serve engineering.zalando.com via a single route annotation (compress() + setDynamicBackendUrl) that proxies to an S3 website endpoint.
systems/kube-ingress-aws-controller — auto-provisions AWS ALB + ACM cert per Ingress.
systems/external-dns — SIG Kubernetes controller used in combination with the above for end-to-end Ingress → ALB → DNS automation.
systems/octopus-zalando-experimentation-platform — in-house A/B testing platform; 2015–present; three subsystems (management, execution, analysis); analysis rebuilt on systems/apache-spark.
systems/randomizer-swift — open-source Swift library for randomised-input testing. Random protocol + Standard Library conformances + user-type extension point. Authored by Vijaya Kandel (Zalando Mobile, iOS). Used inside Zalando's iOS codebase.
systems/testcontainers — Zalando Marketing Services canonicalises the JVM / Java / Spring Boot altitude use pattern: singleton PostgreSQLContainer on a base class, @DynamicPropertySource-wired into Spring, amortised across all ITs in the JVM. Complements the existing Canva CI-framework altitude Seen-in.
systems/localstack · systems/mockserver · systems/wiremock · systems/ryuk-testcontainers-reaper — companion containers called out in the ZMS post.
systems/junit5 · systems/maven-surefire-plugin · systems/maven-failsafe-plugin · systems/spring-boot — the JVM test stack Zalando ZMS plugs Testcontainers into.
systems/zalando-load-test-conductor — Go microservice built by the Payments department (2021-03) to own the full lifecycle of an end-to-end load test: production-version cloning, multi-substrate scaling (Kubernetes + AWS ECS), KPI-closed-loop Locust steering, scale-down, and data cleanup. Exposes a declarative Kubernetes-inspired API; invoked both manually and via Kubernetes CronJob.
systems/locust · systems/hoverfly — the Payments department's chosen open-source traffic generator and API mocking tool. Locust over Vegeta / JMeter on developer- familiarity; Hoverfly over Wiremock / MockServer on record-and-replay + stateful behaviour + language-agnostic deployment.
systems/nakadi — Zalando's open-source event bus (Kafka wrapper with REST + schema registry); named in the 2021-03 post as a centrally-managed event queue whose test-cluster parity required cross-team alignment.
systems/opentracing — the distributed tracing substrate powering concepts/adaptive-paging (from the 2020-10 Cyber-Week retrospective). Part II (2021-09-20) names Zalando-specific Semantic Conventions + a Tracing API to consume trace data as the two load-bearing artifacts that enable every tracing-derived ops primitive (Adaptive Paging, Throughput Calculator, Operation-Based SLOs).
systems/zalando-adaptive-paging — Zalando's alert handler monitoring CBO error rate and traversing the live trace graph to page the team closest to the root cause. First named 2020-10-07, canonicalised 2021-09-20; presented at SRECon'19 EMEA by Mineiro.
systems/zalando-throughput-calculator — projects per- downstream-service RPS from expected CBOs/min using tracing fan-out data; feeds Load Test Conductor for Cyber Week capacity planning.
systems/zalando-slo-reporting-tool — DX-scoped SLO reporting with canonical SLIs + Service Tier classification; built 2018 for the Service Tier rollout, superseded in mindset by Operation-Based SLOs in 2019.
systems/zalando-service-level-management-tool — the operation-based SLO platform built 2021–2022 to succeed the older SLO Reporting Tool; per-CBO error-budget visualisation across three 28-day windows; drives Adaptive Paging's MWMBR trigger. First named 2022-04-27 in the Operation-Based SLOs post.
systems/zalando-graphql-ubff — Zalando's single-service Unified Backend-For-Frontends GraphQL; in production since end of 2018; 12+ domains, 200+ consuming developers, 25-30 feature teams; >80% Web / >50% App coverage.
systems/graphql-jit — Zalando's in-house open-source JIT-compiled GraphQL executor (zalando-incubator/graphql-jit), the execution engine the UBFF runs on.
systems/graphql — the query-language substrate.
systems/zalando-interface-framework — second-generation frontend platform; designed 2018, ~90% of zalando.com traffic by March 2021. Supersedes the Mosaic Fragment architecture with an entity-based page-composition model.
systems/zalando-mosaic — the 2015-era Fragment-based micro-frontend architecture Zalando retrospectively critiques; retained via hybrid Rendering-Engine modes during the migration to IF.
systems/zalando-rendering-engine — the Node.js + browser runtime at the heart of IF; resolves Entity trees into Renderer trees using declarative rendering rules.
systems/zalando-appcraft — Zalando's server-driven mobile UI framework (2018-present), replacement for the 2016-era Truly Native Apps (TNA) Composed-Tiles framework. Adopts the Elm architecture on the client, Flex on top of Texture (iOS) / Litho (Android), a narrow primitive vocabulary (Label, Button, Image, Video, Layout), server-owned schema versioning, and deep-link indirection for new-screen creation. Serves 13 dynamic pages in the Zalando app (2024) including Zalando Stories. Load-bearing release-boundary rule: "a client-release is only required when there's a need to introduce a new primitive or extend the contract of an existing one" (concepts/client-release-needed-only-for-new-primitives). Canonical instance of same-day UI delivery on mobile. Coexists with the 2025-era React Native path. Testing surfaces: Appcraft Browser (demo harness) + PR-pinned debug-app. Referenced 2021-09 micro-frontends part-2 post as its backend companion.
systems/zalando-truly-native-apps-tna — 2016-era predecessor to Appcraft. JSON-driven, vertical-list container of high-level Composed Tiles ("Showstopper Tile" with Version C/D variants). Retired because small UI changes required client releases and schema versioning required coordinated server + both-clients deploys. Decommissioned in favour of Appcraft's primitives + Flex
server-owned versioning design.
systems/zally — Zalando's open-source OpenAPI linter that codifies the RESTful API Guidelines. MUST-severity rules gate CI builds — the enforcement point of API-first at Zalando.
systems/fabric-gateway-zalando — Zalando's declarative Kubernetes API gateway built on top of Skipper; one of the three default AuthN/AuthZ options for new backend services.
systems/opentracing-toolbox — Zalando's Java/Kotlin integration library for OpenTracing; named as the Kotlin Guild's default tracing library with a dedicated Kotlin submodule.
systems/zalando-mdm-system — Zalando's in-house Master Data Management component (in-design as of mid-2021). Uses a knowledge graph in Neo4j as the design- time authoring substrate; Python script walks the graph to generate the logical data model of the golden record plus per-source-system transformation data models. Consolidated- style MDM scoped to "tens of tables and hundreds of columns".
systems/neo4j — the property-graph database used as the knowledge-graph store and visualisation tool for the MDM modeling work. Chosen explicitly for "best look-and-feel" / domain-expert communication, not query-path semantics.
systems/postgis — the PostgreSQL geospatial extension; de facto open-source standard for spatial data. Zalando's Postgres Operator team uses it for Mapbox-Vector-Tile production via ST_AsMVT / ST_HexagonGrid directly from the database, installed declaratively into a named schema via the operator's preparedDatabases field.
systems/pg-tileserv — Crunchy Data's lightweight HTTP vector-tile server; the thin shim Zalando pairs with PostGIS to serve {z}/{x}/{y}.pbf tiles straight from the database, with the full tile pipeline living in PostGIS functions.
systems/leaflet — the JavaScript library Zalando uses as the browser front-end for maps over a systems/postgis + systems/pg-tileserv stack, consuming vector tiles via the VectorGrid class.
systems/openstreetmap — the free wiki-style basemap Zalando stacks PostGIS-served overlay layers on top of. the zalando postgres operator runs; ships with PostGIS pre-compiled, enabling the declarative geospatial-Postgres-on-Kubernetes manifest shape.
systems/zflow — Zalando ML Platform's Python DSL for ML pipelines; in use since early 2019; hundreds of pipelines authored; compiles via AWS CDK to CloudFormation → Step Functions state machines invoking SageMaker training / batch transform / endpoints and systems/databricks Spark jobs. Operated by two central teams.
systems/datalab-zalando — Zalando's internal brand for its hosted JupyterHub + R Studio experimentation environment with pre-wired access to S3, BigQuery, MicroStrategy, and other internal data sources; positioned as the "ready in less than a minute" entry point for applied scientists.
systems/zalando-hpc-cluster — GPU high-performance computing cluster; SSH-accessible; used for computer vision and large-model training that notebook / Spark substrates cannot handle; co-operated with Datalab by a single central team.
systems/zalando-ml-portal-backstage — the ML-pipeline-observability surface built on systems/backstage; shows real-time pipeline execution, per-run metric-evolution graphs, and model cards. Co-owned with zflow by the two monitoring teams.
systems/backstage — Spotify's open-source developer- portal platform; first on-wiki reference; Zalando uses it as the substrate for its internal developer portal, including the ML portal plugin.
systems/aws-step-functions · systems/aws-lambda · systems/aws-sagemaker-ai · systems/aws-sagemaker-endpoint · systems/sagemaker-inference-pipeline-model · systems/databricks · systems/cloudformation · systems/aws-cdk · systems/xgboost — the managed-service surface Zalando zflow targets. Named jointly in the 2021-02-15 workload retrospective + 2022-04-18 platform overview as Zalando's canonical ML-pipeline runtime stack.

Key concepts and patterns surfaced¶

Postgres-on-Kubernetes / kernel-level latency axis: observation of non-uniform kube-proxy load distribution. first-person reproduction with perf softirq tracepoints. fix. multi-core scaling primitive. reproducible benchmark recipe. Zalando Operator's default topology. the documented escape hatch. - patterns/static-site-via-ingress-proxy-to-s3-website — Skipper Ingress + S3 website endpoint as a CloudFront alternative when the ingress platform is already operated. - concepts/git-based-content-workflow — Zalando's PR-driven engineering blog publishing model. - concepts/reuse-existing-infrastructure-over-purpose-built-service — the explicit reasoning behind choosing Skipper over CloudFront for the blog.

Experimentation-platform axis:

concepts/experimentation-evolution-model-fabijan — the crawl/walk/run framework the entire Zalando evolution post is structured around.
concepts/sample-ratio-mismatch — Zalando's 20%+ historical SRM rate vs 6–10% peer baseline.
concepts/experimentation-culture — five org-level moves (integration, training, embedded owners, internal blogs, consultation hours) to build data-driven decision-making.
concepts/ab-test-design-audit — 5-dimension pre-launch quality review.
concepts/non-inferiority-test — identified in peer review as an improvement area over the default two-sided t-test.
concepts/quasi-experimental-methods — causal-inference tools for use cases that can't be cleanly A/B-tested (country comparisons).
concepts/overall-evaluation-criterion — team-specific + LTV-proxy guidelines for KPI selection.
patterns/centralized-experimentation-platform — platformise A/B testing instead of team-by-team manual setup.
patterns/controlled-rollout-with-traffic-rampup — gradual traffic exposure + feature toggles as Octopus platform primitives.
patterns/open-source-wrapped-by-production-system — Octopus's inaugural architectural decoupling between Python stats library and Scala backend.
patterns/automated-srm-alert — auto-alert + dashboard gating as Octopus's SRM remediation.

Mobile testing discipline axis:

concepts/example-based-test-constant-input-antipattern — the failure mode Kandel names: hand-typed constants in test inputs make tests near-tautological against hard-coded implementations.
concepts/type-class-driven-random-generator — the generator- dispatch mechanism: T.random resolves per type via protocol conformance; user types compose via per-field delegation.
patterns/property-based-testing — extended with Zalando's Swift iOS implementer-altitude Seen-in; entry-level form of the pattern (one permutation per run, no shrinker, no seed replay).

JVM integration testing discipline axis:

concepts/test-pyramid — Fowler/Cohn shape; Zalando ZMS's empirical ratio ≈ 25% integration tests to unit tests varies per app.
concepts/first-test-principles — FIRST (Fast, Isolated, Repeatable, Self-Validating, Thorough) as the property contract ITs must still satisfy on shared containers.
concepts/h2-vs-real-database-testing — the antipattern Testcontainers replaces; 4 s Postgres-in-Docker vs 0.4 s H2.
concepts/singleton-container-pattern — static container on an abstract base class, started once per JVM via static initialiser, inherited by every subclass IT.
concepts/contract-testing — the gap Testcontainers alone doesn't close; Zalando names Spring Cloud Contract as the complement.
patterns/real-docker-container-over-in-memory-fake — prefer real Postgres / Localstack / MockServer over H2 / embedded-fakes / mocks-in-JVM.
patterns/failsafe-integration-test-separation — Maven Surefire (unit, test phase) + Failsafe (IT, integration-test phase) gated by a with-integration-tests profile.
patterns/shared-static-container-across-tests — the ZMS-canonical AbstractIntegrationTest idiom realising the singleton-container pattern.

Cyber-Week prep / load-test automation axis:

concepts/declarative-load-test-api — the Kubernetes- inspired API style the Payments department chose for the Load Test Conductor ("Executing a load test is now just one API call away!").
concepts/kpi-driven-load-ramp-up — the 60-second closed-loop algorithm keyed on orders-per-minute rather than a fixed users→orders ratio.
concepts/header-based-mock-switching — how the test cluster lets a single service deployment serve both mocked- dependency (load-test) and real-dependency (integration- test) traffic per request.
concepts/production-version-cloning-for-load-test — the Deployer + Scaler subcomponent invariant: match production version + replica count + resource allocation.
concepts/test-cluster-as-break-things-environment — the complement to patterns/live-load-test-in-production; a non-prod cluster deliberately pushed past failure.
concepts/adaptive-paging — from the 2020-10 Cyber-Week retrospective; OpenTracing-based alert-routing that pages the team closest to a fault.
concepts/sre-organizational-evolution — three-phase (grassroots → tracing + capacity → dedicated department) SRE maturation named in the 2020-10 post.
patterns/live-load-test-in-production — the in-prod discipline (2020-10 post) that the 2021-03 conductor's break-things cluster complements.
patterns/declarative-load-test-conductor — the pattern abstracted from Zalando's Load Test Conductor.
patterns/kpi-closed-loop-load-ramp-up — the algorithm pattern.
patterns/mock-external-dependencies-for-isolated-load-test — the mocking-layer pattern.
patterns/header-routed-mock-vs-real-dependency — the per-request switching pattern.
patterns/scheduled-cron-triggered-load-test — the Kubernetes-CronJob-driven recurring-run pattern.
patterns/annual-peak-event-as-capability-forcing-function — Cyber Week as the organisational forcing function that funds the load-test infrastructure investment.

Unified GraphQL BFF / API platform axis:

concepts/backend-for-frontend — the pattern Zalando adopted in 2015 alongside microservices and replaced in 2018 with the UBFF; the wiki's canonical account of the five BFF pathologies at e-commerce scale.
concepts/unified-graph-principled-graphql — the "one graph" discipline from Principled GraphQL; Zalando's 2021 post is the wiki's canonical discussion of this concept across six peer industry instances (GitHub, Shopify, Airbnb, Expedia, Netflix, Zalando).
concepts/conways-law — named by Zalando as the root cause of cross-BFF inconsistency; the UBFF is an Inverse-Conway remediation.
patterns/unified-graphql-backend-for-frontend — the single-service (non-federated) implementation of the unified-graph discipline; Zalando UBFF is the canonical wiki instance.
patterns/business-logic-free-data-aggregation-layer — Zalando's "No Business Logic" principle; aggregation layer is platform- and domain-agnostic, presentation- layer backends own business logic.
patterns/per-platform-deployment-bulkhead — Bulkhead applied at deployment level: separate Web + mobile-App instances of the same service, canonical at Zalando UBFF.
patterns/graphql-unified-api-platform — Zalando UBFF extends the umbrella pattern's canonical instance list alongside Twitter and Netflix.
concepts/graphql-persisted-queries — the build-time register-and-swap mechanism; Zalando runs it in gate mode (unknown-IDs-rejected), not Apollo's cache mode.
concepts/draft-schema-field — the @draft directive, stage 1 of the three-stage field lifecycle.
concepts/component-scoped-field-access — the @component / @allowedFor directive pair, stage 2 of the lifecycle; restricts experimental fields to named UI components to keep breaking-change blast radius small.
concepts/graphql-schema-usage-observability — the exact field-level usage observability that falls out of gate-mode persisted queries.
patterns/automatic-persisted-queries — mechanism class; Zalando UBFF is the canonical gate-mode instance, Apollo APQ is the cache-mode peer.
patterns/disable-graphql-in-production — the counter-intuitive framing: production endpoint accepts only query IDs, not raw GraphQL.
patterns/directive-based-field-lifecycle — umbrella pattern binding @draft → @allowedFor → stable as a three-stage schema-governance discipline.
concepts/graphql-schema-directive · concepts/graphql-query-directive — the two directive classes the UBFF's governance discipline rides on.
patterns/directive-based-field-authorization — @isAuthenticated(acrValue:) with step-up ACR values.
patterns/directive-based-pii-redaction — @sensitive argument-marker plus keyword-based schema linter for log/trace redaction.
patterns/directive-based-http-endpoint-partitioning — @requireExplicitEndpoint for per-path GraphQL sub-surfaces.
patterns/directive-driven-entity-codegen — @resolveEntityId → TypeScript codegen + runtime entity:<type>:<sku> wrapping.
concepts/final-enum · concepts/extensible-enum — the enum-governance pair.

JVM language governance / Kotlin ADOPT ring axis:

concepts/tech-radar-language-governance — ring-based (ASSESS/TRIAL/ADOPT/HOLD) governance for language and framework lifecycle; Zalando's public Tech Radar (opensource.zalando.com/tech-radar) is the canonical wiki instance.
concepts/api-first-principle — OpenAPI-first contract design paired with build-time linter enforcement; Zalando operationalises it via Zally + published RESTful API Guidelines + the central API portal.
patterns/template-project-nudges-consistency — bootstrap new services from a pre-wired template that "nudges teams towards higher consistency across different services and departments." The Kotlin Guild's template-ready deliverables are what graduated Kotlin from TRIAL to ADOPT.

MDM / knowledge-graph-driven data modeling axis:

concepts/master-data-management — the enclosing discipline; "technology-enabled discipline in which business and Information Technology work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise's official shared master data assets." Zalando chose consolidated-style MDM.
concepts/golden-record — "a common, shared, and trusted view on data for a particular domain"; the output of consolidation over source systems.
concepts/logical-data-model — the schema of the golden record; generated from the knowledge graph rather than authored directly.
concepts/transformation-data-model — per-source-system mapping showing direct (1-to-1) vs. indirect (1-to-many, transformation-function) column → concept mappings. The worked System A / System B Address example (free-text address lines vs. structured street / zip / city / country_code fields) illustrates both mapping types.
concepts/semantic-layer-of-business-concepts — the graph of Concept / Attribute / Relationship nodes between source schemas and the target logical data model; the "shared conceptual vocabulary" that makes business- engineering alignment tractable.
concepts/knowledge-graph — extended with Zalando MDM as its third canonical wiki instance (alongside Dropbox Dash retrieval substrate and Netflix UDA enterprise-data- integration substrate). A new H2 in the knowledge-graph concept page contrasts the three altitudes.
concepts/data-lineage — extended with Zalando MDM as a design-time byproduct Seen-in, complementary to the existing Meta (enforcement-primitive) and Redpanda (agent-interaction envelope) framings.
patterns/knowledge-graph-for-mdm-modeling — the core pattern; System / Table / Column / Concept / Attribute / Relationship node schema, Python generator, direct vs. indirect mappings.
patterns/mapping-driven-schema-generation — the generalised pattern across MDM, Netflix UDA, and dbt-style data-build tools: make the mapping authoritative and derive both target schema and transformation code.
patterns/visual-graph-for-business-engineering-alignment — Neo4j-rendered graph diagrams as the primary business- engineering communication artifact, replacing SQL DDL / spreadsheets. Named by the post as the #1 benefit.

Supply-chain / SBOM-driven dependency governance¶

systems/syft — Anchore's SBOM generator; runs on every Zalando deploy's container image and emits a CycloneDX / SPDX document that feeds the data-lake corpus.
systems/grype — Anchore's vulnerability scanner over syft-generated SBOMs; powers the CVE-correlation layer on top of the SBOM corpus.
systems/cyclonedx · systems/spdx — the two canonical SBOM formats; Zalando names both as "common formats" for portability + tooling integration.
systems/dependabot · systems/scala-steward — per-repo dependency-update bots named by the Zalando post as the tactical complement to the fleet-wide SBOM corpus.
systems/log4j — the canonical Log4Shell forcing function cited by the post as the defining mass-patch event the SBOM platform was built to handle.

Postgres event-streaming platform / driver-layer fixes¶

systems/zalando-postgres-event-streams — Zalando's low-code platform for declaring Postgres-sourced event streams; each declared stream provisions a dedicated Debezium Engine micro- application. Scale at publication time: hundreds of streams in production.
systems/debezium-engine — the embedded-library mode of Debezium; the shape Zalando chose over Kafka-Connect-hosted Debezium.
systems/pgjdbc-postgres-jdbc-driver — the Postgres JDBC driver Zalando patched to respond to KeepAlive messages with LSN acks; fix merged upstream as pgjdbc PR #2941 in 42.7.0.

Recent articles¶

[2026-04-08] Rejecting Invalid Ingress Routes at Apply Time — sources/2026-04-08-zalando-rejecting-invalid-ingress-routes-at-apply-time
[2026-03-16] Search Quality Assurance with AI as a Judge — sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge
[2026-03-03] Why We Ditched Flink Table API Joins: Cutting State by 75% with DataStream Unions — sources/2026-03-03-zalando-why-we-ditched-flink-table-api-joins-cutting-state-by-75-with-datastream-unions
[2025-12-18] Contributing to Debezium: Fixing Logical Replication at Scale — sources/2025-12-18-zalando-contributing-to-debezium-fixing-logical-replication-at-scale — two-year sequel to axis 13's 2023-11 pgjdbc upstream fix. After Zalando ran its Fabric Event Streams platform on pgjdbc 42.7.2's keepalive-LSN flush for nearly two years processing billions of events with zero detected data loss, Debezium hard-disabled the feature globally via PR #6472 (withAutomaticFlush(false)) — breaking Zalando's upgrade path. The remediation: two upstream contributions to Debezium 3.4.0.Final (released 2025-12-16). (a) lsn.flush.mode (DBZ-9641 / PR #6881) — three-mode enum (manual / connector default / connector_and_driver) making the pgjdbc flush opt-in per deployment (patterns/opt-in-driver-level-lsn-flush). (b) offset.mismatch.strategy (DBZ-9688 / PR #6948) — four-strategy enum (no_validation default / trust_offset / trust_slot / trust_greater_lsn) letting operators pick whether the slot or the offset store is authoritative on startup mismatch (patterns/authoritative-slot-over-authoritative-offset). Load-bearing architectural insight canonicalised: logical- replication position is tracked in two independent locations and the right reconciliation depends on operator-side invariants (HA substrate + offset-store choice + deployment track record) Debezium cannot know. Zalando trusts the slot because Patroni-managed Postgres has slot-survives-failover discipline since the mid-2010s plus MemoryOffsetBackingStore since 2018; most Debezium users trust the offset store because Kafka Connect offset topics are durable ground truth. Both properties ship with boolean → enum auto-mapping (flush.lsn.source → lsn.flush.mode; internal.slot.seek.to.known.offset.on.start → offset.mismatch.strategy). Updated platform scale: "hundreds of event streams" processing "hundreds of thousands of events per second" across 100+ Kubernetes clusters at peak. First wiki canonical instances of systems/patroni as a system page, concepts/lsn-flush-mode, concepts/offset-mismatch-strategy, concepts/slot-vs-offset-position-tracking, concepts/memory-offset-backing-store, patterns/opt-in-driver-level-lsn-flush, patterns/authoritative-slot-over-authoritative-offset, and patterns/backward-compatible-config-migration.
[2025-12-16] The Day Our Own Queries DoS'ed Us: Inside Zalando Search — sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search
[2025-10-02] Accelerating Mobile App development at Zalando with Rendering Engine and React Native — sources/2025-10-02-zalando-accelerating-mobile-app-development-with-rendering-engine-and-react-native
[2025-09-24] Dead Ends or Data Goldmines? Investment Insights from Two Years of AI-Powered Postmortem Analysis — sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis
[2026-01-14] Paper Announcement: A Practical Approach to Replenishment Optimization with Extended (R, s, Q) Policy and Probabilistic Models — sources/2026-01-14-zalando-paper-announcement-replenishment-optimization-extended-rsq. Blog companion to Zalando's Nature Scientific Reports publication on the algorithmic core of the ZEOS Inventory Optimisation System — the algorithm-shape companion to the 2025-06-29 architecture deep-dive. Discloses four load-bearing algorithmic choices the earlier post summarised verbally but didn't name: (1) Extended (R, s, Q) policy — classical reorder-point policy extended with kick-start quantity Q₀ at time t₀ (for launch-phase aggressiveness) and lifecycle cutoff t_limit (for decay-phase conservatism); parameter vector θ = (t₀, Q₀, s, Q). (2) Discrete Event Simulation (DES) over a 12-week horizon with explicit intra-week event ordering (inbound/2 → fulfilment → inbound/2 → reorder check → cost accrual) and Gamma-distributed replenishment lead times sampled per run. (3) [[concepts/percentile-objective-optimisation|75th-percentile cost minimisation]] — not expected-value; explicit tail-risk protection. (4) Counterfactual stockout demand — unfulfilled demand during stockouts sampled from the probabilistic forecast distribution rather than zeroed. Also canonicalises: five-pillar cost decomposition (holding + inbound + outbound + returns + lost-sales, where lost-sales is return-rate-adjusted), and LightGBM quantile regression (not conformal-inference wrapping) as the forecast mechanism. Empirical validation via computational backtest — ~2M articles × ~800 merchants for Oct 2023–Sep 2024 — against professional human replenishment decisions: +22.11% GMV, +21.95% Gross Margin, +33.63% weighted weekly availability (to 86.40% absolute), +23.63% weighted demand fill rate (to 91.14% absolute), with 70–80% of merchants seeing positive uplift. Theoretical-100%-adoption caveat explicitly called out: the tool is decision-support, merchants retain final authority, realised uplift depends on adoption rate. Baseline comparison shows Extended (R, s, Q) beats Tuned (s, S) (+8.72pp GMV), Periodic base-stock (+9.61pp), and Myopic Newsvendor (+17.04pp) under identical data and simulation settings — Zalando's framing: "even the Tuned (s, S) policy, which is a common industry standard, falls short because its static thresholds cannot match the responsiveness of our extended (R, s, Q) variables (Q₀ and t_limit) in a high-variance environment." Ablation study decomposes forecast type × objective type: probabilistic forecast is the first-order lever (+15.74pp GMV over point-forecast baseline, same P75 objective); percentile objective is the second-order stability lever (+3.09pp GMV over mean objective, same probabilistic forecast) — canonicalised on wiki as patterns/probabilistic-forecast-plus-percentile-objective ("need both") and patterns/des-plus-gradient-free-optimiser-under-uncertainty (architectural pairing for decision-under-uncertainty at scale). Extends axis 20 (ZEOS / probabilistic-forecast + black-box inventory optimisation) — the 2025-06 post was the platform-shape disclosure; this 2026-01 post is the algorithm-shape disclosure. Introduces 6 new concepts (Extended (R, s, Q), DES, Percentile objective, Counterfactual stockout demand modelling, Computational backtest, Ablation study forecast × objective) + 2 new patterns (Probabilistic forecast
Percentile objective, DES + Gradient-free optimiser under uncertainty) and extends the three ZEOS system pages + three existing concept pages (Monte Carlo / Probabilistic forecast / Gradient-free black-box optimisation). First wiki canonicalisation of DES, P75 cost objective, Extended (R, s, Q), counterfactual stockout-demand modelling, and LightGBM quantile regression for fashion commerce demand forecasting.
[2025-06-29] Building a dynamic inventory optimisation system: A deep dive — sources/2025-06-29-zalando-building-a-dynamic-inventory-optimisation-system-a-deep-dive
[2025-02-19] LLM powered migration of UI component libraries — sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries
[2025-02-16] Scaling Beyond Limits: Harnessing Route Server for a Stable Cluster — sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster
[2024-09-17] Content Creation Copilot - AI-assisted product onboarding — sources/2024-09-17-zalando-content-creation-copilot-ai-assisted-product-onboarding
[2024-01-22] Tale of 'metadpata': the revenge of the supertools — sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools
[2023-11-08] Patching the PostgreSQL JDBC Driver — sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver
[2023-10-18] Understanding GraphQL Directives: Practical Use-Cases at Zalando — sources/2023-10-18-zalando-understanding-graphql-directives-practical-use-cases-at-zalando
[2023-07-10] Rendering Engine Tales: Road to Concurrent React — sources/2023-07-10-zalando-rendering-engine-tales-road-to-concurrent-react
[2022-04-27] Operation Based SLOs — sources/2022-04-27-zalando-operation-based-slos
[2022-04-18] Zalando's Machine Learning Platform — sources/2022-04-18-zalando-zalandos-machine-learning-platform
[2021-10-14] Tracing SRE's Journey in Zalando - Part III — sources/2021-10-14-zalando-tracing-sres-journey-part-iii (Phase-2 → Phase-3 SRE organizational completion via late-2019 reorg: monitoring teams + Incident Management + SRE Enablement fold into one SRE department. 2020 SRE Strategy anchored on Observability standardisation via language-specific SDKs per Tech Radar. Four process/ product moves: anomaly vs incident separation in the incident process; MWMBR-derived alert rules replacing SLO-threshold paging in concepts/adaptive-paging; Error-Budget-aware Service Level Management tool; SRE Curriculum video+quiz modules folded into onboarding. First Zalando embedded SRE team for Checkout via customer pull with dual KPIs (Availability + On-Call Health). First documentation of the SRE KPI portfolio — incident count, MTTR, false positive rate, customer impact — and of concepts/on-call-health-metric as a first-class KPI.)
[2021-03-10] Micro Frontends: from Fragments to Renderers (Part 1) — sources/2021-03-10-zalando-micro-frontends-from-fragments-to-renderers-part-1
[2020-06-23] PgBouncer on Kubernetes and how to achieve minimal latency —