Zalando

What they do

Zalando SE is Europe's largest online fashion and lifestyle platform, headquartered in Berlin. Beyond commerce, Zalando is known in the systems community for heavy open-source contribution around Postgres-on-Kubernetes and for in-house experimentation platform work:

  • Postgres Operator: open-source Kubernetes operator for managing Postgres clusters with high availability, connection pooling, and backups.
  • Skipper: HTTP router and reverse proxy written in Go, used as the default Kubernetes Ingress proxy across 140+ Zalando Kubernetes clusters.
  • Kubernetes Ingress Controller for AWS: Zalando-incubator controller that provisions an AWS ALB + ACM cert for each Ingress.
  • Patroni: HA template for PostgreSQL (via Python + DCS), used industry-wide.
  • Spilo: Docker image for running Postgres with Patroni.
  • Octopus: in-house A/B testing / experimentation platform, released 2015; analysis system rebuilt on Apache Spark (systems/apache-spark) over ~2 years. Source of Zalando's crawl/walk/run experimentation-platform retrospective.

Zalando Engineering (engineering.zalando.com) is a Tier-2 source on the sysdesign-wiki: consistent output on distributed systems, Kubernetes, Postgres internals, cloud platform engineering, and experimentation / A/B-testing infrastructure, though the blog mixes in recruiting and product-announcement posts.

Wiki anchor axes

Zalando has twenty-eight canonical axes on the wiki.

Axis 6 is a four-post series (2021-03 UBFF + 2021-04 errors + 2022-02 persisted-queries/schema-stability + 2023-10 directive-taxonomy).

Axis 7 is now a four-post series (2021-03 micro-frontends + 2023-06 context-based Experience + 2023-07 concurrent-React + 2025-10 Rendering Engine on React Native). The 2023-06 Experience installment adds request-scoped presentation context as an orthogonal IF primitive: a named Experience bundle of policies + selection rules resolved once at root-entity time and propagated via the Rendering Engine request state to every child-renderer FSA call (patterns/request-state-propagated-experience).

Axis 5 is a six-part narrative (2020-10 Cyber Week + 2021-03 load-test automation + 2021-09 SRE tracing I + 2021-09 SRE tracing II + 2021-10 SRE tracing III + 2022-04 operation-based SLOs).

Axis 14 opens with the 2024-01 metadpata postmortem, canonicalising supertools and the 5-layer destructive-automation blast-radius containment stack.

Axis 15 (new, 2024-07) opens the end-to-end test probe tier as a direct downstream chapter of axis 5: browser-altitude synthetic monitoring of CBOs via Playwright on a 30-minute cron.

Axis 16 (new, 2024-09) opens the AI-assisted content onboarding / catalog-attribute copilot tier: Zalando's Content Creation Copilot behind a four-service aggregator contract (Content Creation Tool + Article Masterdata + Prompt Generator + OpenAI VLM backend), migrating launch-phase GPT-4 Turbo to GPT-4o with no downstream contract changes.

Axis 17 (new, 2025-02) opens the ingress control-plane scaling / coalescing proxy tier downstream of axis 1 and the 2020-06-30 ingress-stack launch: Zalando fronts ~300 Skippers per cluster with a single Route Server that polls the Kubernetes API every 3 s and serves all Skippers via HTTP + ETag / 304, turning an N×-fan-out overload on etcd into a 1× poll + N× cheap-304 delta channel.
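
The coalescing-proxy idea behind axis 17 can be illustrated with a small in-memory model. This is a minimal sketch, not Zalando's routesrv: the class names `RouteServer` and `ProxyClient`, the SHA-256 ETag, and the example route are assumptions made for illustration, and no real HTTP is involved. It shows one poller holding the routing table while many clients receive a cheap 304 unless the table changed.

```python
import hashlib
from typing import Optional, Tuple


class RouteServer:
    """Hypothetical single poller that caches the routing table and its ETag."""

    def __init__(self) -> None:
        self._body = ""
        self._etag = self._digest(self._body)

    @staticmethod
    def _digest(body: str) -> str:
        return hashlib.sha256(body.encode("utf-8")).hexdigest()

    def refresh(self, routes_from_api: str) -> None:
        # In the post this happens once per polling interval (every 3 s).
        self._body = routes_from_api
        self._etag = self._digest(routes_from_api)

    def get_routes(self, if_none_match: Optional[str]) -> Tuple[int, str, Optional[str]]:
        # Answer 304 with no body when the caller already holds the current table.
        if if_none_match == self._etag:
            return 304, self._etag, None
        return 200, self._etag, self._body


class ProxyClient:
    """Stands in for one Skipper instance polling the route server."""

    def __init__(self) -> None:
        self.etag: Optional[str] = None
        self.routes = ""
        self.full_fetches = 0

    def poll(self, server: Optional[RouteServer]) -> None:
        if server is None:
            # Route server unreachable: keep the last known good table.
            return
        status, etag, body = server.get_routes(self.etag)
        if status == 200 and body is not None:
            self.routes, self.etag = body, etag
            self.full_fetches += 1


server = RouteServer()
server.refresh('r1: Path("/a") -> "http://a.example";')
clients = [ProxyClient() for _ in range(300)]
for _ in range(3):  # three polling rounds against an unchanged table
    for client in clients:
        client.poll(server)
```

Only the first round transfers the table; later rounds are answered with 304 until `refresh` is called with different content.
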
Axis 17 is the canonical wiki instance of concepts/control-plane-fan-out-to-kubernetes-api and patterns/control-plane-proxy-with-etag-cache; it was rolled out via a three-mode off/shadow/exec flag.

Axis 26 (new, 2026-04) opens the Skipper admission-time validation tier as the second Skipper-specific control-plane tier after axis 17: Zalando plugs Skipper's own filter registry, predicate specifications, route parser, and backend parser into a Kubernetes validating admission webhook (ingress-admitter.teapot.zalan.do) so that kubectl apply on an Ingress or RouteGroup with an unknown predicate, bad filter arguments, or unparseable backend is rejected with the same error Skipper would give at request time. The architectural move is to treat Skipper's validator as a library and call it from both the admission path and the request path, so the two answers cannot drift (patterns/reuse-runtime-logic-on-admission-path). Scale framing: 250+ clusters, 15k+ ingresses, ~200k routes, 500k–2M RPS; at that scale even 1% invalid routes is ~2,000 broken routes and real production risk. Rolled out tier-by-tier behind -enable-advanced-validation (concepts/feature-flag-rollback-for-validator), guided by the skipper_route_invalid{route_id, reason} metric (concepts/invalid-route-observability-metric), such that teams writing valid manifests observed no change whatsoever: a canonical invisible rollout. Zavodskikh's test for the rollout shape passed when, asked at an internal conference how to enable it, the answer was "you don't need to — it's already on." Blast-radius class is control-plane-on-writes, not data-plane-on-traffic: a bad webhook would freeze CI/CD fleet-wide while live customer requests kept flowing on the old routing tables (concepts/control-plane-change-blast-radius). Upstreamed to open-source Skipper v0.24.18+.
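
The "validator as a library" move described for axis 26 can be sketched as one function called from two places. This is an illustrative sketch only: the predicate and filter tables, the route dictionary shape, and the function names are hypothetical stand-ins, not Skipper's actual registries or API, and the backend check is deliberately simplistic.

```python
from typing import Dict, Optional

# Hypothetical registries; the real Skipper registries are much larger.
KNOWN_PREDICATES = {"Path", "Host", "Method"}
FILTER_ARITY = {"setPath": 1, "status": 1}  # filter name -> expected argument count


def validate_route(route: dict) -> Optional[str]:
    """Shared validator: returns an error message, or None when the route is valid."""
    for predicate in route["predicates"]:
        if predicate not in KNOWN_PREDICATES:
            return f"unknown predicate: {predicate}"
    for name, args in route["filters"]:
        if name not in FILTER_ARITY:
            return f"unknown filter: {name}"
        if len(args) != FILTER_ARITY[name]:
            return f"invalid arguments for filter: {name}"
    # Simplified backend check for the sketch; real backends have more forms.
    if not route["backend"].startswith(("http://", "https://")):
        return f"unparseable backend: {route['backend']}"
    return None


def admission_review(route: dict) -> Dict[str, object]:
    """Admission path: reject the manifest before it is stored."""
    error = validate_route(route)
    return {"allowed": error is None, "message": error or ""}


def load_route(route: dict, routing_table: Dict[str, dict]) -> None:
    """Runtime path: the same validator guards the routing table."""
    error = validate_route(route)
    if error is not None:
        raise ValueError(error)
    routing_table[route["id"]] = route
```

Because both paths call `validate_route`, a manifest rejected at admission fails at load time with the identical message, which is the no-drift property the entry describes.
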

Axis 18 (new, 2025-09) opens the AI-powered postmortem analysis / fleet-insight-mining tier as a lateral companion to axis 5's SRE discipline: Zalando's datastore team runs a five-stage LLM map-fold pipeline (Postmortem Analysis Pipeline) over "thousands of archived postmortem documents" covering Postgres / DynamoDB / S3 / ElastiCache / Elasticsearch, producing cross-incident failure-pattern reports and "investment opportunity" proposals for engineering leadership. Canonical wiki instance of patterns/multi-stage-llm-pipeline-over-large-context and of postmortems as data goldmine; also canonicalises concepts/lost-in-the-middle-effect and concepts/surface-attribution-error as distinct LLM failure modes the multi-stage architecture was designed around. Quantified 2-year outcome disclosed: automated change validation for infrastructure-as-code "shielded" the team from 25% of subsequent datastore incidents; AWS ElastiCache 80% CPU ceiling surfaced as a strategic capacity-planning hotspot. Current generation runs on Claude Sonnet 4 on AWS Bedrock; the Gen-0 (NotebookLM) → Gen-1 (LM Studio on-prem with 3B/12B/27B open-source models) → Gen-2 (Bedrock frontier tier) transition was driven by compliance / legal review, not capability. Human curation glide path: 100% of outputs during development → 10–20% random sampling at maturity, with proofreading of the final Patterns-stage one-pager remaining a non-negotiable gate.
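
The map-fold shape of the axis 18 pipeline can be sketched with a stand-in model. This is a minimal sketch under stated assumptions: `LLM` is a hypothetical callable (prompt in, text out), the prompt strings are placeholders rather than Zalando's prompts, and the human-authored Opportunity stage is left out because it is not an LLM call.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns a completion


def summarize(llm: LLM, postmortem: str) -> str:
    return llm("SUMMARIZE\n" + postmortem)


def classify(llm: LLM, summary: str) -> str:
    return llm("CLASSIFY\n" + summary)


def analyze(llm: LLM, summary: str, technology: str) -> str:
    return llm(f"ANALYZE technology={technology}\n{summary}")


def find_patterns(llm: LLM, digests: List[str]) -> str:
    return llm("PATTERNS\n" + "\n".join(digests))


def run_pipeline(llm: LLM, postmortems: List[str]) -> str:
    digests = []
    for document in postmortems:
        # Map phase: each document is handled alone, so no single prompt
        # has to carry the whole archive.
        summary = summarize(llm, document)
        technology = classify(llm, summary)
        digests.append(analyze(llm, summary, technology))
    # Fold phase: only the short digests reach the final prompt.
    return find_patterns(llm, digests)
```

The fold prompt grows with the number of digests rather than with the size of the raw documents, which is how a staged design sidesteps the long-context failure mode named above.
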

  1. Postgres-on-Kubernetes / kernel-level latency (opened 2020-06-23) — empirical kernel-level measurement (perf, softirq tracepoints, network namespaces) combined with operator-level deployment pragmatism.
  2. Frontend platform evolution / micro-frontends → entity-based composition → concurrent React (opened 2021-03-10, extended 2023-07-10): Part 1 of a series on Zalando's second-generation frontend platform Interface Framework (IF; designed 2018, ~90% traffic by March 2021), which replaces the 2015-era Project Mosaic Fragment-based micro-frontend architecture with an entity-based page-composition model. Pages are request-time trees of typed Entities (Product, Collection, Outfit) chosen by personalisation, and Renderers (one-per-Entity-type React components; patterns/entity-to-renderer-mapping) are the contribution unit. The Rendering Engine (Node.js + browser runtime) walks the tree, applies declarative rendering rules, and composes the output. Cross-cutting concerns (monitoring, consent, A/B testing via Octopus, design system, bundle-size optimisation) move into the platform, and every PR is gated by Lighthouse CI + Bundle Size Limits + Web Vitals. The axis pairs tightly with axis 6 (the GraphQL BFF is IF's data aggregation layer). The 2023-07-10 Rendering Engine Tales post (sources/2023-07-10-zalando-rendering-engine-tales-road-to-concurrent-react) extends the axis with a React 18 / concurrent-rendering migration chapter: each Renderer becomes a <Suspense> boundary; renderToPipeableStream + hydrateRoot replace renderToNodeStream + hydrate (A/B-measured Fashion Store impact: INP −5.69 % / FID −8.81 % / LCP −2.43 % / FCP −0.23 % / bounce −0.24 %, Catalog page biggest wins at FID −17.11 %); Zalando deliberately rejects hook-based render-as-you-fetch in favour of an Application-State layer outside React where resolveEntity writes data into a central store and a Redux-useSelector-shaped Connector hook returns data or throws a Promise (four upstream blockers drove this: SuspenseList experimental, useTransition not nested-Suspense-aware, hook-initiated fetch timing coupled to render order, React data-streaming cache not final).
The stricter React-18 hydration surfaced a production hydration-mismatch taxonomy across "hundreds of Renderers": timers (fix: suppressHydrationWarning), timezone-localised dates (fix: explicit timeZone or backend-localize), a Safari-de-AT-Intl.NumberFormat runtime divergence where the thousand-separator differs between Safari's JavaScriptCore and Node.js's V8 (application-unfixable; only backend-localize works), and invalid HTML nesting (<div> inside <p>, <button> inside <button>; a React-18 mismatch). General escape hatch: patterns/mount-gated-client-only-rendering. Observability pattern: only forward the first onRecoverableError per session to Sentry (patterns/first-error-only-hydration-error-reporting) because post-first-error the fiber-vs-DOM list alignment cascades. Deferred to future posts: ordered-streaming/hydration technical details, final-architecture Fashion Store impact, React Server Components.
  3. Upstream JDBC-driver contribution for Postgres logical replication (opened 2023-11-08): Zalando's 2023-11 post (sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver) pairs the SBOM-driven dependency-governance axis with the complementary ecosystem-contribution altitude. When Zalando's fleet-wide operational pain surfaces a runaway-WAL-growth failure mode in the Postgres logical-replication protocol as implemented by pgjdbc, Zalando diagnoses the root cause (pgjdbc not responding to Postgres KeepAlive messages that carry the server's current LSN), builds the pure fix (KeepAlive-message LSN advancement: ack the KeepAlive-reported LSN when all Replication messages are flushed), upstreams it as pgjdbc PR #2941 (merged 2023-08-31, shipped in pgjdbc 42.7.0), and rolls out a locally-built 42.6.1-patched backport to their fleet via a parallel prod-vs-test Docker image split (the patched image goes to test first, verified via a multi-day flat-WAL-size graph, then promoted to production). Canonical instance of patterns/client-driver-fix-over-application-workaround at the JDBC-driver altitude; it complements the wiki's existing language-runtime upstream-the-fix instances (Meta jemalloc / WebRTC, Cloudflare V8 / Go, Netflix Java 21 virtual threads) with a database-driver-layer example. Load-bearing platform context: Zalando's low-code Postgres-sourced event-streaming platform runs "hundreds" of per-stream micro-applications embedding Debezium Engine (the embedded-library variant of Debezium, distinct from Kafka-Connect-hosted Debezium); the fleet-scale shape (many independent replication slots against shared-WAL Postgres primaries) is exactly the shape that exposes the WAL-pinning asymmetric-tables bug.
First wiki canonical instances of pgjdbc (the JDBC driver underlying every JVM Postgres client), of Debezium Engine (the embedded-library CDC shape distinct from Kafka Connect), of Zalando's event-streaming platform, and of the dummy-write heartbeat kludge (the industry-standard application-layer workaround the driver-level fix replaces). Two-year sequel (2025-12-18, sources/2025-12-18-zalando-contributing-to-debezium-fixing-logical-replication-at-scale): Debezium subsequently hard-disabled the pgjdbc keepalive flush via withAutomaticFlush(false) in PR #6472 because the feature conflicted with Debezium's own LSN management and broke the offset-store contract for most Debezium users, blocking Zalando's upgrade path. Zalando upstreamed two remediation contributions to Debezium 3.4.0.Final (released 2025-12-16): (a) lsn.flush.mode (DBZ-9641 / PR #6881), a three-mode enum (manual / connector (default) / connector_and_driver) making the pgjdbc flush opt-in; canonical instance of patterns/opt-in-driver-level-lsn-flush. (b) offset.mismatch.strategy (DBZ-9688 / PR #6948), a four-strategy enum (no_validation (default) / trust_offset / trust_slot / trust_greater_lsn) letting operators pick which position source is authoritative on startup mismatch; canonical instance of patterns/authoritative-slot-over-authoritative-offset. Load-bearing architectural insight the post canonicalises: logical-replication position is tracked in two independent locations (Debezium offset store + Postgres replication slot), and the right reconciliation strategy depends on operator-side invariants Debezium cannot know. Zalando trusts the slot because they run Patroni-managed Postgres with slot-survives-failover discipline since the mid-2010s plus MemoryOffsetBackingStore since 2018, which is structurally opposite to the Kafka-Connect-offset-topic posture of most Debezium users.
Both contributions ship with boolean → enum auto-mapping (flush.lsn.source → lsn.flush.mode; internal.slot.seek.to.known.offset.on.start → offset.mismatch.strategy). Updated platform scale at publication: "hundreds of event streams" processing "hundreds of thousands of events per second" across 100+ Kubernetes clusters at peak; pre-disable production run on Debezium 2.7.4 + pgjdbc 42.7.2 was "nearly two years, processing billions of events, with zero detected data loss from this mechanism." First wiki canonical instances of systems/patroni (as a system), concepts/lsn-flush-mode, concepts/offset-mismatch-strategy, concepts/slot-vs-offset-position-tracking, concepts/memory-offset-backing-store, patterns/opt-in-driver-level-lsn-flush, patterns/authoritative-slot-over-authoritative-offset, and patterns/backward-compatible-config-migration.
  4. Destructive-automation blast radius / supertool safety net (opened 2024-01-22): Adrian Chifor's 2024-01 postmortem of Zalando's November 2022 DNS outage (sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools) coins the term supertool for applications and scripts that wield fleet-wide destructive authority and names the canonical failure mode. A single p-typo turning YAML field metadata into metadpata collapsed an account-lifecycle job's accounts-in-scope set to empty, which its decommission code path interpreted as "all accounts decommissioned", triggering fleet-wide Route 53 hosted-zone deletion across the AWS Organization. "All of us except for the cloud infrastructure team were locked out of accessing AWS accounts and internal tools due to missing DNS entries." Recovery: DNS outage recovery via cached-entries-before-TTL-expiry and tiered essential-tooling → core-infrastructure → on-site restoration, with rotating Incident Commanders by expertise area keeping the response focused across phases.
The post catalogues a 5-layer containment stack that becomes the canonical wiki recipe for destructive-automation blast-radius reduction: (a) scream test: 1 week of Network ACL isolation + DNS delegation removal before real decommissioning (concepts/scream-test-for-deletion); (b) cost-weighted deletion deferral: low-savings resources excluded from automation with a 7-day cost-threshold gate; (c) triple-redundant jsonschema validation: IDE autocomplete via systems/yaml-language-server + pre-commit hook + CI pipeline, all against one schema, plus cfn-lint for CloudFormation templates; (d) PR preview of CloudFormation ChangeSet: bot reads CreateChangeSet from every AWS account in the organisation, merges into a human-readable PR comment, drops the ChangeSet (concepts/cloudformation-changeset); (e) phased rollout across release channels: extends the existing Kubernetes cluster-rollout shape to AWS accounts via named categories (playground → test → infra → production, concepts/release-channel-rollout). The post also names accelerated deletion pacing for cost savings as an amplifier ("As part of cost-saving measures, the pacing of executing deletion operations was sped up"): cost-optimization on destructive automation is itself a reliability risk. Axis 14 is the canonical wiki anchor for infrastructure-change safety nets at the AWS-account-lifecycle altitude, pairing with axis 5 (SRE evolution) at the operational-reliability altitude and axis 7 (frontend-platform) at the change-management-tooling altitude.
  5. AI-assisted content onboarding / catalog-attribute copilot (opened 2024-09-17): the 2024-09 post (sources/2024-09-17-zalando-content-creation-copilot-ai-assisted-product-onboarding) opens a new LLM/VLM-in-production axis for Zalando downstream of the existing ML-platform axis (2022-04) and the MDM axis (2021-07). The Content Creation Copilot is a four-service decomposition: Content Creation Tool (copywriter UI with purple-dot AI-provenance marker and pre-selected suggestions), Article Masterdata (system-of-record for opaque attribute codes + attribute sets per article type), Prompt Generator (the load-bearing orchestrator, which materialises the prompt from the schema, runs the bidirectional code↔English translation layer, filters attributes via a category-attribute relevance mapping, and selects input images per an empirical product-only-front > model-worn > other ranking), and the OpenAI LLM backend (GPT-4 Turbo at launch, migrated to GPT-4o). The system is explicitly framed as a model-agnostic aggregator: "we created an aggregator service - to integrate multiple AI services, leveraging a wider variety of data sources, such as brand data dumps, partner contributions, and images", a framing validated by the GPT-4 Turbo → GPT-4o swap with no downstream contract changes. The UX layer is the pre-select-with-visual-disclosure pattern: shift the copywriter's workload from enrich-then-QA to QA-only, displacing ~25% of the content-production timeline. Production numbers at publication: ~75% accuracy, ~50,000 attributes enriched per week across 25 markets. Weak-spot disclosure: long-tail fashion vocabulary (e.g. deep scoop neck) underperforms on balanced eval sets; unbalanced (production-representative) eval sets give higher headline numbers, an eval-set-design caveat that sits independent of model choice.
First wiki canonical instances of systems/zalando-content-creation-copilot, systems/zalando-content-creation-tool, systems/zalando-article-masterdata, systems/zalando-prompt-generator, systems/gpt-4o, concepts/opaque-attribute-code-translation-layer, concepts/ai-provenance-ui-indicator, concepts/category-attribute-relevance-mapping, concepts/input-image-selection-tradeoff, patterns/model-agnostic-suggestion-aggregator, and patterns/pre-select-ai-suggestions-with-visual-disclosure. Sibling to Instacart's PARSE (2025-08-01): same pattern and same concept in a different catalog domain (fashion vs. grocery); Zalando's copilot is the thinner, less-platformised, human-in-the-copywriting-loop production instance.
  6. Ingress control-plane scaling / coalescing proxy pattern (opened 2025-02-16): the 2025-02 post (sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster) opens a new control-plane-under-fan-out axis downstream of the infrastructure / platform-engineering line that runs through axis 1 (Postgres-on-K8s), the 2020-06-30 Skipper+ingress-stack launch, and axis 15 (es-operator partial failures). At ~180 Skipper pods per cluster × 200 clusters, each Skipper independently polled the Kubernetes API for Ingress + RouteGroup; etcd was overwhelmed and the API-server CPU throttled, threatening pod scheduling. This is the canonical instance of concepts/control-plane-fan-out-to-kubernetes-api. Remediation: insert Route Server (routesrv) as a single coalescing proxy between Skipper and the Kubernetes API, polling every 3 s and serving Skippers over HTTP with ETag / 304 (concepts/etag-conditional-polling); Skipper keeps the last routing table if routesrv goes away (concepts/last-known-good-routing-table). The rollout itself is the second load-bearing artifact: a three-mode False / Pre / Exec config flag (patterns/three-mode-rollout-off-shadow-exec) where Pre is an explicit shadow mode in which operators git diff Skipper-computed vs routesrv-computed Eskip before any cluster flips to Exec. Kubernetes Informers were explicitly rejected because they preserve the N× fan-out shape at change events. Result: Skipper HPA raised from ~180 to 300 pods/cluster, zero downtime, zero GMV loss. First wiki canonical instances of systems/zalando-route-server, concepts/control-plane-fan-out-to-kubernetes-api, concepts/etag-conditional-polling, concepts/last-known-good-routing-table, concepts/polling-interval-as-freshness-budget, patterns/control-plane-proxy-with-etag-cache, and patterns/three-mode-rollout-off-shadow-exec.
  7. LLM-powered code migration / frontend UI-library migration (opened 2025-02-19) — the 2025-02 post (sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries) opens a new axis distinct from but downstream of Axis 16's AI-assisted-content-onboarding copilot. Where Axis 16 canonicalises LLM-in-production at the catalog-enrichment altitude (four-service aggregator generating attribute suggestions for humans to QA), Axis 19 canonicalises LLM-in-engineering at the code-transformation altitude — a Python-based Component Migration Toolkit built by Partner Tech to migrate 15 B2B applications between two in-house UI component libraries. The load-bearing artefact is a frozen prompt with a three-layer Interface + Mapping + Examples composition — discovered through five offline iterative experiments (concepts/iterative-prompt-methodology) during an internal hackathon, not a runtime judge-LLM refinement loop. The mapping layer ("convert size='medium' to size='large' due to visual equivalence") is manually verified by pair programmers + designers — canonical wiki anchor for concepts/visual-equivalence-mapping, the information class source code alone cannot reveal. 
Production discipline: temperature=0 for reproducibility (concepts/temperature-zero-for-deterministic-codegen); <updatedContent> opaque output fencing (concepts/opaque-output-format-fencing), the same industry-convergent shape as Slack's <code></code> in the Enzyme→RTL codemod; 4K-output-token recovery via the conversation API and a literal "continue" prompt (concepts/continue-prompt-for-truncated-output), where "a simple 'continue' prompt proved more reliable than more complex prompts"; static/dynamic prompt partitioning for prompt-cache hits (concepts/static-dynamic-prompt-partitioning, patterns/prompt-cache-aware-static-dynamic-ordering), with the static prefix first and <file>{file_content}</file> last; logical component grouping to keep context tokens in the 40–50K accuracy sweet spot (concepts/logical-component-grouping-for-context-budget, patterns/grouped-component-batched-migration); LLM-generated examples riding in the prompt and replayed in CI as prompt-regression tests (concepts/llm-generated-prompt-regression-test). Production numbers: ~90% accuracy, < $40 per code repository via GPT-4o pricing, 30–200 seconds per file. Axis 19 is the wiki's canonical instance of the LLM-only code-migration pipeline pattern: a structural contrast-pair with Slack's AST+LLM hybrid (same problem class, different application-layer choice, in that the human effort that would go into AST-rule authoring goes into mapping-verification instead) and a scale-sibling with Cloudflare's vinext framework rewrite (both LLM-authored code at production scale, different task shape: whole-rewrite vs bulk transformation). Tool-surface substrates: continue.dev for prompt authoring; llm library + OpenAI API.
  8. ZEOS probabilistic-forecast + black-box inventory optimisation on zFlow / hybrid online+offline serving (opened 2025-06-29) — the 2025-06 post (sources/2025-06-29-zalando-building-a-dynamic-inventory-optimisation-system-a-deep-dive) opens a new B2B partner-facing ML product axis downstream of axis 10 (Zalando Payments real-time inference on zFlow, 2021-02-15) and axis 11 (ML Platform / zFlow overview, 2022-04-18). Axis 20 is the third publicly-named zFlow workload and the first one explicitly combining batch + real-time delivery against the same feature store — the ZEOS Inventory Optimisation System runs a Demand Forecaster weekly (5M SKUs, 3-year sliding window, 12-week probabilistic horizon via LightGBM + Nixtla MLForecast, full pipeline < 2 hours) and a Replenishment Recommender daily batch + online interactive via Monte Carlo simulation + gradient-free black-box optimiser. Load-bearing canonicalisations: (a) the two-tier feature-engineering split (concepts/data-preprocessing-vs-data-transformation-split) — PySpark+Databricks horizontal tier + SageMaker Processing Job vertical tier; (b) SageMaker Feature Store in both online and offline modes with an explicit parity invariant (patterns/online-plus-offline-feature-store-parity) — first wiki instance of dual-mode feature-store architecture; (c) single SageMaker Training Job train-and-infer — lightweight LightGBM model bypasses the separate inference-hosting tier entirely, no checkpointing; (d) async SQS → Lambda → multi-threaded optimiser (patterns/async-sqs-lambda-for-interactive-optimisation) for interactive what-if from the partner portal, with side-effect write-back to the offline feature store so future batch runs stay consistent; (e) proactive cache of daily batch predictions — precompute for all merchants × articles so the dashboard read path is a KV lookup; (f) decoupled cadences — weekly forecast + daily optimise + online what-if (patterns/weekly-batch-forecast-daily-batch-optimise-cadence); (g) 
drift monitoring as a pipeline stage via SageMaker Processing Job + CloudWatch alarms + Lambda; (h) model choice rationale over deep-learning (TFT etc.) — lightweight footprint + ecosystem + rapid prototyping + conformal inference for probabilistic output. First wiki canonical instances of systems/zeos-inventory-optimisation-system, systems/zeos-demand-forecaster, systems/zeos-replenishment-recommender, systems/aws-sagemaker-feature-store, systems/sagemaker-processing-job, systems/sagemaker-training-job, systems/sagemaker-batch-transform-job, systems/mlforecast-nixtla, systems/lightgbm, systems/numba, and all 14 new concept + 8 new pattern pages listed above.
  9. AI-powered postmortem analysis / fleet-insight mining (opened 2025-09-24): the 2025-09 post (sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis) opens a lateral AI-assisted companion to axis 5's SRE discipline. A five-stage LLM map-fold pipeline (Postmortem Analysis Pipeline) mines "thousands of archived postmortem documents" across the datastore Tech Radar (Postgres / AWS DynamoDB / AWS S3 / AWS ElastiCache / Elasticsearch) to surface recurring failure patterns and "investment opportunities" for engineering leadership. Five-stage architecture: Summarization (TELeR-shaped + refuse-on-ambiguity) → Classification (negative-example-shaped against surface-attribution; patterns/negative-example-prompting) → Analyzer (3–5-sentence causal digest, human-curation pivot point) → Patterns (LLM fold over all digests into a one-pager) → Opportunity (human-authored investment proposal). Canonical first wiki instance of patterns/multi-stage-llm-pipeline-over-large-context as an alternative to single large-context prompting, motivated by lost in the middle. Canonical first-wiki datums: (a) surface attribution error quantified at ~10% on Claude Sonnet 4, structural-not-scale; (b) hallucination rate evolution: small open-source 3B–12B at up to 40% → prompt-hardened + curated to < 15% → Claude Sonnet 4 "negligible"; (c) transition driver NotebookLM → LM Studio on-prem → AWS Bedrock is compliance / legal review, not capability; (d) human-curation glide path 100% during development → 10–20% random sampling at maturity with non-negotiable proofreading of the final Patterns-stage one-pager; (e) postmortems as data goldmine reframing of the archived corpus. Quantified 2-year outcome: automated change validation for infrastructure-as-code "shielded us from 25% subsequent datastore incidents"; AWS ElastiCache 80% CPU ceiling at peak surfaced as a strategic capacity-planning hotspot.
Two new Postgres failure disclosures also surfaced by the pipeline (extending axis 13): a Postgres 12 AUTOVACUUM LAUNCHER race condition crashing connection pools and a Postgres 16→17 upgrade triggering a logical-replication memory-leak bug under parallel DDL + heavy transactions. Pipeline scope limited to public technologies only — Zalando-internal systems like Skipper produce "unacceptable analysis" and are flagged as the fine-tuning roadmap. Explicitly not an agentic solution: "the initial concept of a no-code agentic solution was quickly deemed unfeasible due to performance limitations, inaccuracies, and hallucinations encountered during prototype development" — the pipeline is the control structure, not an agent loop.
  10. Rendering Engine + React Native mobile migration / brownfield RN at consumer-app scale (opened 2025-10-02): the 2025-10 post (sources/2025-10-02-zalando-accelerating-mobile-app-development-with-rendering-engine-and-react-native) extends axis 7 (Rendering Engine) from web-only to cross-platform (web + iOS + Android) and opens a new architectural axis, brownfield React Native integration at consumer-app scale (52M+ customers, 90+ screens across two native codebases). Three load-bearing canonicalisations: (a) React Native as a package: RN root + init logic packaged as an npm Entry Point consumed by both a greenfield Developer App (standard RN toolchain, bundle-switching dev menu, mock interop contracts; unlocks web engineers for RN contribution) and a native Framework SDK (iOS + Android library exposing a simple ReactNativeViewFactory surface that the legacy app links like any other framework); generalised by Callstack's react-native-brownfield package. (b) Turbo Module + DI contract: three-language API contract (TypeScript + Swift + Kotlin) with a DI injection slot for the legacy app's implementation, so wishlist-badge style RN↔native interop doesn't couple the SDK to legacy source; lesson: "first properly define these API contracts … otherwise, you run into challenges where the API design might not be feasible on all platforms" (patterns/api-contract-first-across-three-languages). (c) Cross-platform UI via HTML-subset + tokens: react-strict-dom (HTML subset → RN primitives on mobile, plain HTML on web, zero runtime cost on web via build-time stripping) + StyleX + ZDS tokens for cross-platform theming, with Metro file-resolution (Foo.native.ts / Foo.ios.ts / Foo.android.ts) as the per-platform escape hatch and react-strict-dom's compat.native as the in-component escape hatch. Chosen over react-native-web on substrate-longevity + zero-runtime-on-web grounds.
Validated at scale by the Discovery Feed migration, the new media-heavy front screen featured in Zalando's Q2 2025 financial results. Progressive adoption discipline: screen-by-screen with first screen low-traffic and simple for pipeline exercise before betting on a flagship. Structural contrast-pair with Shopify's greenfield team-by-team RN adoption (same "100% RN is not the goal" architectural posture at mixed-stack altitude, different starting-point inversion) and with Shopify's RN-new-architecture migration (both brownfield but at orthogonal altitudes: RN-architecture upgrade vs adding-RN-itself). First wiki canonical instances of systems/react-strict-dom, systems/stylex, systems/metro-bundler, systems/callstack-react-native-brownfield, systems/zalando-mobile-framework-sdk, systems/zalando-mobile-developer-app, systems/zalando-discovery-feed, systems/zalando-design-system-tokens, systems/react-navigation, systems/react-native-video, and all 9 new concept + 10 new pattern pages.
  11. Catalog-search self-inflicted DoS / per-caller observability & app-side admission control (opened 2025-12-16). The 2025-12 post (sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search) opens a standalone Search & Browse altitude on the wiki: the multi-layer Zalando Catalog Search substrate (Catalog API, NER query builder, and Search API wrapping Base Search Elasticsearch; enrichment by Algorithm Gateway and Promotions Bidding; downstream by Zalando Assistant). Canonical first-wiki datums: (a) self-inflicted DoS: an internal app issuing 20–100 req/s (vs cluster baseline "thousands of req/s") of high-cardinality terms aggregations on the SKU field pinned coordinator CPU and starved the search thread pool, producing multi-market customer-visible "search is slow" / "filters are broken" symptoms for hours despite low volume by any standard metric. (b) The 5 reasons the incident hid, enumerated verbatim as a diagnostic-gap template for zebra write-ups. (c) Zebra-not-horse heuristic: the investigator's bias-checker when the first-line horse-hypotheses have been eliminated and the symptom persists. (d) Adaptive Replica Selection, named as ES's in-cluster analogue of power-of-two-choices and specifically noted as unable to save the cluster because routing doesn't help when every shard copy is saturated. Three-piece follow-up program: (i) X-Opaque-Id client attribution + per-client slow-query dashboard (observability); (ii) app-side query limiting with dynamic thresholds (first-gate admission control); (iii) search.max_buckets cluster-wide guardrail (last-line defence). Incident-time mitigation via market split through node.attr.market allocation filters: the incident-time sibling of axis 18's steady-state market group isolation, applied at the ES storage tier rather than the PRAPI serving-API tier. 
Root cause identified via a Lightstep trace-exploration notebook that spotted the offending caller at 50× baseline fan-out — a trace-altitude rescue of a metric-altitude investigation. First wiki canonical instances of systems/zalando-catalog-search, systems/zalando-base-search, systems/zalando-catalog-api, systems/zalando-search-api, systems/zalando-ner-query-builder, systems/zalando-algorithm-gateway, systems/zalando-promotions-bidding, systems/zalando-assistant, concepts/self-inflicted-dos, concepts/high-cardinality-aggregation-overload, concepts/adaptive-replica-selection-elasticsearch, concepts/x-opaque-id-client-attribution, concepts/zebra-not-horse-heuristic, patterns/split-cluster-by-market-for-load-isolation, patterns/application-side-query-limit-with-dynamic-threshold, patterns/per-client-slow-query-dashboard, and patterns/cluster-wide-aggregation-guardrail.
  12. Search quality assurance with AI as a judge / pre-launch market validation (opened 2026-03-16). The 2026-03 post (sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge) opens the evaluation-side companion to axis 23's serving-side self-inflicted-DoS axis: Zalando's Search & Browse team shipped an offline LLM-as-a-judge evaluation framework (Search Quality Framework) built to validate catalog-search quality before launching into a new country with no user-data signal. Applied to Zalando's 2025 Luxembourg / Portugal / Greece launches. Load-bearing canonicalisations: (a) concepts/pre-launch-market-validation as a named problem class ("For an entirely new country, these signals are by definition not there yet"); the LLM-judge-shaped substitute for click-based QA. (b) NER-clustered query sampling via the Search Query Clustering pipeline: the NER engine's tag sets that cluster production queries for sampling also segment the output for diagnostic aggregation. (c) Visual-text relevance judgment: GPT-4o scores (query, product) pairs on a 0–4 rubric using product data + images as evaluation context with generalised reasoning (no per-attribute prompts). (d) Translated-query parity operationalised as cross-language NER-tag diffs: a diagnostic sidecar isolating NER-vocabulary issues from ranker / catalogue issues. Four disclosed PT violation shapes: lemmatisation drift ("desporto" / "desportivo" / "desportiva"), ambiguous collision ("tenis" / "ténis" vs tennis the sport), missing vocabulary ("menina", "meninas"), and multi-word term unrecognised ("fato de treino", tracksuit). (e) Segment-level root-cause diagnosis: three named failure classes surface as different segment-level patterns (incorrect product attributes; unrecognised NER terms; undiscoverable brand categories). Brand-wide failure example disclosed: BRAND=foo across 5 category segments all scoring 1.5–1.9 / 4.0. 
(f) (query, product) evaluation cache on ElastiCache, scoped only to evaluation tasks: collapses naive 5000 × 25 product-API + LLM calls to |unique products|; makes re-runs near-free and enables regression-detection on live markets. (g) Airflow TaskGroup parallelism: one TaskGroup per market, fan-out / fan-in with a consolidation task; each stage shipped as a Docker image via KubernetesPodOperator. Operational numbers disclosed: ~$250 per full run (GPT-4o completions dominate), 3–5 hours runtime, 1,500 segments × 25 results per market, 3 markets in parallel. Sibling to axis 16 (Content Creation Copilot) on the AI-for-search / AI-for-catalog axis: this axis targets validation rather than generation, and invokes GPT-4o as a judge rather than an attribute-extractor. Sibling to axis 18 (Postmortem Analysis Pipeline) on the LLM-for-operations axis: this is Zalando's third publicly-disclosed production LLM system, after the copilot and the postmortem analyser. First wiki canonical instances of systems/zalando-search-quality-framework, systems/zalando-search-query-clustering, concepts/pre-launch-market-validation, concepts/ner-clustered-query-sampling, concepts/translated-query-parity, concepts/visual-text-relevance-judgment, concepts/segment-level-root-cause-diagnosis, concepts/query-product-evaluation-cache, concepts/ner-tag-parity-across-languages, concepts/automated-test-generation-from-production-traffic, concepts/airflow-taskgroup-parallelism, patterns/llm-as-judge-for-search-quality, patterns/ner-clustered-query-sampling-from-production, patterns/segment-level-relevance-dashboard, patterns/query-product-evaluation-cache, patterns/per-market-parallel-taskgroup-dag, patterns/translated-query-ner-parity-check, and patterns/podoperator-encapsulated-evaluation-job.
  13. Flink state discipline / Table-API → DataStream-API multi-way-join rewrite on AWS Managed Flink (opened 2026-03-03). The 2026-03 post (sources/2026-03-03-zalando-why-we-ditched-flink-table-api-joins-cutting-state-by-75-with-datastream-unions) opens the stream-processing operational-cost axis, structurally adjacent to axis 23's Catalog Search Elasticsearch blast-radius axis but at a different subsystem: the Search & Browse team's Product Offer Enrichment Flink job on AWS Managed Flink 1.20. The original 4-way Table-API JOIN chain (offer + boost + sponsored + product events, keyed on SKU) hit state amplification (each join operator in Flink 1.x keeps its own independent RocksDB copy of both inputs for late-arrival correctness) and grew application state to 235–245 GB. Hourly savepoints became the dominant workload: "keep the cluster's CPU at 100% for nearly 12 minutes", producing backpressure, crash-restart loops, missed snapshots, and a permanent 10–20 % overscale margin on KPUs. The rewrite collapsed the chain into a single custom DataStream-API KeyedProcessFunction (union(...) → keyBy(SKU) → MultiStreamJoinProcessor) with one ValueState[EnrichmentState] per SKU (patterns/single-valuestate-over-chained-joins) and event-time / content filtering before state.update to avoid redundant RocksDB writes. Impact: state 235 GB → 56 GB (−76 %), snapshot duration 11 min → 2.5 min (−77 %), CPU 100 %-spike → ~30 %-stable, restart time 12–20 min → 4–5 min, AWS cost −13 % (sub-proportional because each KPU bundles 1 vCPU + 4 GB RAM + 50 GB storage; the savings came from dropping the overscale margin, not from proportional KPU reduction). The closing note is that Flink 2.1's experimental MultiJoin operator (FLIP-516, 2025-05-19) implements the same idea natively ("2x to over 100x+ increase in processed records; 3x to over 1000x+ smaller state") but managed-runtime version lag forced the DataStream rewrite: "we're covered by our home-baked solution." 
The canonical framing — Flink SQL is perfect for 90 % of use cases; a software engineer's value is in recognising the remaining 10 % — takes its place as this wiki's declarative-vs-imperative stream-API stanza. First wiki canonical instances of systems/flink-table-api, systems/flink-datastream-api, systems/aws-managed-flink, systems/flink-multijoin-operator, systems/zalando-product-offer-enrichment, concepts/flink-stateful-join-state-amplification, concepts/flink-snapshot-savepoint, concepts/flink-keyed-stream-union, concepts/kpu-aws-managed-flink, concepts/multi-way-join-operator-flink, concepts/declarative-vs-imperative-stream-api, patterns/stream-union-plus-keyed-process-function, patterns/single-valuestate-over-chained-joins, and patterns/event-time-filter-for-state-write-reduction.
  14. Skipper admission-time route validation / shift-left ingress correctness (opened 2026-04-08) — the 2026-04 post (sources/2026-04-08-zalando-rejecting-invalid-ingress-routes-at-apply-time) opens the admission-path control-plane enforcement tier as the second Skipper-specific control-plane axis after axis 17. Zalando's Skipper extends Kubernetes's Ingress / RouteGroup with its own predicates, filters, and backend DSL that the API server's standard admission pipeline cannot validate — a typo like Headr("X-Canary", "true") instead of Header(...) is valid YAML but broken Skipper. Rather than build a second validator, Zalando deployed a Kubernetes validating admission webhook (ingress-admitter.teapot.zalan.do) that reuses Skipper's own filter registry, predicate specs, route parser, and backend parser to validate objects at admission time (patterns/reuse-runtime-logic-on-admission-path). The webhook propagates Skipper's literal error message through the Kubernetes deny response — "predicate 'NonExistingPredicate' not found" — so engineers fix the manifest in place instead of tracking down a broken route later from a support channel (concepts/shift-left-validation). Load-bearing canonicalisations: (1) concepts/validating-admission-webhook as a named primitive with the admission-pipeline diagram and the failurePolicy: Fail vs Ignore trade-off. (2) concepts/shift-left-validation as the general engineering stance of moving a correctness check from runtime to write-time. (3) concepts/control-plane-change-blast-radius as the specialised concepts/blast-radius framing for control-plane enforcement: a bad webhook freezes CI/CD fleet-wide without affecting live customer traffic — different risk class than data-plane changes. 
(4) concepts/feature-flag-rollback-for-validator as the control-plane-enforcement specialisation of concepts/feature-flag: the -enable-advanced-validation flag gates the new Skipper-specific validation layer on top of the pre-existing basic validation, so a false-positive rollback is a flag-flip, not a binary redeploy. (5) concepts/invalid-route-observability-metric canonicalised by name, skipper_route_invalid{route_id, reason}, as the per-tier rollout gate that distinguishes real user mistakes (which the webhook should reject) from validator bugs (which mean the webhook is wrong). (6) patterns/reuse-runtime-logic-on-admission-path as the architectural pattern, contrasted against writing a second validator (drifts) or using a policy engine (correct for cluster-wide rules, wrong for domain-DSL validation where the runtime already has a validator). (7) patterns/invisible-rollout-via-default-on-validation as the rollout shape: tier-by-tier default-on enablement such that teams writing valid manifests experience zero change, with the flag + metric + error message discipline making that invisibility safe. Also adds Seen-in to existing patterns/feature-flagged-dual-implementation (applied to validator rollout instead of RN architecture migration). Scale framing: 250+ clusters, 15k+ ingresses, ~200k routes, 500k–2M RPS; at 1% invalid routes that is ~2,000 broken routes, a real production risk. Upstreamed to open-source Skipper v0.24.18+.
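Item 11's "app-side query limiting with dynamic thresholds" can be pictured with a minimal sketch. The source does not disclose how Zalando computes its thresholds, so everything here is an assumption for illustration: the `QueryGuard` class, its `base_limit` and `floor` values, and the rule that the allowed terms-aggregation size shrinks linearly with a CPU load signal are all hypothetical. Only the `X-Opaque-Id` header name and the Elasticsearch default terms-aggregation size of 10 come from Elasticsearch itself.

```python
from dataclasses import dataclass


def max_terms_size(node) -> int:
    """Return the largest `size` requested by any terms aggregation in a query body."""
    largest = 0
    if isinstance(node, dict):
        for key, value in node.items():
            # A terms *aggregation* carries a "field" key; Elasticsearch defaults size to 10.
            if key == "terms" and isinstance(value, dict) and "field" in value:
                largest = max(largest, int(value.get("size", 10)))
            largest = max(largest, max_terms_size(value))
    elif isinstance(node, list):
        for item in node:
            largest = max(largest, max_terms_size(item))
    return largest


@dataclass
class QueryGuard:
    base_limit: int = 1000  # hypothetical bucket budget when the cluster is idle
    floor: int = 50         # hypothetical minimum budget under full load

    def current_limit(self, cpu_load: float) -> int:
        """Shrink the budget linearly as the (assumed) load signal approaches 1.0."""
        load = min(max(cpu_load, 0.0), 1.0)
        return max(self.floor, int(self.base_limit * (1.0 - load)))

    def admit(self, query: dict, cpu_load: float) -> bool:
        """First-gate admission control: reject before the query reaches the cluster."""
        return max_terms_size(query) <= self.current_limit(cpu_load)


def request_headers(client_name: str) -> dict:
    """Tag every search request so slow queries can be attributed to a caller."""
    return {"X-Opaque-Id": client_name}
```

A guard like this sits in front of the cluster-wide `search.max_buckets` setting rather than replacing it: the application rejects obviously expensive requests early, and the cluster setting remains the last line of defence.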
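Item 12's (query, product) evaluation cache is easy to illustrate in miniature. The sketch below uses an in-memory dict as a stand-in for ElastiCache and a toy `fake_judge` function as a stand-in for the GPT-4o call; both are assumptions made so the example runs on its own, and the class and function names are hypothetical rather than taken from the Search Quality Framework.

```python
from typing import Callable, Dict, List, Tuple


class EvaluationCache:
    """Memoise judge scores per (query, product_id) pair."""

    def __init__(self, judge: Callable[[str, str], int]):
        self._judge = judge
        self._scores: Dict[Tuple[str, str], int] = {}  # stand-in for the remote cache
        self.judge_calls = 0

    def score(self, query: str, product_id: str) -> int:
        key = (query, product_id)
        if key not in self._scores:
            # Only a cache miss pays for the expensive judge call.
            self.judge_calls += 1
            self._scores[key] = self._judge(query, product_id)
        return self._scores[key]


def run_evaluation(cache: EvaluationCache, results: Dict[str, List[str]]) -> Dict[str, float]:
    """Average the 0-4 relevance score over each query's result list."""
    return {
        query: sum(cache.score(query, pid) for pid in products) / len(products)
        for query, products in results.items()
    }


def fake_judge(query: str, product_id: str) -> int:
    """Toy judge: 4 if every query word appears in the product id, else 1."""
    return 4 if all(word in product_id for word in query.split()) else 1
```

Running the same evaluation twice against one cache instance triggers judge calls only on the first pass, which is the property that makes re-runs and regression checks cheap.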
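Item 13's union → keyBy → single-state design can be modelled without Flink. The following is a plain-Python simulation, not PyFlink and not Zalando's MultiStreamJoinProcessor: a dict plays the role of keyed state, and the `Event`, `EnrichmentState`, and `MultiStreamJoin` definitions are hypothetical. It shows the two ideas the item names: one state object per SKU shared by all four input streams, and event-time / content filtering before any state write.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Event:
    kind: str        # "offer", "boost", "sponsored" or "product"
    sku: str
    event_time: int
    payload: str


@dataclass
class EnrichmentState:
    values: Dict[str, str] = field(default_factory=dict)      # kind -> latest payload
    timestamps: Dict[str, int] = field(default_factory=dict)  # kind -> latest event time


class MultiStreamJoin:
    """One state entry per SKU, fed by the union of all input streams."""

    def __init__(self) -> None:
        self.state: Dict[str, EnrichmentState] = {}  # stand-in for keyed ValueState
        self.state_writes = 0

    def process(self, event: Event) -> Optional[Dict[str, str]]:
        current = self.state.get(event.sku) or EnrichmentState()
        seen = current.timestamps.get(event.kind)
        # Event-time filter: ignore events that are not newer than what is stored.
        if seen is not None and event.event_time <= seen:
            return None
        # Content filter: skip the write when nothing would change.
        if current.values.get(event.kind) == event.payload:
            return None
        current.values[event.kind] = event.payload
        current.timestamps[event.kind] = event.event_time
        self.state[event.sku] = current
        self.state_writes += 1
        return dict(current.values)  # merged view emitted downstream
```

Because every stream updates the same per-SKU record, no input is stored more than once, which is the contrast with a chain of join operators that each keep their own copies.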
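Item 14's core idea, reusing the runtime's own validator on the admission path, can be sketched in a few lines. This is an illustrative model only: Skipper is written in Go and its real parser is far richer, the `KNOWN_PREDICATES` set is a small hand-picked subset, the regex-based name extraction is a simplification, and the error wording is modelled on the message quoted in the item. The returned dict mirrors the `allowed` and `status.message` fields of a Kubernetes admission response in reduced form.

```python
import re
from typing import Dict, List

# Illustrative subset of predicate names; the real registry lives in Skipper itself.
KNOWN_PREDICATES = {"Path", "Host", "Header", "Method"}


def validate_predicates(expression: str) -> List[str]:
    """Runtime-side check: report every predicate name missing from the registry."""
    names = re.findall(r"([A-Za-z_]\w*)\s*\(", expression)
    return [f'predicate "{name}" not found' for name in names if name not in KNOWN_PREDICATES]


def admission_review(predicates: str) -> Dict[str, object]:
    """Admission-side wrapper: call the same validator and surface its message verbatim."""
    errors = validate_predicates(predicates)
    return {"allowed": not errors, "status": {"message": "; ".join(errors)}}
```

The point of the shape is that `admission_review` contains no validation logic of its own, so the webhook cannot drift from what the proxy accepts at runtime.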

Key systems

Key concepts and patterns surfaced

Postgres-on-Kubernetes / kernel-level latency axis: observation of non-uniform kube-proxy load distribution. first-person reproduction with perf softirq tracepoints. fix. multi-core scaling primitive. reproducible benchmark recipe. Zalando Operator's default topology. the documented escape hatch.

  • patterns/static-site-via-ingress-proxy-to-s3-website: Skipper Ingress + S3 website endpoint as a CloudFront alternative when the ingress platform is already operated.
  • concepts/git-based-content-workflow: Zalando's PR-driven engineering blog publishing model.
  • concepts/reuse-existing-infrastructure-over-purpose-built-service: the explicit reasoning behind choosing Skipper over CloudFront for the blog.

Experimentation-platform axis:

Mobile testing discipline axis:

JVM integration testing discipline axis:

Cyber-Week prep / load-test automation axis:

Unified GraphQL BFF / API platform axis:

JVM language governance / Kotlin ADOPT ring axis:

MDM / knowledge-graph-driven data modeling axis:

  • concepts/master-data-management: the enclosing discipline; "technology-enabled discipline in which business and Information Technology work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise's official shared master data assets." Zalando chose consolidated-style MDM.
  • concepts/golden-record: "a common, shared, and trusted view on data for a particular domain"; the output of consolidation over source systems.
  • concepts/logical-data-model: the schema of the golden record; generated from the knowledge graph rather than authored directly.
  • concepts/transformation-data-model: per-source-system mapping showing direct (1-to-1) vs. indirect (1-to-many, transformation-function) column → concept mappings. The worked System A / System B Address example (free-text address lines vs. structured street / zip / city / country_code fields) illustrates both mapping types.
  • concepts/semantic-layer-of-business-concepts: the graph of Concept / Attribute / Relationship nodes between source schemas and the target logical data model; the "shared conceptual vocabulary" that makes business-engineering alignment tractable.
  • concepts/knowledge-graph: extended with Zalando MDM as its third canonical wiki instance (alongside Dropbox Dash retrieval substrate and Netflix UDA enterprise-data-integration substrate). A new H2 in the knowledge-graph concept page contrasts the three altitudes.
  • concepts/data-lineage: extended with Zalando MDM as a design-time byproduct Seen-in, complementary to the existing Meta (enforcement-primitive) and Redpanda (agent-interaction envelope) framings.
  • patterns/knowledge-graph-for-mdm-modeling: the core pattern; System / Table / Column / Concept / Attribute / Relationship node schema, Python generator, direct vs. indirect mappings.
  • patterns/mapping-driven-schema-generation: the generalised pattern across MDM, Netflix UDA, and dbt-style data-build tools: make the mapping authoritative and derive both target schema and transformation code.
  • patterns/visual-graph-for-business-engineering-alignment: Neo4j-rendered graph diagrams as the primary business-engineering communication artifact, replacing SQL DDL / spreadsheets. Named by the post as the #1 benefit.
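
The direct vs. indirect mappings and the mapping-driven generation pattern listed above can be sketched as follows. This is a minimal illustration, not Zalando's generator: the `Mapping` type, the column name `address_line`, and the `split_address_line` parser (which assumes a "street, zip city, country_code" layout) are hypothetical. A direct mapping copies one column to one attribute; an indirect mapping runs a transformation function that fans one column out to several attributes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Mapping:
    system: str
    column: str
    attributes: Tuple[str, ...]                               # target attributes of the Address concept
    transform: Optional[Callable[[str], Dict[str, str]]] = None  # None means a direct mapping


def split_address_line(line: str) -> Dict[str, str]:
    """Hypothetical parser for a free-text 'street, zip city, country_code' line."""
    street, zip_city, country = [part.strip() for part in line.split(",")]
    zip_code, city = zip_city.split(" ", 1)
    return {"street": street, "zip": zip_code, "city": city, "country_code": country}


MAPPINGS = [
    Mapping("B", "street", ("street",)),
    Mapping("B", "zip", ("zip",)),
    Mapping("B", "city", ("city",)),
    Mapping("B", "country_code", ("country_code",)),
    Mapping("A", "address_line", ("street", "zip", "city", "country_code"), split_address_line),
]


def logical_columns(mappings: List[Mapping]) -> List[str]:
    """Derive the target schema from the mappings instead of authoring it by hand."""
    columns: List[str] = []
    for mapping in mappings:
        for attribute in mapping.attributes:
            if attribute not in columns:
                columns.append(attribute)
    return columns


def to_golden_record(system: str, row: Dict[str, str], mappings: List[Mapping]) -> Dict[str, str]:
    """Derive the transformation from the same mappings."""
    record: Dict[str, str] = {}
    for mapping in mappings:
        if mapping.system != system or mapping.column not in row:
            continue
        if mapping.transform is None:
            record[mapping.attributes[0]] = row[mapping.column]
        else:
            record.update(mapping.transform(row[mapping.column]))
    return record
```

Both the schema and the row transformation fall out of one list of mappings, which is what "make the mapping authoritative" means in practice.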

Supply-chain / SBOM-driven dependency governance

  • systems/syft — Anchore's SBOM generator; runs on every Zalando deploy's container image and emits a CycloneDX / SPDX document that feeds the data-lake corpus.
  • systems/grype — Anchore's vulnerability scanner over syft-generated SBOMs; powers the CVE-correlation layer on top of the SBOM corpus.
  • systems/cyclonedx · systems/spdx — the two canonical SBOM formats; Zalando names both as "common formats" for portability + tooling integration.
  • systems/dependabot · systems/scala-steward — per-repo dependency-update bots named by the Zalando post as the tactical complement to the fleet-wide SBOM corpus.
  • systems/log4j — the canonical Log4Shell forcing function cited by the post as the defining mass-patch event the SBOM platform was built to handle.
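
A fleet-wide sweep over an SBOM corpus, the kind of query a Log4Shell-style event calls for, might look like the sketch below. It assumes each SBOM is a CycloneDX-style JSON document with a top-level `components` list whose entries carry `name` and `version`; the in-memory `sboms` dict stands in for the data-lake corpus, the function names are hypothetical, and the dotted-integer version parser is a deliberate simplification that would need replacing for real-world version strings.

```python
from typing import Dict, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a purely numeric dotted version such as '2.14.1' (simplification)."""
    return tuple(int(part) for part in version.split("."))


def affected_services(sboms: Dict[str, dict], package: str, fixed_version: str) -> Dict[str, str]:
    """Return service -> installed version for every service still below the fixed version."""
    fixed = parse_version(fixed_version)
    hits: Dict[str, str] = {}
    for service, sbom in sboms.items():
        for component in sbom.get("components", []):
            if component.get("name") == package and parse_version(component["version"]) < fixed:
                hits[service] = component["version"]
    return hits
```

The answer to "which deployed services ship this library below version X" then becomes one query over the corpus instead of a per-repository search.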

Postgres event-streaming platform / driver-layer fixes

Recent articles

systems/pgbouncer · systems/kubernetes · systems/postgis · systems/pg-tileserv · systems/leaflet · systems/openstreetmap · Spilo · systems/skipper-proxy · systems/external-dns · systems/kube-ingress-aws-controller · systems/octopus-zalando-experimentation-platform · systems/apache-spark · systems/randomizer-swift · systems/testcontainers · systems/spring-boot · systems/junit5 · systems/maven-surefire-plugin · systems/maven-failsafe-plugin · systems/localstack · systems/mockserver · systems/wiremock · systems/ryuk-testcontainers-reaper · systems/zalando-graphql-ubff · systems/graphql-jit · systems/graphql · systems/rfc-7807-problem-details · systems/zalando-interface-framework · systems/zalando-mosaic · systems/zalando-rendering-engine · systems/react · systems/typescript · systems/nodejs · systems/locust · systems/hoverfly · systems/zalando-load-test-conductor · systems/nakadi · systems/grafana · systems/amazon-ecs · systems/opentracing · systems/opentracing-toolbox · systems/zally · systems/fabric-gateway-zalando · systems/zalando-mdm-system · systems/neo4j · systems/zalando-adaptive-paging · systems/zalando-throughput-calculator · systems/zalando-slo-reporting-tool · systems/zalando-service-level-management-tool - concepts/example-based-test-constant-input-antipattern · concepts/type-class-driven-random-generator · concepts/test-pyramid · concepts/first-test-principles · concepts/h2-vs-real-database-testing · concepts/singleton-container-pattern · concepts/contract-testing · concepts/backend-for-frontend · concepts/unified-graph-principled-graphql · concepts/conways-law · concepts/micro-frontends · concepts/entity-based-page-composition · concepts/monorepo · concepts/graphql-error-extensions · concepts/error-action-taker-classification · concepts/problem-vs-error-distinction · concepts/graphql-error-propagation · concepts/schema-discoverability-gap-in-errors · concepts/declarative-load-test-api · concepts/kpi-driven-load-ramp-up · 
concepts/header-based-mock-switching · concepts/production-version-cloning-for-load-test · concepts/test-cluster-as-break-things-environment · concepts/adaptive-paging · concepts/sre-organizational-evolution · concepts/sre-program · concepts/critical-business-operation · concepts/symptom-based-alerting · concepts/operation-based-slo · concepts/service-tier-classification · concepts/error-budget · concepts/multi-window-multi-burn-rate · concepts/graceful-degradation · concepts/api-first-principle · concepts/tech-radar-language-governance · concepts/knowledge-graph · concepts/master-data-management · concepts/golden-record · concepts/logical-data-model · concepts/transformation-data-model · concepts/semantic-layer-of-business-concepts · concepts/data-lineage - patterns/property-based-testing · patterns/real-docker-container-over-in-memory-fake · patterns/failsafe-integration-test-separation · patterns/shared-static-container-across-tests · patterns/unified-graphql-backend-for-frontend · patterns/business-logic-free-data-aggregation-layer · patterns/per-platform-deployment-bulkhead · patterns/graphql-unified-api-platform · patterns/entity-to-renderer-mapping · patterns/page-performance-quality-gates · patterns/result-union-type-for-mutation-outcome · patterns/problem-type-for-customer-actionable-errors · patterns/error-extensions-code-for-developer-actionable-errors · patterns/live-load-test-in-production · patterns/declarative-load-test-conductor · patterns/kpi-closed-loop-load-ramp-up · patterns/mock-external-dependencies-for-isolated-load-test · patterns/header-routed-mock-vs-real-dependency · patterns/scheduled-cron-triggered-load-test · patterns/annual-peak-event-as-capability-forcing-function · patterns/situation-room-for-peak-event · patterns/unified-sre-team-over-federated · patterns/dogfood-as-adoption-proof · patterns/template-project-nudges-consistency · patterns/knowledge-graph-for-mdm-modeling · patterns/mapping-driven-schema-generation · 
patterns/visual-graph-for-business-engineering-alignment · patterns/postgres-extension-over-fork · patterns/database-as-tile-server-middleware-replacement · patterns/sbom-as-queryable-data-lake-asset · patterns/vulnerability-fleet-sweep-via-sbom-query · patterns/sbom-driven-dependency-bloat-audit · patterns/dependency-update-discipline · concepts/vector-tiles · concepts/sbom-software-bill-of-materials · concepts/container-extracted-sbom · concepts/uber-jar-metadata-loss · concepts/dependency-count-by-language-ecosystem · systems/syft · systems/grype · systems/cyclonedx · systems/spdx · systems/dependabot · systems/scala-steward · systems/log4j · systems/pgjdbc-postgres-jdbc-driver · systems/debezium · systems/debezium-engine · systems/zalando-postgres-event-streams · concepts/logical-replication · concepts/postgres-logical-replication-slot · concepts/runaway-wal-growth · concepts/keepalive-message-lsn-advancement · concepts/dummy-write-heartbeat-kludge · concepts/transitive-dependency-override-build · patterns/client-driver-fix-over-application-workaround · patterns/parallel-docker-image-prod-vs-test-for-patched-library · patterns/upstream-the-fix · systems/rds-health · systems/aws-rds · systems/aws-performance-insights · concepts/golden-signals-rds · concepts/database-fleet-standardisation · concepts/cpu-utilisation-ceiling-database · concepts/cache-hit-ratio-memory-pressure · concepts/storage-io-latency-sli · concepts/sql-efficiency-ratio · patterns/fleet-wide-methodology-via-cli · systems/zalando-content-creation-copilot · systems/zalando-content-creation-tool · systems/zalando-article-masterdata · systems/zalando-prompt-generator · systems/gpt-4 · systems/gpt-4o · concepts/opaque-attribute-code-translation-layer · concepts/ai-provenance-ui-indicator · concepts/category-attribute-relevance-mapping · concepts/input-image-selection-tradeoff · concepts/multi-modal-attribute-extraction · patterns/model-agnostic-suggestion-aggregator · 
patterns/pre-select-ai-suggestions-with-visual-disclosure · patterns/llm-attribute-extraction-platform · patterns/human-in-the-loop-quality-sampling · systems/zalando-prapi · systems/caffeine · systems/netty · systems/okio · systems/dynamodb · concepts/market-group-country-isolation · concepts/async-loading-cache-stale-window · concepts/stale-while-revalidate-cache · concepts/tail-latency-spike-during-queueing · concepts/event-stream-competing-sources-of-truth · concepts/legacy-format-emission-for-incremental-sunset · concepts/cqrs · concepts/conways-law · concepts/consistent-hashing · patterns/bounded-load-consistent-hashing · patterns/power-of-two-choices · patterns/api-as-single-source-of-truth-over-event-streams · patterns/accept-header-format-negotiation-for-legacy-sunset · patterns/async-refresh-cache-loader · patterns/lifo-queuing-for-tail-latency · patterns/market-group-isolation-for-serving-api · patterns/zero-allocation-cache-payload

Last updated · 542 distilled / 1,571 read