
ZALANDO 2025-03-06


Zalando — From Event-Driven Chaos to a Blazingly Fast Serving API

Summary

Zalando (2025-03-06) documents PODS (Product Offer Data Split) and its replacement serving layer PRAPI (Product Read API): the team's multi-year effort to dismantle a 2016-era event-driven product-data pipeline that had become the bottleneck for the whole platform. In the legacy design every consumer team had to subscribe to an event stream and build its own local store of Product + Price + Stock; payloads were 90% unchanged Product data shipped alongside every Stock/Price delta, which during Cyber Week produced 30-minute event delays.

PODS' architectural fix is to make Product data retrievable via three HTTP endpoints (/products/{id}, /products/{id}/offers, /product-offers/{id}) that together outperform every team's local copy, so the answer to "where do I get Product data?" is finally "call PRAPI" rather than "replay events from the dawn of time." The serving tier is a Caffeine async-loading cache fronting DynamoDB, running on Netty with Linux-native Epoll, horizontally partitioned per-country via Market Groups (country-level blast-radius isolation) and routed via Skipper's Consistent Hash Load Balancing (CHLB) so that the 10% hot portion of Zalando's 10-million-product catalogue stays cached without requiring every pod to hold the whole hot set. Single-GET P99 is sub-10 ms on ~1,000-line JSON payloads; batch-GETs of up to 100 items track the P999 of single GETs.

The post also describes two load-balancer contributions Zalando made upstream to Skipper — fixed-100-position pod placement to reduce scale-event cache invalidations to 1/N, and bounded-load to keep hyped products from overloading their hash-ring owner — plus operational tuning: LIFO queuing over FIFO for tail latency, zero-allocation ByteArray-cached product payloads and Okio-buffer pre-gzipped product-sets to eliminate GC pauses on the NIO event loop, and a two-client DynamoDB fallback with 10 ms primary / 100 ms fallback timeouts.

The organisational half of the story is CQRS-shaped: the Product department split into a Partners & Supply team (command side, ingestion) and a Product Data Serving team (query side, aggregation/retrieval), explicitly citing Conway's Law and Martin Fowler's CQRS bliki.
Legacy sunset is handled via Accept-header-negotiated format emission (application/json new standard + three application/x.{alpha,beta,gamma}-format+json legacy formats) and by having PRAPI emit the legacy alpha/beta formats back onto the old event streams for clients not yet migrated — buying clients a fixed sunset window while letting legacy producing applications be decommissioned immediately.

Key takeaways

  1. The 2016-era microservices anti-pattern PRAPI dismantled was "every team subscribes to the event stream and builds its own local store." In Zalando's own framing: "A simple request—'I'm building a new feature and need access to product data. Where do I get it?'—had an unreasonable answer: 'Subscribe to our event stream, replay events from the dawn of time, and build your own local store.'" Teams without capacity inherited an upstream team's view, producing a competing-sources-of-truth whispering game — "product data was altered at each step, and by the end, it no longer resembled the original." The fix is explicit: API over event stream for consumer access (the producer side stays event-driven). (See patterns/api-as-single-source-of-truth-over-event-streams.)

  2. Offer-composition pipeline pathology: >90% of payload unchanged. The legacy pipeline merged Product + Price + Stock events and re-processed the whole composite for every change, even though "Frequent stock and price updates were processed alongside mostly static Product data, with over 90% of each payload unchanged—wasting network, memory, and processing resources." The fix is to split Product (static-ish) from Offer (fast-changing), store them independently, and only recompose for the Presentation API endpoint that needs the joined view. This is a classic storage-layout optimisation applied at the serving-pipeline altitude.
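The storage split can be sketched in a few lines. The store shapes and field names below are invented for illustration, not Zalando's schema; the point is that Offer writes no longer touch the static Product payload.

```python
# Hypothetical stores: static Product rows and fast-changing Offer rows are
# written independently; the joined view is composed only at read time.
products = {}   # product_id -> mostly-static attributes
offers = {}     # product_id -> fast-changing price/stock

def on_product_event(pid, data):
    products[pid] = data                                 # rare, large writes

def on_offer_event(pid, price, stock):
    offers[pid] = {"price": price, "stock": stock}       # frequent, tiny writes

def get_product_offer(pid):
    # Recompose only for the endpoint that needs the joined view.
    return {**products[pid], **offers[pid]}
```

A price drop now rewrites two small fields instead of re-processing a composite payload that is >90% unchanged Product data.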

  3. PRAPI latency targets: single-item P99 ≤ 50 ms, batch P99 ≤ 100 ms; actual result sub-10 ms single-GET P99 on ~1,000-line JSON payloads, with batch-GET up to 100 items tracking the P999 of single-GET. "PRAPI performs better under load—as traffic increases, more of the product catalogue remains cached, reducing latency." This is the textbook consistent-hash-plus-hot-item effect: higher QPS keeps working set warm.

  4. Market Groups — country-level blast-radius isolation. "To achieve a level of country-level isolation, multiple instances of PRAPI are deployed—known as Market Groups—with each serving a subset of our countries. Routing configuration allows us to dynamically shift traffic between Market Groups, allowing us to isolate internal or canary test traffic from high-value country traffic." Per-Market-Group Updaters scale horizontally on Kafka lag up to partition count. Populating a fresh Market Group from cold takes "mere minutes" because the bottleneck is DynamoDB write capacity, not the fetch rate. (See patterns/market-group-isolation-for-serving-api, concepts/market-group-country-isolation.)

  5. Ingestion pod sizing (verbatim): each Updater pod "Reads batches of 250 products, Subpartitions events by Product ID, Issues 10 concurrent batch writes of 25 items to DynamoDB." So a single pod keeps up to 250 DynamoDB items in flight (10 × 25). The limiting knob is DynamoDB's write capacity units — the bottleneck sits on the AWS side, not in PRAPI.
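The stated shape can be sketched as follows. `batch_write` is a stand-in for the DynamoDB BatchWriteItem call, and the de-duplication rationale is an inference: subpartitioning by Product ID keeps two writes for the same key out of one flush, which BatchWriteItem requires.

```python
# Sketch of the stated Updater shape: read a 250-product batch, subpartition
# events by Product ID, then issue 10 concurrent 25-item batch writes.
from concurrent.futures import ThreadPoolExecutor

WRITE_BATCH = 25    # items per BatchWriteItem call
CONCURRENCY = 10    # concurrent writes -> up to 250 items in flight

def subpartition(events):
    # Keep only the latest event per Product ID so one flush never
    # writes the same key twice.
    latest = {}
    for e in events:
        latest[e["product_id"]] = e
    return list(latest.values())

def flush(events, batch_write):
    items = subpartition(events)
    chunks = [items[i:i + WRITE_BATCH] for i in range(0, len(items), WRITE_BATCH)]
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        list(pool.map(batch_write, chunks))     # 10 writes at a time
    return len(chunks)
```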

  6. Cache tier: Caffeine async-loading, 60 s TTL with 15 s stale window. "Using its async loading cache, we configured a 60 second cache time with the final 15 seconds as the stale window. In the last 15 seconds, retrieving a cache entry triggers a background refresh from DynamoDB." This is the canonical stale-while-revalidate shape at the application-cache altitude — serving latency stays bounded even on near-expiry hits because the refresh is asynchronous. (See concepts/async-loading-cache-stale-window, patterns/async-refresh-cache-loader.)
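A toy cache with those numbers, to show the behaviour only; this is not Caffeine's API (in Caffeine terms it corresponds roughly to expireAfterWrite = 60 s with refreshAfterWrite = 45 s on an AsyncLoadingCache).

```python
# Minimal stale-while-revalidate sketch: entries live 60 s; a read in the
# final 15 s returns the cached value immediately and triggers a background
# refresh, so near-expiry hits never wait on the data store.
import threading, time

TTL, STALE = 60.0, 15.0

class SwrCache:
    def __init__(self, loader, clock=time.monotonic):
        self.loader, self.clock = loader, clock
        self.store = {}   # key -> (value, loaded_at)

    def get(self, key):
        now = self.clock()
        entry = self.store.get(key)
        if entry is None or now - entry[1] >= TTL:
            value = self.loader(key)               # miss or expired: load inline
            self.store[key] = (value, now)
            return value
        value, loaded_at = entry
        if now - loaded_at >= TTL - STALE:         # inside the stale window
            threading.Thread(target=self._refresh, args=(key,)).start()
        return value                               # serve the cached value now

    def _refresh(self, key):
        self.store[key] = (self.loader(key), self.clock())
```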

  7. Hot-set sizing problem: 10% of 10M products = 1M large payloads per pod — infeasible. "even if just 10% of our 10 million products are hot, caching 1 million large (~1000-line JSON) product payloads per pod is simply not feasible." Fix: Consistent Hash Load Balancing (CHLB) at the Skipper ingress — each backend pod lives at multiple fixed positions on a hash ring; product-id hash locates the owning pod. "This partitions our catalogue between the available pods, allowing small local caches to effectively cache hot products. The wider we scale, the higher the portion of our catalogue that is cached." CHLB is for the single-item products component; the batch component uses Power-of-Two Random Choices instead because it unpacks batches into concurrent single-item lookups aggregated on return.
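The routing idea as a sketch. Hash function, pod names, and structure are illustrative (Skipper's implementation differs in detail); what matters is that a product id always resolves to the same pod, so that pod's small local cache stays hot for its slice of the catalogue.

```python
# Consistent-hash sketch: each pod owns a set of fixed ring positions; a
# product id hashes to a point on the ring and routes clockwise to the next
# pod position, so the same product always lands on the same pod.
import bisect, hashlib

POSITIONS_PER_POD = 100   # mirrors Skipper's fixed placement count

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, pods):
        self.ring = sorted(
            (_hash(f"{pod}-{i}"), pod)
            for pod in pods for i in range(POSITIONS_PER_POD)
        )
        self.points = [h for h, _ in self.ring]

    def owner(self, product_id):
        idx = bisect.bisect_right(self.points, _hash(product_id)) % len(self.ring)
        return self.ring[idx][1]
```

Scaling out adds positions for the new pods only, so each pod caches a smaller slice and a larger share of the catalogue fits in cache overall.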

  8. Two upstream contributions Zalando made to Skipper's CHLB. (a) Fixed-100-position placement — previously, scale-up/down rebalanced ring positions and mass-invalidated caches; assigning each pod to 100 fixed locations on the ring reduces cache misses to 1/N where N is the previous pod count (zalando/skipper#1712). (b) Bounded-load — caps per-pod traffic at 2× the average; "Once exceeded, requests spill over clockwise to the next non-overloaded pod, keeping hyped products cached and distributed" (zalando/skipper#1769). This is the canonical defence against the extremely-hot product tier — Zalando's example is limited-edition Nike sneaker drops. (See patterns/bounded-load-consistent-hashing.)
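A toy version of the bounded-load rule. The cap formula follows the standard consistent-hashing-with-bounded-loads construction, ceil(c·(m+1)/n) with c = 2; the simplification (a plain in-flight counter per pod, pods given in clockwise ring order) is mine, not Skipper's code.

```python
# Bounded-load sketch: a pod may hold at most ~2x the average in-flight load;
# requests for an overloaded pod spill clockwise to the next pod under the cap.
import math

BOUND = 2.0  # the 2x-average cap from skipper#1769

def pick_pod(ring_order, owner_idx, in_flight):
    """Pick a pod for a request whose hash owner is ring_order[owner_idx]."""
    total = sum(in_flight.values())
    cap = math.ceil(BOUND * (total + 1) / len(ring_order))
    for step in range(len(ring_order)):
        pod = ring_order[(owner_idx + step) % len(ring_order)]
        if in_flight[pod] + 1 <= cap:      # pod is under the bound: take it
            return pod
    return ring_order[owner_idx]           # all at cap: fall back to owner
```

A hyped product still lands mostly on its owner (keeping it cached there), but the overflow goes to the clockwise neighbour instead of melting one pod.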

  9. JVM tuning via Java Flight Recorder + JDK Mission Control. "By capturing telemetry from underperforming pods and visualising it in JDK Mission Control, we identified Garbage Collection (GC) pauses and ensured no blocking tasks ran on NIO thread pools." Two concrete GC-avoidance tactics: Products — cache as a single ByteArray instead of an ObjectNode graph (reduces heap pressure per entry). Product-Sets — store gzipped responses in Okio buffers and concatenate them directly into the response object, eliminating unnecessary gunzip/re-gzip operations. The principle the post extracts: "The best way to eliminate GC pauses is to avoid object allocation altogether." (See patterns/zero-allocation-cache-payload.)
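The Okio trick leans on a property that is easy to demonstrate: gzip is a multi-member format, so pre-compressed chunks can be concatenated byte-for-byte into one valid stream with no gunzip/re-gzip step. The sketch below shows only that property, not Zalando's actual response assembly (the cached payloads here are invented).

```python
# Cache pre-gzipped payloads and assemble a compressed response by byte
# concatenation: concatenated gzip members form one valid gzip stream.
import gzip

cache = {
    "p1": gzip.compress(b'{"id":"p1"}'),
    "p2": gzip.compress(b'{"id":"p2"}'),
}

def gzipped_product_set(ids):
    # No decompression, no re-compression, no per-request allocation of
    # decompressed intermediates: just join the cached compressed bytes.
    return b"".join(cache[i] for i in ids)
```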

  10. LIFO over FIFO for tail-latency-sensitive queues. "In latency-sensitive applications, FIFO queuing can create long-tail latency spikes." Two instances: (a) Load balancer — switched to LIFO where queuing was unavoidable; (b) DynamoDB clients — primary client with 10 ms timeout, fallback client with 100 ms timeout for retries; "This prevented FIFO queuing on the primary client during DynamoDB latency spikes." The mechanism: under FIFO, a latency spike fills the queue head-end and every subsequent request serialises behind it; under LIFO the newest requests jump the line, so tail requests get served promptly at the cost of head-of-queue staleness. (See patterns/lifo-queuing-for-tail-latency, concepts/tail-latency-spike-during-queueing.)
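The mechanism shows up in a toy queue simulation (numbers and workload invented for illustration): one request stalls the server for 50 time units while new requests keep arriving every 2 units, each costing 1 unit.

```python
# Toy FIFO-vs-LIFO queue simulation: one head-of-queue stall, steady arrivals.
def simulate(policy):
    arrivals = list(range(0, 200, 2))          # 100 requests
    costs = {t: 1 for t in arrivals}
    costs[0] = 50                              # the head-of-queue stall
    queue, latencies, clock, i = [], {}, 0, 0
    while i < len(arrivals) or queue:
        while i < len(arrivals) and arrivals[i] <= clock:
            queue.append(arrivals[i])          # everything arrived by now
            i += 1
        if not queue:                          # idle: jump to next arrival
            clock = arrivals[i]
            continue
        t = queue.pop(0) if policy == "fifo" else queue.pop()
        clock = max(clock, t) + costs[t]
        latencies[t] = clock - t               # completion minus arrival
    return latencies

fifo = simulate("fifo")
lifo = simulate("lifo")
```

Under FIFO every request arriving during the incident serialises behind the stall; under LIFO the newest request is served first, so only the stranded oldest requests see extreme latency (the head-of-queue staleness the post accepts as the trade-off).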

  11. Sunset via Accept-header format negotiation. PRAPI took ownership of all four legacy data representations and exposes them as media types: "application/json — New standard format for all teams; application/x.alpha-format+json — Legacy (previously on event stream); application/x.beta-format+json — Legacy (previously on event stream); application/x.gamma-format+json — Legacy (from Presentation API)." In parallel, "temporary components within PRAPI emitted alpha and beta formats back onto the legacy event streams. This enabled legacy applications to be decommissioned immediately, while teams gradually migrated off the legacy formats within a fixed sunset period." This is the two-sided migration primitive — producers can decommission now, consumers migrate on their own schedule. (See patterns/accept-header-format-negotiation-for-legacy-sunset, concepts/legacy-format-emission-for-incremental-sunset.)
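The negotiation shape as a sketch: one endpoint, four representations selected by Accept header, defaulting to the new standard. Only the four media types come from the post; the serializers and field renamings below are invented placeholders.

```python
# Hypothetical per-media-type serializers keyed by Accept header value.
FORMATS = {
    "application/json": lambda p: {"id": p["id"], "name": p["name"]},
    "application/x.alpha-format+json": lambda p: {"productId": p["id"]},
    "application/x.beta-format+json": lambda p: {"product_id": p["id"]},
    "application/x.gamma-format+json": lambda p: {"sku": p["id"]},
}

def render(product, accept="application/json"):
    serializer = FORMATS.get(accept)
    if serializer is None:
        raise ValueError("406 Not Acceptable: " + accept)
    return serializer(product)
```

One storage representation, four wire formats: legacy consumers keep their format for the sunset window while new consumers get application/json, and the legacy producing services can be deleted immediately.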

  12. Conway's Law / CQRS-shaped org split. "Following Conway's Law and CQRS principles, the Product department was restructured into two stream-aligned teams: Partners & Supply – Manages data ingestion (Command side); Product Data Serving – Focuses on aggregation and retrieval (Query side)." The Fowler CQRS bliki and Conway's Law are linked directly in the post. The organisational change is framed as required for the technical change, not a happy coincidence — the legacy ~350-team tangle of event subscribers wouldn't have migrated off without a clear Product-data steward team.

Systems

System · Role · Notes
systems/zalando-prapi · Product Read API serving tier · 4-component Kubernetes deployment: Updaters, Products, Product-Offers, Batch; per-Market-Group isolation
systems/caffeine · In-process Java cache library · Async-loading cache, 60 s TTL + 15 s stale window
systems/netty · Java NIO networking framework · EventLoop + Linux-native Epoll transport for end-to-end non-blocking I/O
systems/okio · Square I/O buffer library · Zero-copy gzipped-response concatenation to avoid gunzip/re-gzip
systems/dynamodb · Key-value datastore under PRAPI · Cache-miss fallback; 10 ms primary / 100 ms fallback client timeouts
systems/skipper-proxy · Zalando ingress proxy · CHLB + bounded-load; PRAPI contributed both upstream
systems/kubernetes · Deployment substrate · Each PRAPI component is a Deployment; per-Market-Group instances

Concepts

Concept · Why load-bearing here
concepts/consistent-hashing · CHLB partitions catalogue across pods; hot products stay in their owner-pod's local cache
concepts/cqrs · Product department split into command (Partners & Supply) + query (Product Data Serving) sides
concepts/conways-law · Cited directly as justification for the org restructure mirroring the new architecture
concepts/market-group-country-isolation · Multi-instance deployment per country-subset; routing shifts traffic between groups
concepts/async-loading-cache-stale-window · Caffeine 60 s TTL with 15 s trailing window triggers background refresh on read
concepts/stale-while-revalidate-cache · The general semantic the async-loading stale window implements
concepts/tail-latency-spike-during-queueing · Why FIFO queuing serialises requests behind a head-of-queue stall; LIFO avoids it
concepts/event-stream-competing-sources-of-truth · The "whispering game" pathology of multi-hop event-driven derivations
concepts/legacy-format-emission-for-incremental-sunset · PRAPI emits alpha/beta back onto event streams so producers can decommission ahead of consumers
concepts/garbage-collection · JFR-guided elimination of GC pauses; ByteArray over ObjectNode; Okio-buffer concatenation

Patterns

Pattern · How PRAPI uses it
patterns/power-of-two-choices · Batch component unpacks batches into concurrent single-item lookups routed by P2C to the less-loaded of two randomly-selected pods
patterns/bounded-load-consistent-hashing · Skipper CHLB cap at 2× average per-pod; excess spills to next non-overloaded pod on the ring
patterns/api-as-single-source-of-truth-over-event-streams · PRAPI supersedes per-team event-derived local stores; consumer access model moves from "subscribe and rebuild" to "call the API"
patterns/accept-header-format-negotiation-for-legacy-sunset · One endpoint emits 4 formats via Accept header; sunset window negotiated per-consumer
patterns/async-refresh-cache-loader · Caffeine's AsyncLoadingCache: reads in the stale window trigger a background DynamoDB refresh
patterns/lifo-queuing-for-tail-latency · LB + DynamoDB client queues switched to LIFO to avoid head-of-queue stalls serialising subsequent requests
patterns/market-group-isolation-for-serving-api · Country-subset blast-radius containment via multi-instance deployment with routing-layer traffic shift
patterns/zero-allocation-cache-payload · Cache as ByteArray; store gzipped responses in Okio buffers and concatenate — avoid ObjectNode graphs and gunzip/re-gzip

Operational numbers (verbatim where available)

Figure · Value · Source
Legacy Cyber Week event delay · Up to 30 minutes on stock/price events · "During Cyber Week, stock and price events could be delayed by up to 30 minutes"
Legacy payload overhead · >90% unchanged per event · "over 90% of each payload unchanged—wasting network, memory, and processing resources"
Zalando engineering teams relying on Product data · ~350 teams, thousands of applications · "With ~350 engineering teams and thousands of deployed applications"
PRAPI single-item target P99 · 50 ms · High-level requirement statement
PRAPI batch target P99 · 100 ms · High-level requirement statement
PRAPI single-GET actual P99 · Sub-10 ms on ~1,000-line JSON · "sub-10ms P99 latency"
Batch-GET range · Up to 100 items, scales predictably with item count · "handling up to 100 items, scale predictably with an expected increase in response time, closely aligning with the P999 of single GETs"
Catalogue size · ~10 million products · "just 10% of our 10 million products are hot, caching 1 million large (~1000-line JSON) product payloads per pod is simply not feasible"
Hot fraction · ~10% hot, ~90% cold · "small hot, and large cold sections"
Updater pod read batch · 250 products per batch · "Reads batches of 250 products"
Updater pod concurrent writes · 10 × 25-item batch writes to DynamoDB (250 items in flight) · "Issues 10 concurrent batch writes of 25 items to DynamoDB"
Caffeine cache TTL · 60 s total, 15 s trailing stale window · "60 second cache time with the final 15 seconds as the stale window"
DynamoDB client timeouts · 10 ms primary, 100 ms fallback · "a primary DynamoDB client with a 10ms timeout and a fallback client with a 100ms timeout for retries"
CHLB ring placements per pod · 100 fixed positions (Zalando contribution) · skipper#1712
Bounded-load cap · 2× per-pod average (Zalando contribution) · skipper#1769
Legacy formats emitted for sunset · 3 legacy (alpha/beta/gamma) + new application/json · application/x.{alpha,beta,gamma}-format+json

Caveats / absences

  • No public QPS number for PRAPI's serving tier. The post claims "millions of requests per second" in the intro but does not disclose a measured peak or a Cyber Week figure.
  • No per-Market-Group pod count. How many pods host a single Market Group, and how many Market Groups exist, are not disclosed. Only "subset of our countries" is qualitative.
  • No DynamoDB cost disclosure. Write-capacity-units, read-capacity-units, and per-request cost for the steady-state fallback-read are not broken out.
  • No cache-hit rate disclosure. The "better under load → more cached → lower latency" claim is observed in cluster-wide latency graphs but the numeric hit rate is not stated.
  • No consumer-migration timeline. The "fixed sunset period" for alpha/beta/gamma format consumers is mentioned but the duration is not given.
  • No capacity-planning figure for Market Group cold fill. "populate a new Market Group in mere minutes if needed" is qualitative — no WCU number, pod count, or duration bound.
  • No failure-mode discussion for CHLB ring ownership loss. If a pod hosting a hot product dies, the hash-ring successor cold-fills; bounded-load protects against overload but the tail-latency impact during owner churn isn't quantified.
  • No Kafka details. The source event stream is not named; partition count is referenced only via "each Market Group's Updaters scale horizontally based on lag, up to the number of partitions in the source stream." (Zalando's general stream platform is Nakadi.)
