PRAPI (Zalando Product Read API)¶
Definition¶
PRAPI (Product Read API) is Zalando's centralised serving
tier for Product and Offer data, introduced as part of the
2022–2025 PODS (Product Offer Data Split) program. It
replaces the legacy pattern where every consumer team
subscribed to an event stream and rebuilt its own local store.
PRAPI exposes three HTTP endpoints —
/products/{product-id}, /products/{product-id}/offers,
/product-offers/{product-id} — and is engineered to
outperform every team's previous local cache so that "Call
the Product Read API" becomes the canonical answer to "Where
do I get Product data?" (Source:
sources/2025-03-06-zalando-from-event-driven-chaos-to-a-blazingly-fast-serving-api.)
Architecture¶
PRAPI runs as four independent Kubernetes Deployments, each with tailored scaling rules, communicating over end-to-end non-blocking I/O (Netty EventLoop + Linux-native Epoll):
| Component | Role |
|---|---|
| Updaters | Consume the source event stream, subpartition by Product ID, batch-write to DynamoDB. Scale horizontally on lag up to the partition count. |
| Products | Single-item GET. Uses Skipper Consistent Hash Load Balancing (CHLB) so each product-id maps to a specific pod, keeping hot products in that pod's local Caffeine cache. |
| Product-Offers | Combined view for the Presentation API (Fashion Store GraphQL aggregator consumer). |
| Batch | Unpacks batch requests into concurrent single-item lookups; routes via Power-of-Two Random Choices to the less-loaded of two randomly-selected backend pods; aggregates responses. |
Underneath sits DynamoDB as the durable store; PRAPI is explicitly a "fast-serving caching layer on top of DynamoDB" whose purpose is to outperform it.
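The Power-of-Two Random Choices routing used by the Batch component can be sketched as below. This is an illustrative toy model, not Skipper's implementation: pod names are invented, and requests never complete, so the in-flight counter doubles as a load histogram.

```python
import random

def pick_backend(pods, rng):
    """Power-of-Two Random Choices: sample two pods uniformly at random
    and route to the one with the smaller in-flight request count."""
    a, b = rng.sample(list(pods), 2)
    return a if pods[a] <= pods[b] else b

# Toy model: route 10,000 single-item lookups across 8 pods.
rng = random.Random(42)
pods = {f"pod-{i}": 0 for i in range(8)}
for _ in range(10_000):
    pods[pick_backend(pods, rng)] += 1

print(sorted(pods.values()))  # the spread stays within a few requests of even
```

Sampling two pods and taking the less-loaded one avoids both the herding of "always pick the global minimum" and the imbalance of pure random assignment.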
Cache tier¶
Caffeine async-loading cache with a 60-second TTL and a 15-second trailing stale window — in the last 15 s of an entry's life, a read returns the cached value immediately and triggers a background refresh from DynamoDB. This is the canonical stale-while-revalidate shape implemented at the application-cache altitude. (See concepts/async-loading-cache-stale-window.)
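The post describes this in Caffeine terms; a minimal language-agnostic sketch of the same stale-while-revalidate shape (class, method, and loader names are hypothetical, not Caffeine's API) might look like:

```python
import threading
import time

class SwrCache:
    """Illustrative stale-while-revalidate cache: entries live for `ttl`
    seconds; a read landing inside the trailing `stale_window` returns the
    cached value immediately and kicks off a background refresh."""

    def __init__(self, loader, ttl=60.0, stale_window=15.0):
        self.loader, self.ttl, self.stale_window = loader, ttl, stale_window
        self._data = {}            # key -> (value, expires_at)
        self._lock = threading.Lock()

    def get(self, key):
        now = time.monotonic()
        with self._lock:
            entry = self._data.get(key)
        if entry is None or now >= entry[1]:
            return self._load(key)                 # miss or expired: blocking load
        value, expires_at = entry
        if expires_at - now <= self.stale_window:  # last 15 s of the entry's life
            threading.Thread(target=self._load, args=(key,), daemon=True).start()
        return value                               # serve cached value immediately

    def _load(self, key):
        value = self.loader(key)
        with self._lock:
            self._data[key] = (value, time.monotonic() + self.ttl)
        return value

loads = []
cache = SwrCache(lambda k: loads.append(k) or f"value-for-{k}")
print(cache.get("p1"), cache.get("p1"), len(loads))  # second read hits the cache
```

The key property is that reads in the stale window never block on DynamoDB: they pay in freshness (up to 15 s) rather than latency.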
Market Groups (country isolation)¶
"To achieve a level of country-level isolation, multiple instances of PRAPI are deployed—known as Market Groups—with each serving a subset of our countries." The routing layer can dynamically shift traffic between Market Groups, isolating internal or canary test traffic from high-value country traffic. Populating a fresh Market Group from cold takes "mere minutes" because the bottleneck is DynamoDB write-capacity units rather than the Updater pod's fetch throughput. (See concepts/market-group-country-isolation, patterns/market-group-isolation-for-serving-api.)
Ingestion knobs (verbatim)¶
Each Updater pod:
- Reads batches of 250 products from the source stream
- Subpartitions events by Product ID
- Issues 10 concurrent batch writes of 25 items to DynamoDB (up to 250 items in flight per pod)
At fleet scale this pushes the bottleneck into DynamoDB's write-capacity units — adjustable in the AWS control plane in minutes.
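The three knobs above can be sketched as follows. One assumption is loudly labeled: this sketch reads "subpartition by Product ID" as keeping only the latest event per product, so a write batch never contains duplicate keys; the write function is a stand-in, not a real DynamoDB call.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_READ = 250   # products read per poll from the source stream
WRITE_BATCH = 25   # items per batch write (DynamoDB's BatchWriteItem limit)
CONCURRENCY = 10   # concurrent batch writes -> up to 250 items in flight per pod

def subpartition(events):
    """Assumption: keep only the latest event per Product ID, so concurrent
    batches never race on the same item."""
    latest = {}
    for event in events:               # events arrive in stream order
        latest[event["product_id"]] = event
    return list(latest.values())

def write_batch(batch):
    # Stand-in for a DynamoDB BatchWriteItem call.
    return len(batch)

def flush(events):
    items = subpartition(events)
    chunks = [items[i:i + WRITE_BATCH] for i in range(0, len(items), WRITE_BATCH)]
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        return sum(pool.map(write_batch, chunks))

events = [{"product_id": f"p{i % 200}", "rev": i} for i in range(BATCH_READ)]
print(flush(events))  # 200 unique products written in batches of 25
```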
Consistent Hash Load Balancing for Products¶
The Products component has a hot-set sizing problem: "even
if just 10% of our 10 million products are hot, caching 1
million large (~1000-line JSON) product payloads per pod is
simply not feasible." Fix: CHLB at the
Skipper ingress — each pod is
assigned to multiple fixed positions on the hash ring,
product-id hashes to a ring position, and the
clockwise-nearest pod owns that product.
"This partitions our catalogue between the available pods, allowing small local caches to effectively cache hot products. The wider we scale, the higher the portion of our catalogue that is cached." (See concepts/consistent-hashing.)
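A minimal sketch of the ownership lookup, assuming SHA-256-derived ring positions and 100 fixed positions per pod (the hash function and pod names are illustrative, not Skipper's internals):

```python
import bisect
import hashlib

def ring_hash(s):
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class HashRing:
    """Each pod occupies `positions_per_pod` fixed points on the ring; a
    product-id hashes to a point and the clockwise-nearest pod owns it."""

    def __init__(self, pods, positions_per_pod=100):
        self.ring = sorted((ring_hash(f"{pod}#{i}"), pod)
                           for pod in pods for i in range(positions_per_pod))
        self.points = [point for point, _ in self.ring]

    def owner(self, product_id):
        idx = bisect.bisect(self.points, ring_hash(product_id)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing([f"pod-{i}" for i in range(4)])
before = {pid: ring.owner(pid) for pid in (f"prod-{n}" for n in range(1000))}

grown = HashRing([f"pod-{i}" for i in range(5)])   # scale up by one pod
moved = sum(before[pid] != grown.owner(pid) for pid in before)
print(moved)  # only ~1/5 of keys change owner; the rest keep their warm cache
```

Because each pod's positions are derived only from its own name, adding a pod reassigns only the keys the new pod captures; every other product keeps its owning pod and its warm cache entry.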
Zalando's two upstream Skipper contributions¶
- Fixed-100-position placement (skipper#1712) — scale-up/down rebalancing previously caused mass cache invalidations; assigning each pod to 100 fixed ring locations reduces cache misses to 1/N where N is the previous pod count.
- Bounded-load (skipper#1769) — caps per-pod traffic at 2× the average; once exceeded, requests spill clockwise to the next non-overloaded pod, keeping hyped products (limited-edition Nike drops in Zalando's example) distributed. (See patterns/bounded-load-consistent-hashing.)
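A toy version of the bounded-load variant, capping each pod at 2× the current average load and spilling clockwise (pod names and the load accounting are illustrative, not the skipper#1769 implementation):

```python
import bisect
import hashlib
from collections import Counter

def ring_hash(s):
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class BoundedLoadRing:
    """Consistent hashing with bounded load: a pod already at `factor` times
    the current average load spills the request clockwise to the next pod."""

    def __init__(self, pods, positions_per_pod=100, factor=2.0):
        self.ring = sorted((ring_hash(f"{pod}#{i}"), pod)
                           for pod in pods for i in range(positions_per_pod))
        self.points = [point for point, _ in self.ring]
        self.factor = factor
        self.load = Counter({pod: 0 for pod in pods})

    def route(self, product_id):
        cap = self.factor * (1 + sum(self.load.values())) / len(self.load)
        idx = bisect.bisect(self.points, ring_hash(product_id)) % len(self.ring)
        for step in range(len(self.ring)):     # walk clockwise past overloaded pods
            pod = self.ring[(idx + step) % len(self.ring)][1]
            if self.load[pod] < cap:
                self.load[pod] += 1
                return pod
        raise RuntimeError("all pods at capacity")

ring = BoundedLoadRing([f"pod-{i}" for i in range(4)])
for _ in range(1000):
    ring.route("hyped-sneaker")   # one hot product, e.g. a limited-edition drop
print(dict(ring.load))  # the hot key's traffic spills beyond its owning pod
```

Without the bound, every request for the hot product lands on its single owning pod; with it, no pod can exceed twice the fleet average, so the drop's traffic spreads to clockwise neighbours.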
Advanced tuning (from the post)¶
- JVM: Java Flight Recorder + JDK Mission Control to isolate GC pauses and NIO-thread blocking tasks.
- Zero-allocation cache payloads. Products cached as a ByteArray (not an ObjectNode graph) — "reducing heap pressure." Product-Sets: responses kept gzipped in Okio buffers and concatenated directly in the response object, eliminating gunzip/re-gzip round-trips. (See patterns/zero-allocation-cache-payload.)
- LIFO over FIFO queuing. Both the load balancer and the DynamoDB clients switched to LIFO to avoid long-tail latency spikes when queuing occurred. (See patterns/lifo-queuing-for-tail-latency.)
- Two-client DynamoDB fallback. Primary client with 10 ms timeout, fallback client with 100 ms timeout for retries — isolates DynamoDB latency spikes from the primary's queue depth.
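The LIFO-over-FIFO choice can be illustrated with a toy single-server queue in which arrivals outpace service, so a backlog forms (the simulation parameters are invented, not measurements from the post):

```python
from collections import deque

def simulate(order, arrivals=100, service=2):
    """Toy single-server queue: one request arrives per tick, the server
    completes one every `service` ticks, so a backlog builds up."""
    queue, waits = deque(), []
    for t in range(arrivals * service):
        if t < arrivals:
            queue.append(t)                        # record the arrival time
        if t % service == 0 and queue:
            arrived = queue.popleft() if order == "fifo" else queue.pop()
            waits.append(t - arrived)              # time the request waited
    return sorted(waits)

fifo, lifo = simulate("fifo"), simulate("lifo")
# FIFO makes every request wait behind the whole backlog, so the median
# wait climbs with queue depth; LIFO serves the newest request almost
# immediately and concentrates the delay on a few old stragglers.
print("median fifo:", fifo[len(fifo) // 2], "median lifo:", lifo[len(lifo) // 2])
```

The trade-off: LIFO sacrifices the oldest queued requests (which were likely to time out anyway) to keep the bulk of requests fast, which is exactly the tail-latency shape the post is optimising.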
Legacy-format sunset strategy¶
PRAPI took ownership of all four legacy data shapes (Alpha,
Beta, Gamma + the new standard) and exposes them via
Accept header:
- application/json — new standard
- application/x.alpha-format+json — legacy (previously on event stream)
- application/x.beta-format+json — legacy (previously on event stream)
- application/x.gamma-format+json — legacy (from Presentation API)
In parallel, temporary PRAPI components re-emit the alpha and beta formats back onto the legacy event streams so legacy-producing applications could be decommissioned immediately while consumers migrated off the legacy formats within a fixed sunset period. (See patterns/accept-header-format-negotiation-for-legacy-sunset, concepts/legacy-format-emission-for-incremental-sunset.)
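A sketch of the Accept-header dispatch, with stand-in transform functions for the four media types (the legacy payload shapes and field names here are invented placeholders, not Zalando's real formats):

```python
# Stand-in transforms keyed by the media types listed above.
TRANSFORMS = {
    "application/json": lambda p: p,  # new standard: pass through
    "application/x.alpha-format+json": lambda p: {"alpha_sku": p["id"]},
    "application/x.beta-format+json": lambda p: {"beta_sku": p["id"]},
    "application/x.gamma-format+json": lambda p: {"gamma_sku": p["id"]},
}

def negotiate(accept_header, product):
    """Serve one canonical record in whichever shape the Accept header
    asks for, trying the listed media types in order."""
    for entry in accept_header.split(","):
        media_type = entry.strip().split(";")[0]   # drop q-weights etc.
        transform = TRANSFORMS.get(media_type)
        if transform is not None:
            return 200, transform(product)
    return 406, None   # no supported representation

print(negotiate("application/x.alpha-format+json", {"id": "p42"}))
```

The point of the pattern is that one canonical store serves all four shapes, so legacy producers can be shut down while each consumer migrates on its own schedule.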
Performance (measured at Skipper ingress)¶
- Single GET P99 < 10 ms on ~1,000-line JSON payloads with content-type transformations.
- Batch GET up to 100 items tracks the P999 of single GET — i.e. the batch round-trip time is dominated by the slowest of its concurrent single-item lookups.
- Latency improves under load. Higher QPS keeps the hot working set warmer across CHLB-owning pods.
Organisational split (Conway / CQRS)¶
PRAPI's serving architecture was paired with a Product department restructure into two stream-aligned teams: Partners & Supply (ingestion — command side) and Product Data Serving (aggregation/retrieval — query side). The post cites Martin Fowler's CQRS bliki and Conway's Law directly. (See concepts/cqrs, concepts/conways-law.)
Seen in¶
- sources/2025-03-06-zalando-from-event-driven-chaos-to-a-blazingly-fast-serving-api — the canonical source; covers the full Updater → DynamoDB → Caffeine → CHLB → Netty → Skipper architecture plus Market Groups, bounded-load, LIFO, zero-allocation caching, and the legacy sunset strategy.
Related¶
- systems/caffeine — the in-process cache
- systems/netty — the NIO framework
- systems/dynamodb — the durable store under the cache
- systems/skipper-proxy — the ingress doing CHLB + bounded-load
- systems/okio — zero-copy gzipped-response concatenation
- systems/kubernetes — deployment substrate
- concepts/consistent-hashing · concepts/market-group-country-isolation · concepts/async-loading-cache-stale-window · concepts/cqrs · concepts/conways-law
- patterns/bounded-load-consistent-hashing · patterns/power-of-two-choices · patterns/api-as-single-source-of-truth-over-event-streams · patterns/market-group-isolation-for-serving-api · patterns/accept-header-format-negotiation-for-legacy-sunset · patterns/zero-allocation-cache-payload · patterns/lifo-queuing-for-tail-latency
- companies/zalando