
PRAPI (Zalando Product Read API)

Definition

PRAPI (Product Read API) is Zalando's centralised serving tier for Product and Offer data, introduced as part of the 2022–2025 PODS (Product Offer Data Split) program. It replaces the legacy pattern where every consumer team subscribed to an event stream and rebuilt its own local store. PRAPI exposes three HTTP endpoints — /products/{product-id}, /products/{product-id}/offers, /product-offers/{product-id} — and is engineered to outperform every team's previous local cache so that "Call the Product Read API" becomes the canonical answer to "Where do I get Product data?" (Source: sources/2025-03-06-zalando-from-event-driven-chaos-to-a-blazingly-fast-serving-api.)

Architecture

Four independent Kubernetes Deployments, each with tailored scaling rules, communicating over end-to-end non-blocking I/O (Netty EventLoop + Linux-native Epoll):

Component roles:

  • Updaters: Consume the source event stream, subpartition by Product ID, and batch-write to DynamoDB. Scale horizontally on lag, up to the partition count.
  • Products: Single-item GET. Uses Skipper Consistent Hash Load Balancing (CHLB) so each product-id maps to a specific pod, keeping hot products in that pod's local Caffeine cache.
  • Product-Offers: Combined view for the Presentation API (the Fashion Store GraphQL aggregator consumer).
  • Batch: Unpacks batch requests into concurrent single-item lookups, routes each via Power-of-Two Random Choices to the less-loaded of two randomly selected backend pods, and aggregates the responses.
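The Batch component's Power-of-Two Random Choices routing can be sketched as follows. This is a minimal illustration, not PRAPI's implementation: the pod names and in-flight counters are hypothetical.

```python
import random

def pick_pod_p2c(pods, in_flight, rng=random):
    """Power-of-two random choices: sample two distinct pods at random,
    route to whichever has fewer requests in flight."""
    a, b = rng.sample(pods, 2)
    return a if in_flight[a] <= in_flight[b] else b

# Hypothetical fleet: pod "c" is heavily loaded, so any pairing that
# includes it resolves to the other pod.
pods = ["a", "b", "c"]
in_flight = {"a": 3, "b": 5, "c": 40}
choice = pick_pod_p2c(pods, in_flight)
```

The appeal of the technique is that it needs only two load samples per request yet avoids the herd behaviour of always picking the globally least-loaded pod.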

Underneath sits DynamoDB as the durable store; PRAPI is explicitly a "fast-serving caching layer on top of DynamoDB" whose purpose is to outperform it.

Cache tier

Caffeine async-loading cache with a 60-second TTL and a 15-second trailing stale window — in the last 15 s of an entry's life, a read returns the cached value immediately and triggers a background refresh from DynamoDB. This is the canonical stale-while-revalidate shape implemented at the application-cache altitude. (See concepts/async-loading-cache-stale-window.)

Market Groups (country isolation)

"To achieve a level of country-level isolation, multiple instances of PRAPI are deployed—known as Market Groups—with each serving a subset of our countries." The routing layer can dynamically shift traffic between Market Groups, isolating internal or canary test traffic from high-value country traffic. Populating a fresh Market Group from cold takes "mere minutes" because the bottleneck is DynamoDB write-capacity units rather than the Updater pod's fetch throughput. (See concepts/market-group-country-isolation, patterns/market-group-isolation-for-serving-api.)

Ingestion knobs (verbatim)

Each Updater pod:

  • Reads batches of 250 products from the source stream
  • Subpartitions events by Product ID
  • Issues 10 concurrent batch writes of 25 items to DynamoDB (up to 250 items in flight per pod)

At fleet scale this pushes the bottleneck into DynamoDB's write-capacity units — adjustable in the AWS control plane in minutes.
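Under stated assumptions (an in-memory list standing in for the stream batch, and a plain callable standing in for the DynamoDB BatchWriteItem call), one Updater iteration with those knobs can be sketched as:

```python
from concurrent.futures import ThreadPoolExecutor

READ_BATCH, WRITE_BATCH, CONCURRENCY = 250, 25, 10  # knobs from the post

def ingest_batch(events, write_batch_fn):
    """One Updater iteration: subpartition a read batch by Product ID
    (here: last event per product wins, an assumption of this sketch),
    then issue up to CONCURRENCY concurrent WRITE_BATCH-item writes —
    up to READ_BATCH items in flight per pod."""
    latest = {}
    for event in events:  # subpartition by Product ID
        latest[event["product_id"]] = event
    items = list(latest.values())
    chunks = [items[i:i + WRITE_BATCH]
              for i in range(0, len(items), WRITE_BATCH)]
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        list(pool.map(write_batch_fn, chunks))
    return len(items)
```

Note how subpartitioning also deduplicates: a 250-event batch touching only 100 distinct products results in 100 writes, which is part of why the bottleneck lands in DynamoDB write capacity rather than in the pod.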

Consistent Hash Load Balancing for Products

The Products component has a hot-set sizing problem: "even if just 10% of our 10 million products are hot, caching 1 million large (~1000-line JSON) product payloads per pod is simply not feasible." Fix: CHLB at the Skipper ingress — each pod is assigned to multiple fixed positions on the hash ring, product-id hashes to a ring position, and the clockwise-nearest pod owns that product.

"This partitions our catalogue between the available pods, allowing small local caches to effectively cache hot products. The wider we scale, the higher the portion of our catalogue that is cached." (See concepts/consistent-hashing.)
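The ring mechanics can be sketched as below. This is illustrative, not Skipper's implementation: MD5 is an arbitrary stand-in hash, and the pod names are hypothetical.

```python
import bisect
import hashlib

def _h(s):
    # Arbitrary stable hash onto the ring; any uniform hash works.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Each pod occupies several fixed positions on the ring; a
    product-id hashes to a point and the clockwise-nearest pod owns it."""

    def __init__(self, pods, positions_per_pod=100):
        self._ring = sorted(
            (_h("%s-%d" % (pod, i)), pod)
            for pod in pods
            for i in range(positions_per_pod)
        )
        self._keys = [k for k, _ in self._ring]

    def owner(self, product_id):
        idx = bisect.bisect(self._keys, _h(product_id)) % len(self._ring)
        return self._ring[idx][1]
```

Adding a pod moves only the keys that now fall clockwise-nearest to the new pod's positions — roughly 1/(pods+1) of the catalogue — which is exactly why "the wider we scale, the higher the portion of our catalogue that is cached."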

Zalando's two upstream Skipper contributions

  1. Fixed-100-position placement (skipper#1712) — scale-up/down rebalancing previously caused mass cache invalidations; assigning each pod to 100 fixed ring locations reduces cache misses to 1/N where N is the previous pod count.
  2. Bounded-load (skipper#1769) — caps per-pod traffic at 2× the average; once exceeded, requests spill clockwise to the next non-overloaded pod, keeping hyped products (limited-edition Nike drops in Zalando's example) distributed. (See patterns/bounded-load-consistent-hashing.)
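The bounded-load rule can be sketched as a clockwise walk from the hash owner. This simplifies Skipper's actual behaviour (it walks ring positions, not a flat pod list); the pods, loads, and the 2× cap arithmetic are illustrative.

```python
def bounded_owner(ring_pods, start_idx, load, capacity):
    """Bounded-load consistent hashing: start at the hash-owner's
    position and walk clockwise past any pod whose in-flight load
    meets or exceeds the cap."""
    n = len(ring_pods)
    for step in range(n):
        pod = ring_pods[(start_idx + step) % n]
        if load[pod] < capacity:
            return pod
    return ring_pods[start_idx % n]  # everyone overloaded: keep the owner

# Hypothetical ring order and loads; cap = 2x the average load.
pods = ["a", "b", "c"]
load = {"a": 9, "b": 1, "c": 2}
cap = 2 * sum(load.values()) / len(pods)  # average is 4, so cap is 8
```

With pod "a" over the cap, a product that hashes to "a" spills clockwise to "b" — the hyped-product traffic spreads while everything else keeps its cache affinity.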

Advanced tuning (from the post)

  • JVM: Java Flight Recorder + JDK Mission Control to isolate GC pauses and NIO-thread blocking tasks.
  • Zero-allocation cache payloads. Products cached as ByteArray (not ObjectNode graph) — "reducing heap pressure." Product-Sets: responses kept gzipped in Okio buffers and concatenated directly in the response object, eliminating gunzip/re-gzip round-trips. (See patterns/zero-allocation-cache-payload.)
  • LIFO over FIFO queuing. Both the load balancer and the DynamoDB clients switched to LIFO to avoid long-tail latency spikes when queuing occurred. (See patterns/lifo-queuing-for-tail-latency.)
  • Two-client DynamoDB fallback. Primary client with 10 ms timeout, fallback client with 100 ms timeout for retries — isolates DynamoDB latency spikes from the primary's queue depth.
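The two-client fallback reduces to a try/retry across separately configured clients. In this sketch, `primary_get` and `fallback_get` are hypothetical stand-ins for the two DynamoDB clients; only the 10 ms / 100 ms timeouts come from the post.

```python
PRIMARY_TIMEOUT_MS, FALLBACK_TIMEOUT_MS = 10, 100  # from the post

def get_item(key, primary_get, fallback_get):
    """Try the aggressively-timed primary client first; on timeout,
    retry on a separate client with a looser timeout, so latency
    spikes never build queue depth on the primary's fast path."""
    try:
        return primary_get(key, timeout_ms=PRIMARY_TIMEOUT_MS)
    except TimeoutError:
        return fallback_get(key, timeout_ms=FALLBACK_TIMEOUT_MS)
```

The key design point is the isolation: because the retry goes to a second client with its own connection pool and queue, a DynamoDB brown-out degrades only the slow path.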

Legacy-format sunset strategy

PRAPI took ownership of all four data shapes — the legacy Alpha, Beta, and Gamma formats plus the new standard — and exposes them via the Accept header:

  • application/json — new standard
  • application/x.alpha-format+json — legacy (previously on event stream)
  • application/x.beta-format+json — legacy (previously on event stream)
  • application/x.gamma-format+json — legacy (from Presentation API)

In parallel, temporary PRAPI components re-emit the alpha and beta formats back onto the legacy event streams so legacy-producing applications could be decommissioned immediately while consumers migrated off the legacy formats within a fixed sunset period. (See patterns/accept-header-format-negotiation-for-legacy-sunset, concepts/legacy-format-emission-for-incremental-sunset.)
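The Accept-header selection over those four media types can be sketched as below. This simplifies real content negotiation (no q-values or wildcards); the handler-side format labels are illustrative.

```python
FORMATS = {
    "application/json": "standard",
    "application/x.alpha-format+json": "alpha",
    "application/x.beta-format+json": "beta",
    "application/x.gamma-format+json": "gamma",
}

def negotiate(accept_header):
    """Pick the first media type in the Accept header that PRAPI
    serves; None would map to 406 Not Acceptable in a real handler."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()
        if media_type in FORMATS:
            return FORMATS[media_type]
    return None
```

One serving path per format behind a single URL is what makes the sunset mechanical: migrating a consumer is a one-line Accept-header change, and format usage is observable per header value.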

Performance (measured at Skipper ingress)

  • Single GET P99 < 10 ms on ~1,000-line JSON payloads with content-type transformations.
  • Batch GET up to 100 items tracks the P999 of single GET — i.e. the batch round-trip time is dominated by the slowest of its concurrent single-item lookups.
  • Latency improves under load. Higher QPS keeps the hot working set warmer across CHLB-owning pods.
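The "batch tracks the single-GET P999" observation falls out of order statistics: a batch completes only when its slowest concurrent lookup does. A back-of-envelope check, assuming the 100 lookups are independent:

```python
n = 100            # maximum batch size from the post
p_single = 0.999   # a single lookup beats its own P999 99.9% of the time
p_batch = p_single ** n  # probability ALL 100 concurrent lookups beat it
# p_batch is about 0.905, so roughly the batch P90 lands near the
# single-GET P999 — the batch round-trip is dominated by its slowest member.
```

CHLB adds a caveat to the independence assumption: lookups that hash to the same hot pod are correlated, which the bounded-load spill mitigates.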

Organisational split (Conway / CQRS)

PRAPI's serving architecture was paired with a Product department restructure into two stream-aligned teams: Partners & Supply (ingestion — command side) and Product Data Serving (aggregation/retrieval — query side). The post cites Martin Fowler's CQRS bliki and Conway's Law directly. (See concepts/cqrs, concepts/conways-law.)
