PATTERN Cited by 1 source

Market Group isolation for serving API

Problem

A single serving deployment for a multi-country platform has a correlated failure profile: a hot-partition event, a bad config push, or a poisoned upstream message lands across every market at once. Traffic shifts (internal test, canary) go through the same fleet that serves production, so canary mistakes also blast across all markets.

At Zalando-scale (multi-market European e-commerce, high-value DACH traffic sharing infrastructure with smaller markets), this blast radius is unacceptable around Cyber Week and limited-edition drops.

Pattern

Deploy the serving API as multiple independent instances (Market Groups), each owning a country subset, with a dynamic routing layer that can shift traffic between groups.

  • Each Market Group has its own pods, cache, DynamoDB (or equivalent) write capacity, and Kafka consumers.
  • Failures in one Market Group don't propagate to others.
  • Internal / test / canary traffic gets its own low-value Market Group; promoting a deploy means shifting traffic to a higher-value group at the routing layer.
  • Cold-filling a new Market Group is bottlenecked on downstream write capacity (adjustable in minutes), not on the serving-layer bootstrap.
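The routing layer's behaviour can be sketched as a country-to-group table plus a shift operation. A minimal sketch — all names here (`MarketGroupRouter`, `mg-dach`, `mg-canary`) are hypothetical; the post does not publish PRAPI's routing schema:

```python
from dataclasses import dataclass, field

@dataclass
class MarketGroupRouter:
    # Country code -> Market Group name: the config source of truth
    # the routing layer consults per-request.
    routes: dict[str, str] = field(default_factory=dict)

    def group_for(self, country: str) -> str:
        # Unknown traffic defaults to a low-value group so it can never
        # land on high-value markets by accident.
        return self.routes.get(country, "mg-canary")

    def shift(self, country: str, target_group: str) -> None:
        # Shifting traffic is a pure routing-layer config change;
        # serving pods in either group are untouched.
        self.routes[country] = target_group

router = MarketGroupRouter({"DE": "mg-dach", "AT": "mg-dach", "PL": "mg-central"})
router.shift("PL", "mg-canary")  # isolate PL while investigating an incident
```

The key property: `shift` mutates only the routing table, which is why promotion and isolation need no redeploy of the serving fleet.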

Zalando's PRAPI applies this explicitly:

"To achieve a level of country-level isolation, multiple instances of PRAPI are deployed—known as Market Groups— with each serving a subset of our countries. Routing configuration allows us to dynamically shift traffic between Market Groups, allowing us to isolate internal or canary test traffic from high-value country traffic." (Source: sources/2025-03-06-zalando-from-event-driven-chaos-to-a-blazingly-fast-serving-api.)

See concepts/market-group-country-isolation for the concept treatment and concepts/cell-based-architecture for the generalised form.

Key design choices

  • Axis of partitioning: country / market. Choose the axis along which containing blast radius is most valuable — country/market is natural for e-commerce because it matches legal, pricing, and business-value boundaries.
  • Stateful components per Market Group. Cache and database capacity must be per-group, not shared; otherwise a group failure contaminates shared state.
  • Stateless routing layer. The routing layer (e.g. Skipper at the ingress) is itself not partitioned by Market Group — it makes the routing decision per-request. This is what enables the "dynamic shift" capability.
  • Horizontal scaling within a group. Inside a single Market Group, requests still need to be balanced across pods — typically via consistent-hashing load balancing (CHLB) or power-of-two-choices (P2C).
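A minimal sketch of P2C within one group — pod names and load numbers are hypothetical; the post does not detail PRAPI's balancer beyond the acronyms:

```python
import random

def p2c_pick(pods: list[str], load: dict[str, int]) -> str:
    # Power-of-two-choices: sample two distinct pods uniformly at random,
    # then route to the less loaded of the pair. This avoids both the
    # herd behaviour of "global least-loaded" and the variance of
    # pure random assignment.
    a, b = random.sample(pods, 2)
    return a if load[a] <= load[b] else b

pods = ["pod-1", "pod-2", "pod-3"]
load = {"pod-1": 120, "pod-2": 3, "pod-3": 5}
# pod-1 loses every pairing it appears in, so it receives no new requests
# until its load drains.
```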

Trade-offs

  • Capacity fragmentation. Running K Market Groups means K × (hot-set cache size + base serving capacity). You give up some efficiency for isolation.
  • Configuration surface. The routing layer needs a config source of truth: which markets go to which group, and how traffic shifts between them.
  • Cold-fill cost. A new Market Group must be populated before it can serve; the source post reports this takes minutes for PRAPI, bounded by DynamoDB write capacity.
  • Not a data-residency guarantee. Market Groups alone don't ensure per-country data residency — if the underlying store spans regions, residency needs a separate control.
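A back-of-envelope for why cold-fill time is write-capacity bound — all numbers here are hypothetical; the post only says "minutes":

```python
import math

def cold_fill_minutes(items: int, item_kb: float, wcu: int) -> float:
    # One DynamoDB WCU covers one standard write per second of an item
    # up to 1 KB; larger items consume ceil(size in KB) WCUs per write.
    wcus_per_item = math.ceil(item_kb)
    writes_per_s = wcu / wcus_per_item
    return items / writes_per_s / 60.0

# e.g. 50M one-KB items against 100k provisioned WCU:
minutes = cold_fill_minutes(items=50_000_000, item_kb=1.0, wcu=100_000)
```

Since provisioned WCU can be raised in minutes, the fill time is a dial, not a fixed cost — which is what makes spinning up a new group practical.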

Adjacent tactics

  • Canary group as the target of new deploys. PRAPI describes this implicitly — isolating canary traffic from high-value country traffic.
  • Per-group WCU autoscaling. DynamoDB's per-group write capacity can be sized independently to account for market-specific peak shapes.
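Per-group sizing is simple arithmetic once each market's peak write rate is known. A sketch — the function name and 30% headroom factor are illustrative, not from the post:

```python
import math

def wcu_for_group(peak_writes_per_s: float, item_kb: float,
                  headroom: float = 1.3) -> int:
    # One standard WCU = one write/s of an item up to 1 KB; larger items
    # round up to the next whole KB. Headroom absorbs bursts above the
    # forecast peak (e.g. Cyber Week, limited-edition drops).
    return math.ceil(peak_writes_per_s * math.ceil(item_kb) * headroom)

# A high-value group peaking at 2,000 writes/s of 1.5 KB items:
assert wcu_for_group(2_000, 1.5) == 5_200
```

Because each Market Group owns its own table capacity, a smaller market's group can run at a fraction of this without being dragged up by DACH-sized peaks.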

Seen in
