SYSTEM Cited by 1 source
Zalando Communication Platform¶
Definition¶
The Zalando Communication Platform is the internal event-driven system that produces and dispatches every customer-facing communication Zalando sends — order confirmations, shipment updates, marketing pushes, brand-alert notifications, sales-campaign announcements — across push-notification and email delivery providers.
Architecture shape¶
The platform is a classic event-driven fan-in / fan-out:
- Ingress: Nakadi event bus carrying 1,000+ event types published by different teams Zalando-wide. Each event type that should trigger a customer message has a dedicated Event Listener inside the platform's Stream Consumer microservice.
- Internal backbone: RabbitMQ routes messages between the platform's many microservices: renderers, consent/preference/blocklist checkers, eligibility checkers, and template + tenant configuration stores.
- Egress: external service providers for push + email delivery.
Capacity is Kubernetes-managed (horizontal scaling on CPU / memory / endpoint-calls / backlog metrics), but the platform as a whole is bounded by how quickly downstream providers will accept traffic, so spikes accumulate backlogs.
Priority constraint¶
Not all communications are equal. Business stakeholders require that SLO-critical communications (order confirmations, Zalando's canonical critical business operation, or CBO) are processed on time even when low-priority traffic (marketing, brand alerts) is backed up. The platform therefore exposes explicit priority tiers to event-type owners (P1 critical, P2 normal, P3 bulk in the 2024 post's example), and the ingress-rate control system respects them via per-priority AIMD (additive-increase / multiplicative-decrease) coefficients (Source: sources/2024-04-22-zalando-enhancing-distributed-system-load-shedding-with-tcp-congestion-control-algorithm).
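One way to picture per-priority coefficients is as a small lookup table keyed by tier. The sketch below is illustrative only: the names `Priority`, `AimdCoefficients` and `COEFFICIENTS`, and every numeric value, are assumptions made for this example and are not figures from the 2024 post. The only property it tries to capture is the ordering: higher tiers get a larger additive step and a gentler multiplicative cut.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    P1 = 1  # critical, e.g. order confirmations
    P2 = 2  # normal
    P3 = 3  # bulk, e.g. marketing


@dataclass(frozen=True)
class AimdCoefficients:
    additive_increase: int          # events added to the batch size when not congested
    multiplicative_decrease: float  # factor applied to the batch size when congested


# Hypothetical values, chosen only to show the ordering between tiers.
COEFFICIENTS = {
    Priority.P1: AimdCoefficients(additive_increase=15, multiplicative_decrease=0.8),
    Priority.P2: AimdCoefficients(additive_increase=10, multiplicative_decrease=0.6),
    Priority.P3: AimdCoefficients(additive_increase=5, multiplicative_decrease=0.4),
}
```

With a table like this, an event-type owner only has to declare a tier; the throttle for that event type looks up its own coefficients.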
The load-shedding design (2024)¶
The 2024 post documents the platform's adoption of TCP-congestion-control-style ingress rate control inside the Stream Consumer:
- A Statistics Collector cron samples RabbitMQ P50 publish latency + publish-exception counts.
- A Congestion Detector compares those to thresholds and emits a binary "congested / not congested" decision.
- Per-event-type Throttle instances (one per Event Listener) run AIMD on their batch-size state variable with priority-specific coefficients, so high-priority event types speed up faster and slow down less.
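The three components above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the platform's code: `PublishStats`, `is_congested` and `Throttle` are hypothetical names, the thresholds and coefficients are made up, and combining the two signals with a logical OR is an assumption (the source only says the detector compares them to thresholds).

```python
from dataclasses import dataclass


@dataclass
class PublishStats:
    """One sample from the statistics collector (hypothetical shape)."""
    p50_latency_ms: float
    publish_exceptions: int


def is_congested(stats, latency_threshold_ms=50.0, exception_threshold=0):
    """Binary decision; OR-ing the two signals is an assumption of this sketch."""
    return (stats.p50_latency_ms > latency_threshold_ms
            or stats.publish_exceptions > exception_threshold)


class Throttle:
    """AIMD on a batch-size state variable, using only local state."""

    def __init__(self, increase, decrease, initial=100, min_batch=1, max_batch=500):
        self.increase = increase      # additive step when not congested
        self.decrease = decrease      # multiplicative factor when congested
        self.min_batch = min_batch
        self.max_batch = max_batch
        self.batch_size = initial

    def update(self, congested):
        if congested:
            # Multiplicative decrease, never below the floor.
            self.batch_size = max(self.min_batch, int(self.batch_size * self.decrease))
        else:
            # Additive increase, never above the ceiling.
            self.batch_size = min(self.max_batch, self.batch_size + self.increase)
        return self.batch_size


# Hypothetical coefficients: the high-priority throttle recovers faster and backs off less.
p1 = Throttle(increase=15, decrease=0.8)
p3 = Throttle(increase=5, decrease=0.4)

congested = is_congested(PublishStats(p50_latency_ms=120.0, publish_exceptions=0))
p1.update(congested)  # 100 -> 80
p3.update(congested)  # 100 -> 40

calm = is_congested(PublishStats(p50_latency_ms=10.0, publish_exceptions=0))
p1.update(calm)       # 80 -> 95
p3.update(calm)       # 40 -> 45
```

Both throttles react to the same binary signal; the priority differentiation comes entirely from the coefficients each one was constructed with.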
Un-consumed events remain on Nakadi (Kafka-backed, durable), which becomes the overflow buffer — RabbitMQ queues stay small enough to follow RabbitMQ's own operational guidance. Emergency shedding is trivial: low-priority events sit on Nakadi and can be discarded or skipped via source-side configuration.
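The overflow-buffer idea can be illustrated with an in-memory stand-in for a durable log. Everything here is hypothetical: `DurableLog` is not a Nakadi client and its methods are not Nakadi API calls; it only models the fact that a consumer which reads fewer events than are available leaves the remainder upstream, and that skipping a backlog amounts to moving a cursor.

```python
class DurableLog:
    """In-memory stand-in for a durable, cursor-based event log (illustrative only)."""

    def __init__(self, events):
        self._events = list(events)
        self.cursor = 0  # index of the next unread event

    def read(self, max_events):
        """Return at most max_events and advance the cursor past them."""
        batch = self._events[self.cursor:self.cursor + max_events]
        self.cursor += len(batch)
        return batch

    def backlog(self):
        """Events still waiting upstream."""
        return len(self._events) - self.cursor

    def skip_backlog(self):
        """Emergency shedding: jump the cursor to the end, return how many were skipped."""
        skipped = self.backlog()
        self.cursor = len(self._events)
        return skipped


log = DurableLog(f"event-{i}" for i in range(1000))

# The throttle currently allows 40 events, so only 40 are forwarded downstream.
forwarded = log.read(max_events=40)
remaining = log.backlog()     # 960 events stay buffered in the source

# If the backlog is low priority and no longer worth sending, it can be skipped.
skipped = log.skip_backlog()  # 960
```

The internal broker never sees the 960 events that were held back, which is what keeps its queues short during a spike.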
What makes it distinctive¶
- Event-driven load shedding vs HTTP-ingress load shedding. Zalando had pre-existing load-shedding in Skipper for HTTP traffic, but Skipper's mechanism doesn't apply to Nakadi-event-triggered work. The communication platform is the canonical example of load shedding at the event-bus boundary: admission control happens at the consumer, not at the HTTP edge.
- Per-event-type throttles with zero coordination. Every throttle uses local state only; there's no shared database or consensus. This is the property that lets AIMD scale from "one throttle" to "one per event type" without scaling the control-plane.
- Durable source as the architectural enabler. Nakadi's retention lets the platform shed into the upstream rather than drop — a property shared with any Kafka-backed or SQS-backed ingress but explicitly load-bearing here.
Operational numbers¶
- 1,000+ Nakadi event types trigger customer communication.
- 3-level priority table in the 2024 post: P1 (order confirmations), P2 (normal), P3 (commercial messages).
- ~6 months in production at publication time.
- ~300K messages in one P3 queue's Nakadi backlog during a load episode, while higher-priority queues stayed near-empty; this illustrates priority-differentiated shedding in practice.
Seen in¶
- Zalando — Enhancing Distributed System Load Shedding with TCP Congestion Control Algorithm (2024-04-22) — canonical architecture + load-shedding-design post for the Communication Platform. Opens with "Our team is responsible for sending out communications to all our customers at Zalando — e.g. confirming a placed order, informing about new content from a favourite brand or announcing sales campaigns."
Related¶
- systems/zalando-stream-consumer — the microservice where AIMD admission control lives.
- systems/nakadi — the ingress event bus / overflow buffer.
- systems/rabbitmq — internal message-broker backbone.
- systems/skipper-proxy — the HTTP-side load-shedder that motivated the design-space search.
- concepts/additive-increase-multiplicative-decrease-aimd
- concepts/load-shedding-at-ingestion
- concepts/publish-latency-as-congestion-signal
- concepts/per-priority-aimd-coefficients
- concepts/critical-business-operation — order confirmations as the canonical P1 business operation.
- concepts/service-level-objective
- patterns/aimd-ingestion-rate-control
- patterns/priority-differentiated-load-shedding
- patterns/source-queue-as-overflow-buffer
- companies/zalando