
Zalando Communication Platform

Definition

The Zalando Communication Platform is the internal event-driven system that produces and dispatches every customer-facing communication Zalando sends — order confirmations, shipment updates, marketing pushes, brand-alert notifications, sales-campaign announcements — across push-notification and email delivery providers.

Architecture shape

The platform is a classic event-driven fan-in / fan-out pipeline:

  • Ingress: Nakadi event bus carrying 1,000+ event types published by different teams Zalando-wide. Each event type that should trigger a customer message has a dedicated Event Listener inside the platform's Stream Consumer microservice.
  • Internal backbone: RabbitMQ routes messages between the platform's many microservices — renderers, consent/preference/blocklist checkers, eligibility checkers, template + tenant configuration stores.
  • Egress: external service providers for push + email delivery.
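The fan-in shape can be sketched as a registry of per-event-type listeners publishing onto one internal backbone. Everything below — class names, event-type strings, and a plain in-process queue standing in for RabbitMQ — is an illustrative assumption, not the platform's actual code.

```python
from queue import Queue

class EventListener:
    """One listener per Nakadi event type, as in the Stream Consumer."""

    def __init__(self, event_type: str, backbone: Queue):
        self.event_type = event_type
        self.backbone = backbone

    def on_batch(self, events: list[dict]) -> None:
        # Forward each consumed event onto the internal backbone for the
        # downstream renderers / consent checkers / eligibility checkers.
        for event in events:
            self.backbone.put({"type": self.event_type, "payload": event})

backbone: Queue = Queue()
listeners = {
    t: EventListener(t, backbone)
    for t in ("order.confirmed", "shipment.updated", "campaign.announced")
}

listeners["order.confirmed"].on_batch([{"order_id": "o-1"}])
```

The point of the shape is that adding a new customer-facing communication means registering one more listener; the backbone and everything downstream are unchanged.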

Capacity is Kubernetes-managed (horizontal scaling on CPU, memory, endpoint-call, and backlog metrics), but the platform as a whole is bounded by how quickly downstream providers will accept traffic, so traffic spikes accumulate backlogs.
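A toy calculation of that bound, assuming the provider acceptance cap is the only constraint (the rates below are made-up numbers, not Zalando's):

```python
def backlog_after(ingress_rate: float, egress_rate: float,
                  seconds: float, initial_backlog: float = 0.0) -> float:
    """Backlog growth when ingress exceeds what providers will accept.

    Scaling out pods cannot push egress past the provider cap, so any
    sustained excess accumulates linearly as backlog.
    """
    return max(0.0, initial_backlog + (ingress_rate - egress_rate) * seconds)

# A 10-minute spike of 5,000 msg/s against a 3,000 msg/s provider cap
# leaves 1.2M messages queued:
backlog_after(5_000, 3_000, 600)
```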

Priority constraint

Not all communications are equal. Business stakeholders require that SLO-critical communications (order confirmations — Zalando's canonical critical business operation, or CBO) process on time even when low-priority traffic (marketing, brand alerts) is backed up. The platform therefore exposes explicit priority tiers to event-type owners — P1 (critical), P2 (normal), P3 (bulk) in the 2024 post's example — and the ingress-rate control system respects them via per-priority AIMD coefficients (Source: sources/2024-04-22-zalando-enhancing-distributed-system-load-shedding-with-tcp-congestion-control-algorithm).
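A minimal sketch of priority-differentiated AIMD, assuming illustrative coefficient values and batch-size limits (the post's actual numbers are not reproduced here): higher tiers get a larger additive increase and a gentler multiplicative decrease, so they recover fast and shed little.

```python
# Coefficient values are assumptions for illustration only.
AIMD_COEFFICIENTS = {
    "P1": {"increase": 10, "decrease": 0.9},  # critical: fast recovery, gentle backoff
    "P2": {"increase": 5,  "decrease": 0.7},  # normal
    "P3": {"increase": 1,  "decrease": 0.5},  # bulk: slow recovery, hard backoff
}

def next_batch_size(current: int, priority: str, congested: bool,
                    floor: int = 1, ceiling: int = 500) -> int:
    """One AIMD step on a throttle's batch-size state variable."""
    c = AIMD_COEFFICIENTS[priority]
    if congested:
        # Multiplicative decrease, never below the floor.
        return max(floor, int(current * c["decrease"]))
    # Additive increase, never above the ceiling.
    return min(ceiling, current + c["increase"])
```

Under congestion a P3 throttle halves its batch size while a P1 throttle loses only 10%; once the broker recovers, P1 regains capacity ten times faster per step.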

The load-shedding design (2024)

The 2024 post documents the platform's adoption of TCP-congestion-control-style ingress rate control inside the Stream Consumer:

  • A Statistics Collector cron samples RabbitMQ P50 publish latency and publish-exception counts.
  • A Congestion Detector compares those samples to thresholds and emits a binary "congested / not congested" decision.
  • Per-event-type Throttle instances — one per Event Listener — run AIMD on their batch-size state variable with priority-specific coefficients, so high-priority event types speed up faster and slow down less.
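The detector step can be sketched as follows; the threshold values and field names are assumptions, not figures taken from the post.

```python
from dataclasses import dataclass

@dataclass
class BrokerStats:
    """One sample from the Statistics Collector cron."""
    p50_publish_latency_ms: float
    publish_exceptions: int

# Illustrative thresholds — the post does not publish its values.
LATENCY_THRESHOLD_MS = 50.0
EXCEPTION_THRESHOLD = 0

def is_congested(stats: BrokerStats) -> bool:
    """Binary decision fed to every per-event-type throttle."""
    return (stats.p50_publish_latency_ms > LATENCY_THRESHOLD_MS
            or stats.publish_exceptions > EXCEPTION_THRESHOLD)
```

The binary output is what keeps the throttles coordination-free: each one reads the same flag and applies its own AIMD step locally.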

Un-consumed events remain on Nakadi (Kafka-backed, durable), which becomes the overflow buffer — RabbitMQ queues stay small enough to follow RabbitMQ's own operational guidance. Emergency shedding is trivial: low-priority events sit on Nakadi and can be discarded or skipped via source-side configuration.

What makes it distinctive

  • Event-driven load shedding vs HTTP-ingress load shedding. Zalando had pre-existing load-shedding in Skipper for HTTP traffic, but Skipper's mechanism doesn't apply to Nakadi-event-triggered work. The communication platform is canonical for the event-bus boundary — admission control at the consumer, not at the HTTP edge.
  • Per-event-type throttles with zero coordination. Every throttle uses local state only; there's no shared database or consensus. This is the property that lets AIMD scale from "one throttle" to "one per event type" without scaling the control-plane.
  • Durable source as the architectural enabler. Nakadi's retention lets the platform shed into the upstream rather than drop — a property shared with any Kafka-backed or SQS-backed ingress but explicitly load-bearing here.

Operational numbers

  • 1,000+ Nakadi event types trigger customer communication.
  • 3-level priority table in the 2024 post: P1 (order confirmations), P2 (normal), P3 (commercial messages).
  • ~6 months in production at publication time.
  • ~300K messages in one P3 queue's Nakadi backlog during a load episode, while higher-priority queues near-empty — illustrates priority-differentiated shedding in practice.
