
Zalando Communication Platform

Definition

The Zalando Communication Platform is the internal event-driven system that produces and dispatches every customer-facing communication Zalando sends — order confirmations, shipment updates, marketing pushes, brand-alert notifications, sales-campaign announcements — across push-notification and email delivery providers.

Architecture shape

The platform is a classic event-driven fan-in / fan-out pipeline:

  • Ingress: Nakadi event bus carrying 1,000+ event types published by different teams Zalando-wide. Each event type that should trigger a customer message has a dedicated Event Listener inside the platform's Stream Consumer microservice.
  • Internal backbone: RabbitMQ routes messages between the platform's many microservices — renderers, consent/preference/blocklist checkers, eligibility checkers, template + tenant configuration stores.
  • Egress: external service providers for push + email delivery.
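The fan-in shape can be sketched as a registry of per-event-type listeners publishing onto one internal backbone. Everything below — class names, event-type strings, and a plain in-process queue standing in for RabbitMQ — is an illustrative assumption, not the platform's actual code.

```python
from queue import Queue

class EventListener:
    """One listener per Nakadi event type, as in the Stream Consumer."""

    def __init__(self, event_type: str, backbone: Queue):
        self.event_type = event_type
        self.backbone = backbone

    def on_batch(self, events: list[dict]) -> None:
        # Forward each consumed event onto the internal backbone for the
        # downstream renderers / consent checkers / eligibility checkers.
        for event in events:
            self.backbone.put({"type": self.event_type, "payload": event})

backbone: Queue = Queue()
listeners = {
    t: EventListener(t, backbone)
    for t in ("order.confirmed", "shipment.updated", "campaign.announced")
}

listeners["order.confirmed"].on_batch([{"order_id": "o-1"}])
```

The point of the shape is that adding a new customer-facing communication means registering one more listener; the backbone and everything downstream are unchanged.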

Capacity is Kubernetes-managed (horizontal scaling on CPU, memory, endpoint-call, and backlog metrics), but the platform as a whole is bounded by how quickly downstream providers will accept traffic, so traffic spikes accumulate backlogs.
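A toy calculation of that bound, assuming the provider acceptance cap is the only constraint (the rates below are made-up numbers, not Zalando's):

```python
def backlog_after(ingress_rate: float, egress_rate: float,
                  seconds: float, initial_backlog: float = 0.0) -> float:
    """Backlog growth when ingress exceeds what providers will accept.

    Scaling out pods cannot push egress past the provider cap, so any
    sustained excess accumulates linearly as backlog.
    """
    return max(0.0, initial_backlog + (ingress_rate - egress_rate) * seconds)

# A 10-minute spike of 5,000 msg/s against a 3,000 msg/s provider cap
# leaves 1.2M messages queued:
backlog_after(5_000, 3_000, 600)
```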

Priority constraint

Not all communications are equal. Business stakeholders require that SLO-critical communications (order confirmations — Zalando's canonical critical business operation, or CBO) process on time even when low-priority traffic (marketing, brand alerts) is backed up. The platform therefore exposes explicit priority tiers to event-type owners — P1 (critical), P2 (normal), P3 (bulk) in the 2024 post's example — and the ingress-rate control system respects them via per-priority AIMD coefficients (Source: sources/2024-04-22-zalando-enhancing-distributed-system-load-shedding-with-tcp-congestion-control-algorithm).
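A minimal sketch of priority-differentiated AIMD, assuming illustrative coefficient values and batch-size limits (the post's actual numbers are not reproduced here): higher tiers get a larger additive increase and a gentler multiplicative decrease, so they recover fast and shed little.

```python
# Coefficient values are assumptions for illustration only.
AIMD_COEFFICIENTS = {
    "P1": {"increase": 10, "decrease": 0.9},  # critical: fast recovery, gentle backoff
    "P2": {"increase": 5,  "decrease": 0.7},  # normal
    "P3": {"increase": 1,  "decrease": 0.5},  # bulk: slow recovery, hard backoff
}

def next_batch_size(current: int, priority: str, congested: bool,
                    floor: int = 1, ceiling: int = 500) -> int:
    """One AIMD step on a throttle's batch-size state variable."""
    c = AIMD_COEFFICIENTS[priority]
    if congested:
        # Multiplicative decrease, never below the floor.
        return max(floor, int(current * c["decrease"]))
    # Additive increase, never above the ceiling.
    return min(ceiling, current + c["increase"])
```

Under congestion a P3 throttle halves its batch size while a P1 throttle loses only 10%; once the broker recovers, P1 regains capacity ten times faster per step.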

The load-shedding design (2024)

The 2024 post documents the platform's adoption of TCP-congestion-control-style ingress rate control inside the Stream Consumer:

  • A Statistics Collector cron samples RabbitMQ P50 publish latency and publish-exception counts.
  • A Congestion Detector compares those samples to thresholds and emits a binary "congested / not congested" decision.
  • Per-event-type Throttle instances — one per Event Listener — run AIMD on their batch-size state variable with priority-specific coefficients, so high-priority event types speed up faster and slow down less.
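The detector step can be sketched as follows; the threshold values and field names are assumptions, not figures taken from the post.

```python
from dataclasses import dataclass

@dataclass
class BrokerStats:
    """One sample from the Statistics Collector cron."""
    p50_publish_latency_ms: float
    publish_exceptions: int

# Illustrative thresholds — the post does not publish its values.
LATENCY_THRESHOLD_MS = 50.0
EXCEPTION_THRESHOLD = 0

def is_congested(stats: BrokerStats) -> bool:
    """Binary decision fed to every per-event-type throttle."""
    return (stats.p50_publish_latency_ms > LATENCY_THRESHOLD_MS
            or stats.publish_exceptions > EXCEPTION_THRESHOLD)
```

The binary output is what keeps the throttles coordination-free: each one reads the same flag and applies its own AIMD step locally.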

Un-consumed events remain on Nakadi (Kafka-backed, durable), which becomes the overflow buffer — RabbitMQ queues stay small enough to follow RabbitMQ's own operational guidance. Emergency shedding is trivial: low-priority events sit on Nakadi and can be discarded or skipped via source-side configuration.

What makes it distinctive

  • Event-driven load shedding vs HTTP-ingress load shedding. Zalando had pre-existing load-shedding in Skipper for HTTP traffic, but Skipper's mechanism doesn't apply to Nakadi-event-triggered work. The communication platform is canonical for the event-bus boundary — admission control at the consumer, not at the HTTP edge.
  • Per-event-type throttles with zero coordination. Every throttle uses local state only; there's no shared database or consensus. This is the property that lets AIMD scale from "one throttle" to "one per event type" without scaling the control-plane.
  • Durable source as the architectural enabler. Nakadi's retention lets the platform shed into the upstream rather than drop — a property shared with any Kafka-backed or SQS-backed ingress but explicitly load-bearing here.

Operational numbers

  • 1,000+ Nakadi event types trigger customer communication.
  • 3-level priority table in the 2024 post: P1 (order confirmations), P2 (normal), P3 (commercial messages).
  • ~6 months in production at publication time.
  • ~300K messages in one P3 queue's Nakadi backlog during a load episode, while higher-priority queues near-empty — illustrates priority-differentiated shedding in practice.
