ZALANDO 2024-04-22

Zalando — Enhancing Distributed System Load Shedding with TCP Congestion Control Algorithm

Summary

A 2024-04-22 post from Zalando's customer-communications team describing how they added ingestion-rate control to their event-driven communications platform by borrowing TCP's Additive Increase/Multiplicative Decrease (AIMD) algorithm. The platform sends every customer-facing message Zalando produces (order confirmations, marketing, brand alerts), triggered by 1,000+ Nakadi event types; downstream message preparation and fan-out to external providers are capacity-bounded, so backlogs build periodically. Business stakeholders require that SLO-critical communications (order confirmations) process on time even when lower-priority traffic (marketing) is backed up. Pre-existing Skipper load shedding doesn't apply because the traffic arrives via Nakadi events, not HTTP.

The solution: throttle event consumption at the Stream Consumer (the microservice that consumes Nakadi and publishes to RabbitMQ) using per-event-type AIMD instances driven by two RabbitMQ publish-side signals — publish latency (P50) and publish errors — with per-priority additive-increase and multiplicative-decrease coefficients, so high-priority event types speed up faster and slow down less. The mechanism needs no central coordination (each throttle is local) and exploits Nakadi's durable, Kafka-backed log as the overflow buffer, so RabbitMQ queues stay small enough to follow the broker's own operational guidance.

Six months in production: RabbitMQ queue depth is down meaningfully, order-confirmation processing time stays stable during load spikes while commercial-message processing time rises (an acceptable trade-off), and discard-by-priority becomes trivial because lower-priority events simply sit in the Nakadi backlog.

The post is a crisp example of a cross-domain algorithm transfer: TCP's congestion-control primitive, designed for per-connection byte budgets on a packet-switched network, mapped onto per-event-type batch-size budgets on an event bus. The abstraction that enables the transfer is the congestion signal itself — TCP uses packet loss and ACK delay; the Zalando design uses publish errors and publish P50 latency. The algorithm, the per-priority coefficient table, and the observer-pattern wiring between a statistics collector, a congestion detector, and per-reader throttles are reusable wherever "the system has downstream capacity signals and the ingress is selectable."

Key takeaways

  1. A capacity-bounded event-driven system without ingest control degrades all traffic equally during spikes. Zalando's communications platform consumes 1,000+ Nakadi event types that trigger customer communications; some (order confirmations) have hard SLOs, most don't. Without ingest control, a traffic spike on low-priority event types (marketing campaigns, brand alerts) pushes messages into RabbitMQ backlogs that block all consumers — so P1 traffic ends up queued behind P3. The business constraint isn't "process everything fast"; it's "process SLO-protected operations on-time, let the rest wait." (Source: sources/2024-04-22-zalando-enhancing-distributed-system-load-shedding-with-tcp-congestion-control-algorithm)

  2. TCP's AIMD is the right algorithm for rate-adapting to opaque downstream capacity. AIMD's attractive properties for ingest control (from the post's design rationale): reacts to downstream saturation signals (errors, increased latency) without needing a measured capacity number; adapts when capacity grows (e.g. pods scale out) because the additive-increase keeps probing; uses only local state per throttle instance, so there's no central coordinator or shared database. This is exactly the distributed-systems property that made AIMD successful for the internet — no router knows the whole network's capacity, and every TCP sender finds its own steady state.
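
The update rule behind these properties can be sketched as a tiny local controller (a minimal sketch with illustrative names and default values; the post publishes no code):

```python
# Minimal AIMD rate controller: one instance per stream, purely local state.
class AimdThrottle:
    def __init__(self, initial=10, increase=5, decrease=0.6, floor=1):
        self.batch_size = initial   # current "window": events fetched per poll
        self.increase = increase    # additive-increase step (probes for capacity)
        self.decrease = decrease    # multiplicative-decrease factor (backs off)
        self.floor = floor          # never stop consuming entirely

    def on_healthy(self):
        # No congestion signal: keep probing upward, even past previous peaks,
        # so the throttle adapts when downstream capacity grows (e.g. scale-out).
        self.batch_size += self.increase

    def on_congested(self):
        # Congestion signal: back off multiplicatively, floored so the stream
        # can still observe recovery.
        self.batch_size = max(self.floor, int(self.batch_size * self.decrease))
```

Each Nakadi event type would own one such instance; because the additive step keeps probing after every back-off, the throttle tracks capacity changes without a coordinator or shared database.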

  3. Publish latency and publish errors are the two signals Zalando chose for congestion. The RabbitMQ producer-side API exposes: (a) whether a publish succeeded or threw, (b) the round-trip time of the publish. RabbitMQ applies back-pressure to slow publishers when its consumers can't keep up (its own flow-control mechanism), and the publisher experiences that as increased publish time. So the single observable "P50 publish latency" fuses both RabbitMQ-internal saturation and the platform's own downstream pipeline saturation into one number the Congestion Detector can compare to a configured threshold (concepts/publish-latency-as-congestion-signal, patterns/multi-metric-throttling — latency + error combined).
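
A threshold check over those two signals might look like this (a sketch; the threshold values and field names are assumptions, since the post only says thresholds are "configured in the service"):

```python
from dataclasses import dataclass

@dataclass
class PublishStats:
    p50_latency_ms: float   # P50 of RabbitMQ publish round-trip time in the window
    error_count: int        # publish exceptions observed in the window

def is_congested(stats: PublishStats,
                 latency_threshold_ms: float = 50.0,
                 error_threshold: int = 0) -> bool:
    # Either signal alone trips the detector: elevated publish latency fuses
    # RabbitMQ flow control and downstream pipeline saturation into one number,
    # while publish errors catch hard failures directly.
    return (stats.p50_latency_ms > latency_threshold_ms
            or stats.error_count > error_threshold)
```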

  4. Per-priority coefficients turn one AIMD into a prioritization policy. Each event-type's throttle is seeded with priority-specific additive-increase and multiplicative-decrease constants. The example from the post: on speed-up signal, P1 += 15, P2 += 10, P3 += 5; on slow-down signal, P1 × 80%, P2 × 60%, P3 × 40%. Under load, low-priority throttles contract hard and fast while high-priority throttles barely move — so downstream capacity is re-allocated toward P1 automatically without any inter-throttle coordination (concepts/per-priority-aimd-coefficients, patterns/priority-differentiated-load-shedding).
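
To see how quickly the classes diverge, apply the post's multiplicative-decrease coefficients to a common starting batch size (the starting value of 100 is an illustrative assumption):

```python
# Three consecutive slow-down signals, every priority starting at batch size 100.
decrease = {"P1": 0.8, "P2": 0.6, "P3": 0.4}   # coefficients from the post
size = {p: 100.0 for p in decrease}
for _ in range(3):
    for p in size:
        size[p] *= decrease[p]
# After three signals: P1 = 51.2, P2 = 21.6, P3 = 6.4.
# P3 has shed ~94% of its rate while P1 keeps roughly half, so downstream
# capacity re-allocates toward P1 with no coordination between throttles.
```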

  5. Nakadi's durable log serves as the overflow buffer, so RabbitMQ stays light. When throttles reduce consumption, un-consumed events sit on Nakadi (which is Kafka-backed and durable). They're not pushed into RabbitMQ queues, so the queues stay at sizes RabbitMQ actually performs well at. This is a canonical application of patterns/source-queue-as-overflow-buffer: accept backlog at the durable source; don't move it into the hot-path broker. The post explicitly calls out that RabbitMQ best practice is "keep queues short," and the AIMD mechanism is what makes that achievable even with bursty ingress.

  6. "Discard lowest priority in an emergency" becomes trivial once the backlog lives on the source. Because low-priority events accumulate in Nakadi, emergency shedding is a matter of configuring Nakadi subscriptions to skip ahead (or advancing their cursor past unwanted events), not a distributed delete across RabbitMQ queues ("Messages of lower priority can be discarded in case of emergency."). The post's diagrams show this playing out — a ~300K backlog on a P3 queue while P1 processing time stays flat; if the P3 backlog ever needs to be truncated, it's one source-side configuration change.

  7. Scoping the change to one microservice keeps blast radius small. The control-plane change lives entirely inside the Stream Consumer, the single Nakadi-to-RabbitMQ bridge. No change to downstream microservices, to message schemas, or to RabbitMQ topology. Because every throttle is local and uncoordinated, there's no new database to provision or consensus to operate. This is a general property of AIMD-based rate control and is one of the reasons TCP's design scales to the internet: the algorithm fits at one place in the pipeline and needs no shared state.

  8. Six-month production results: reduced queue depth, stable SLO operation, discards-by-priority made easy. The post reports (qualitatively, with supporting graphs): (a) RabbitMQ message counts are sharply lower — capacity in queues is reserved for what the platform can actually process; (b) order-confirmation processing time stays flat through load episodes while commercial-message processing time rises, "acceptable as this is a low priority use case"; (c) Nakadi backlogs grow and drain in place during load episodes without stressing the platform.

Architecture

System under load

  • Nakadi — distributed event bus (Kafka-backed with REST) publishing 1,000+ event types that trigger customer communication.
  • Stream Consumer — the microservice consuming Nakadi and publishing to RabbitMQ. One Event Listener instance per Nakadi event type.
  • RabbitMQ — the platform's internal message-broker backbone (routing between downstream services).
  • Platform microservices — render messages (push + email), check consent / preference / blocklist, check eligibility, manage templates / tenants. Connected via RabbitMQ exchanges + queues. Kubernetes-scaled on CPU / memory / endpoint-calls / backlog metrics.

Stream Consumer internals (the change)

Three collaborating components, wired by observer pattern:

  • Statistics Collector — cron job collecting P50 publish latency and publish-exception counts for RabbitMQ.
  • Congestion Detector — compares collected stats to configured thresholds; emits a binary "congested / not congested" decision; broadcasts to all throttles.
  • Throttle (one per Event Listener / event type) — implements AIMD. Holds the current batch-size state variable. On speed-up signal: size += increase[priority]. On slow-down signal: size *= decrease[priority]. No coordination with peer throttles.
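
The wiring between detector and throttles can be sketched as a plain observer broadcast (class and method names here are illustrative assumptions, not Zalando's code; the coefficient table is the post's example):

```python
class Throttle:
    # (additive increase, multiplicative decrease) per priority, from the post.
    COEFFS = {"P1": (15, 0.8), "P2": (10, 0.6), "P3": (5, 0.4)}

    def __init__(self, priority, initial=50):
        self.increase, self.decrease = self.COEFFS[priority]
        self.batch_size = initial

    def on_healthy(self):
        self.batch_size += self.increase    # additive increase

    def on_congested(self):
        # Multiplicative decrease, floored so consumption never fully stops.
        self.batch_size = max(1, int(self.batch_size * self.decrease))


class CongestionDetector:
    def __init__(self, latency_threshold_ms=50.0, error_threshold=0):
        self.latency_threshold_ms = latency_threshold_ms
        self.error_threshold = error_threshold
        self.throttles = []                 # observers: one throttle per event type

    def register(self, throttle):
        self.throttles.append(throttle)

    def on_stats(self, p50_latency_ms, error_count):
        # Called on each Statistics Collector tick. The decision is binary and
        # broadcast to every throttle; each reacts with its own priority-specific
        # coefficients, so there is no inter-throttle coordination.
        congested = (p50_latency_ms > self.latency_threshold_ms
                     or error_count > self.error_threshold)
        for t in self.throttles:
            (t.on_congested if congested else t.on_healthy)()
```

One congested tick followed by one healthy tick already re-allocates capacity toward P1: with both throttles starting at 50, P1 moves 50 → 40 → 55 while P3 moves 50 → 20 → 25.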

Because Nakadi doesn't natively support dynamically changing consumption batch size, the Stream Consumer simulates it by pausing/resuming consumption to match the throttle's current target rate.
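
A minimal sketch of that pause/resume approximation (the function and callback names are hypothetical; real Nakadi client mechanics are omitted):

```python
def consume_with_budget(events, process, batch_size, pause):
    """Process events from a stream, but after every `batch_size` events call
    `pause()` (which in the real Stream Consumer would stop reading the Nakadi
    stream until the throttle's next tick releases a fresh budget)."""
    consumed = 0
    for event in events:
        if consumed >= batch_size:
            pause()        # un-consumed events stay on Nakadi's durable log
            consumed = 0
        process(event)
        consumed += 1
```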

Operational numbers and diagrams

  • 1,000+ Nakadi event types trigger customer communication.
  • 3-level priority table in the post's example: P1 highest (order confirmations), P3 lowest (commercial messages).
  • Example coefficients (from the post):
      • Additive increase: P1 += 15, P2 += 10, P3 += 5.
      • Multiplicative decrease: P1 × 80%, P2 × 60%, P3 × 40%.
  • ~6 months in production at the time of writing.
  • ~300K messages shown sitting in a single P3-priority queue's Nakadi backlog while higher-priority queues are near-empty — illustrates priority-differentiated shedding working under real load.
  • RabbitMQ queue backlog graph: sharply reduced vs. pre-AIMD.
  • Order confirmation processing time: flat across load episodes.
  • Commercial message processing time: rises during load episodes (acceptable per SLO).

Caveats

  • The post presents the AIMD instance as a batch-size state variable with additive / multiplicative updates — not the full TCP cwnd machinery (no slow-start, no timeout-based reset). The naming is intentional but the implementation is a simplified congestion-control loop suitable for application-level rate adaptation.
  • Threshold tuning is hand-rolled. The Congestion Detector compares publish latency and error counts to values "configured in the service." No detail on how those thresholds were chosen or how often they're re-tuned. In practice this is the delicate knob — a wrong threshold either throttles too aggressively (hurts throughput even in healthy states) or too late (still floods RabbitMQ before throttling fires).
  • Priorities are discrete classes, not per-event priorities. Three levels in the example; each event type gets assigned one of them. Finer-grained prioritization (per-message, per-customer-tier) is not addressed.
  • No quantitative before/after numbers are given for queue depth, processing-time delta, or throughput. The post shows unlabeled time-series graphs that qualitatively support the claims; there's no p50/p95 table or throughput measurement.
  • Nakadi's durability assumption is load-bearing. The "discard from source in emergency" and "Nakadi holds the backlog" properties all rest on Nakadi's retention configuration being long enough. If the backlog exceeds retention, events are silently lost. The post doesn't cover how retention is sized relative to expected worst-case shedding episodes.
  • Nakadi doesn't natively support batch-size changes, so the Stream Consumer has to pause/resume consumption to approximate the throttle's rate. This is a concrete implementation quirk that other broker clients wouldn't inherit.
