
Zalando Throughput Calculator

Throughput Calculator is Zalando's internal tool for projecting per-downstream-service request fan-out from a projected top-level CBO (Critical Business Operation) rate, using live distributed-tracing data as the substrate. Given "we expect X checkouts/min during Cyber Week", it computes the implied RPS on every downstream service that the checkout trace traverses. It was built to support Cyber Week capacity planning and dependency-aware load-test targeting.

Origin

Built by the Zalando SRE team in 2019 as one of four 2019 Distributed-Tracing-derived platform capabilities (alongside Adaptive Paging, the SLO Reporting Tool, and Operation-Based SLO rollout).

What it does

"A Throughput Calculator based on Tracing data is also developed that helped the Load Test efforts for Cyber Week preparations. By applying the expected throughput for a CBO, we could estimate the impact on all the components that are part of the same journey, usually through cascading remote procedure calls." (sources/2021-09-20-zalando-tracing-sres-journey-part-ii)

The calculation model:

  1. Sample production traces for a given CBO (e.g., checkout) over a representative time window.
  2. Build the fan-out matrix — for each trace, count how many times each downstream service is called per root-level CBO invocation, then average across traces. A typical checkout might look like: inventory × 5, pricing × 3, payment × 1, fraud × 1, notification × 2, etc.
  3. Given a projected CBO rate (e.g., 15,000 checkouts/min at Cyber Week peak), multiply by the fan-out coefficients to project each downstream service's expected RPS.
  4. Surface the projection as a dashboard or a load-test target table.
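The four steps above can be sketched in a few lines. This is an illustrative reconstruction of the calculation model, not the actual tool: the trace shape (a flat list of downstream service names per trace), the service names, and the function names are all assumptions.

```python
from collections import Counter

def fanout_coefficients(traces):
    """Average calls to each downstream service per root-level CBO trace.

    `traces`: list of sampled traces, each a list of downstream service
    names (a hypothetical, simplified span representation).
    """
    totals = Counter()
    for spans in traces:
        totals.update(spans)
    n = len(traces)
    return {svc: count / n for svc, count in totals.items()}

def project_rps(coefficients, cbo_per_min):
    """Projected downstream RPS for a given top-level CBO rate."""
    cbo_rps = cbo_per_min / 60.0
    return {svc: coeff * cbo_rps for svc, coeff in coefficients.items()}

# Two sampled checkout traces (toy data).
traces = [
    ["inventory"] * 5 + ["pricing"] * 3 + ["payment", "fraud"],
    ["inventory"] * 5 + ["pricing"] * 3
        + ["payment", "fraud", "notification", "notification"],
]
coeffs = fanout_coefficients(traces)        # inventory: 5.0, pricing: 3.0, ...
targets = project_rps(coeffs, cbo_per_min=15_000)
# 15,000 checkouts/min = 250 checkouts/s, so inventory projects to 5.0 × 250 RPS.
```

The unit conversion matters in practice: CBO projections arrive as per-minute rates from the business side, while load-test targets are per-second.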

The key substrate is the production tracing data — fan-out coefficients are measured, not assumed. This avoids the classic capacity-planning anti-pattern where teams document their service's expected dependency pattern and the documentation drifts from reality.

Why it's load-bearing

  • Cyber Week traffic is driven by top-of-funnel user behavior, not directly by individual-service load. "Inventory service gets 45,000 RPS at peak" is a consequence of "checkout rate is 15,000/min at peak" via the fan-out matrix — projecting downstream load from the top-level rate is the natural direction.
  • Load test configuration requires the dependency topology. To test the inventory service at its projected peak, the load-test framework must know the implied RPS — which depends on the CBO projection multiplied by fan-out. Load Test Conductor consumes these projections as inputs.
  • Fan-out coefficients drift. A new feature adds a downstream call, or a retry policy changes the multiplier. Measuring from production tracing keeps the coefficients current.
  • Cascading coupling made visible. The tool surfaces which downstream services are most strongly coupled to CBO rate increases — a load increase on checkout hits not just the checkout service but N downstream services, and the matrix makes that explicit.

Prerequisites

  • Fleet-wide distributed tracing — span-level RPC relationships visible across every service in the CBO's trace path. OpenTracing deployed with a high enough sample rate to give statistically meaningful fan-out coefficients.
  • A named CBO catalogue — the top-level entity whose projected rate is multiplied.
  • Tracing API exposing traces by CBO + time — same API that powers Adaptive Paging.
  • Projection inputs — expected CBOs/min derived from historical growth + campaign forecasts; usually provided by the product / marketing team.
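On the sample-rate prerequisite: one way to judge whether the sampled traces support a coefficient is the standard error of its estimate. A minimal sketch — the function name and the per-trace counts are hypothetical, and this statistical check is an illustration of the "statistically meaningful" requirement, not a documented feature of the tool:

```python
import math

def coefficient_stderr(per_trace_counts):
    """Mean and standard error of one fan-out coefficient.

    `per_trace_counts`: calls to a single downstream service observed in
    each sampled trace. A wide 95% interval (mean ± 1.96 × stderr)
    suggests the tracing sample rate is too low for this CBO.
    """
    n = len(per_trace_counts)
    mean = sum(per_trace_counts) / n
    var = sum((c - mean) ** 2 for c in per_trace_counts) / (n - 1)
    return mean, math.sqrt(var / n)

# Inventory calls seen in 8 sampled checkout traces (toy data).
mean, se = coefficient_stderr([5, 5, 6, 4, 5, 5, 7, 5])
```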

Interaction with Load Test Conductor

The Throughput Calculator provides per-service projected RPS; Load Test Conductor consumes those targets and drives Locust workers to generate the traffic. The Calculator is the projection layer; the Conductor is the traffic-generation layer. Together with header-based mock switching in live production, they form Zalando's end-to-end load-test stack.
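The hand-off between the two layers can be pictured as a per-service target table. The field names and JSON shape below are illustrative assumptions, not Load Test Conductor's actual input format:

```python
import json

# Hypothetical shape of the projection handed to the traffic-generation
# layer; values follow the 15,000 checkouts/min example (250 CBO/s).
target_table = {
    "cbo": "checkout",
    "projected_cbo_per_min": 15_000,
    "targets": [
        {"service": "inventory", "target_rps": 1250.0},
        {"service": "pricing", "target_rps": 750.0},
        {"service": "payment", "target_rps": 250.0},
    ],
}
print(json.dumps(target_table, indent=2))
```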

Caveats

  • Sampling bias in fan-out coefficients. If production tracing is sampled (not every trace kept), the fan-out matrix is sampled too. Tail-sampling or error-biased sampling can skew the fan-out numbers away from the typical case.
  • Non-linear fan-out. Some services fan out in a way that depends on input (e.g., product search fan-out depends on query length). Simple multiplication misses this; the Calculator is a first-order approximation.
  • New CBOs. A CBO that hasn't yet run at projected scale may have under-explored downstream paths — the fan-out coefficients only reflect current traffic patterns.
  • Retry amplification. Retry policies make actual downstream RPS higher than a happy-path trace suggests; conversely, coefficients measured during a period of heavy retries overstate the steady-state rate.
