PATTERN Cited by 1 source
KPI closed-loop load ramp-up¶
What this is¶
KPI closed-loop load ramp-up is the pattern of driving a load test's ramp schedule with a closed-loop feedback controller that targets a business KPI (orders-per-minute, checkouts-per-second), not a technical quantity (concurrent users, requests-per-second). Every control cycle, the controller re-measures the actual KPI, recomputes the users-per-KPI ratio from observation, and adjusts the load generator's user count and spawn rate to close the gap.
This is the opposite of the more common open-loop fixed-ramp approach ("spawn N users over T minutes, then hold"), which schedules concurrency but produces an unknown KPI shape.
The concrete algorithm (Zalando instantiation)¶
From sources/2021-03-01-zalando-building-an-end-to-end-load-test-automation-system-on-top-of-kubernetes, reproduced as the conductor's 60-second loop:
target_opm = <configured>
interval = 60s
iterations = test_duration / interval
users = 1
while time_left > 0:
    status = locust.status()  # users, orders-seen
    current_opm = delta_orders(last 60s)
    if status.users == 0:
        # initialising
        locust.set(user_count=1, hatch_rate=1)
    elif current_opm == 0:
        # stalled
        locust.set(user_count=1, hatch_rate=1)
        alert("load test stalled")
    else:
        users_per_opm = status.users / current_opm
        required_users = users_per_opm * target_opm
        to_spawn = required_users - status.users
        iters_left = time_left / interval
        this_iter_users = to_spawn / iters_left
        hatch_rate = this_iter_users / interval
        locust.set(user_count=required_users, hatch_rate=hatch_rate)
    sleep(interval)
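The per-cycle decision in the loop above can be sketched as a pure Python function, which makes it testable without a running load generator. This is an illustrative sketch, not Zalando's implementation: the names `Step` and `plan_step` are hypothetical, and rounding `required_users` up and clamping `hatch_rate` to a small positive floor are assumptions added here so that a downward correction never produces a zero or negative spawn rate.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    user_count: int    # target user count to hand to the load generator
    hatch_rate: float  # users spawned per second
    state: str         # "initialising", "stalled", or "ramping"


def plan_step(users: int, current_opm: float, target_opm: float,
              time_left_s: float, interval_s: float = 60.0) -> Step:
    """Decide one control cycle from the latest observation."""
    if users == 0:
        return Step(1, 1.0, "initialising")
    if current_opm == 0:
        # The caller is expected to raise the "load test stalled" alert.
        return Step(1, 1.0, "stalled")

    users_per_opm = users / current_opm
    # Assumption: round up so the target is not undershot by a fraction of a user.
    required_users = math.ceil(users_per_opm * target_opm)
    to_spawn = required_users - users
    # Spread the remaining gap over the cycles that are left (at least one).
    iters_left = max(time_left_s / interval_s, 1.0)
    this_iter_users = to_spawn / iters_left
    # Assumption: keep the rate positive even when to_spawn is negative.
    hatch_rate = max(this_iter_users / interval_s, 1.0 / interval_s)
    return Step(required_users, hatch_rate, "ramping")
```

With 10 users producing 5 orders per minute, a target of 100 orders per minute, and 10 minutes left, the function asks for 200 users and paces the 190 missing users over the 10 remaining cycles.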
Why KPI, not users¶
The ratio of concurrent users to business-KPI throughput is not constant under load:
- Latency grows → fewer orders per user per minute.
- Retries multiply → more concurrent requests per order.
- Queueing stretches tail latency → concurrent-user budgets blow out without additional throughput.
An open-loop ramp of "users over time" produces an arbitrary KPI shape and often under-loads the system when it needs to be at peak for the test to produce signal.
Closed-loop-on-KPI self-corrects: when the ratio worsens, the controller spawns more users to keep target KPI on schedule — which is exactly what the real peak event would demand.
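The drift in the users-to-KPI ratio can be illustrated with the interactive response-time law for closed systems, N = X * (R + Z), where N is concurrent users, X is throughput, R is response time, and Z is think time. The sketch below assumes each simulated user places one order per cycle; the function name and all numbers are hypothetical.

```python
def users_needed(target_opm: float, response_s: float, think_s: float) -> float:
    """Concurrent users required to sustain target_opm, via N = X * (R + Z)."""
    orders_per_second = target_opm / 60.0
    return orders_per_second * (response_s + think_s)


# Same target KPI, progressively slower system.
for response_s in (1.0, 2.0, 4.0, 8.0):
    print(response_s, users_needed(600, response_s, think_s=4.0))
```

For a fixed target of 600 orders per minute and 4 s of think time, the required concurrency rises from 50 to 120 users as response time grows from 1 s to 8 s, so a ramp scheduled in users cannot hold the KPI steady.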
Why closed-loop, not one-shot¶
A one-shot calculation ("10 users gives us 1 order/minute, so spawn 1000 users for 100 orders/minute") bakes in one observation of the users-per-KPI ratio. The moment the system degrades (and it will, since that is the point of load testing), the baked-in ratio is wrong. Closed-loop control re-measures continuously and makes no fixed assumption about the ratio.
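A toy simulation makes the difference concrete. The system model `observed_opm` below is entirely hypothetical (per-user throughput falls as concurrency rises), and the closed-loop variant jumps straight to the recomputed user count each cycle, ignoring hatch-rate pacing, to keep the sketch short.

```python
def observed_opm(users: float) -> float:
    """Hypothetical system: about 0.1 orders/min per user, degrading under load."""
    return 0.1 * users / (1.0 + users / 2000.0)


TARGET_OPM = 100.0
PROBE_USERS = 10.0

# One-shot: measure the ratio once at low load, then extrapolate.
one_shot_users = PROBE_USERS / observed_opm(PROBE_USERS) * TARGET_OPM
one_shot_opm = observed_opm(one_shot_users)

# Closed-loop: re-measure the ratio every cycle and recompute the requirement.
users = PROBE_USERS
for _ in range(20):
    users = users / observed_opm(users) * TARGET_OPM
closed_loop_opm = observed_opm(users)

print(round(one_shot_users), round(one_shot_opm, 1))
print(round(users), round(closed_loop_opm, 1))
```

Under this model the one-shot extrapolation settles at roughly 1,000 users and about 67 orders per minute, well short of the target, while the closed-loop iteration converges on the roughly 2,000 users that actually deliver 100 orders per minute.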
Stall detection + reset-to-safe¶
Two explicit guards in Zalando's pseudocode:
- status.users == 0: "load test is being initialized". Set user_count and hatch_rate to 1.
- current_opm == 0 while users > 0: "load test stalled". Reset user_count and hatch_rate to 1, and alert.
The pattern: on stuck states, reset rather than compound. Without these guards, a zero KPI reading leaves the users-per-KPI ratio undefined and a near-zero reading inflates it, so the controller would keep spawning more users against a silent system.
Required inputs¶
- A business KPI the system reports per interval. For Zalando, orders-per-minute is observable from Locust + the payment system's side. The interval must be short enough (1 minute at Zalando) that the feedback lag doesn't destabilise the ramp.
- A load-generator API that accepts runtime hatch-rate + user-count updates. Locust supports this natively; custom harnesses need to expose the same.
- Smoothing tolerance for noisy KPIs. Orders-per-minute is intrinsically noisy; the 60s interval at Zalando is the authors' chosen trade-off between responsiveness and oscillation.
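One way to obtain the per-interval KPI is to sample a cumulative order counter and take the delta over a sliding window. The sketch below is a hypothetical helper (`OpmMeter` is not part of Locust or of the Zalando conductor); it assumes the counter only increases and that timestamps are in seconds.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class OpmMeter:
    """Turns a cumulative order counter into an orders-per-minute reading."""

    def __init__(self, window_s: float = 60.0) -> None:
        self.window_s = window_s
        self._samples: Deque[Tuple[float, int]] = deque()  # (timestamp_s, total_orders)

    def record(self, timestamp_s: float, total_orders: int) -> None:
        self._samples.append((timestamp_s, total_orders))
        # Drop old samples, but keep one at or before the window start
        # so that the delta spans the whole window.
        while (len(self._samples) >= 2
               and self._samples[1][0] <= timestamp_s - self.window_s):
            self._samples.popleft()

    def current_opm(self) -> Optional[float]:
        if len(self._samples) < 2:
            return None  # not enough data yet
        (t0, n0), (t1, n1) = self._samples[0], self._samples[-1]
        if t1 <= t0:
            return None  # cannot compute a rate over a zero-length span
        return (n1 - n0) * 60.0 / (t1 - t0)
```

A longer `window_s` smooths a noisy KPI at the cost of a slower reaction, which is the same trade-off as the choice of control interval.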
Limitations¶
- Unreachable targets push to failure. If no number of users can produce the target KPI, the controller keeps adding users for as long as the test runs. Target feasibility needs to be estimated separately, with abort criteria layered on top.
- Cascading dependencies can mask KPI. If orders-per-minute is reported only post-payment-confirmation, and payment confirmation is async, the signal lags the load and the controller oscillates.
- Doesn't exercise peak-above-target failure modes. A pure KPI-driven controller aims at target and no higher. Explicit over-shoot load tests need a different mode.
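Abort criteria for the unreachable-target case can be layered on top of the controller as a separate check that runs each cycle. The sketch below is one possible shape, not something described in the source: `AbortPolicy`, `check_abort`, and both thresholds are hypothetical and would have to come from a separate feasibility estimate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AbortPolicy:
    max_users: int            # hard ceiling on concurrency (hypothetical budget)
    max_users_per_opm: float  # ratio above which the target counts as unreachable


def check_abort(users: int, current_opm: float, target_opm: float,
                policy: AbortPolicy) -> Optional[str]:
    """Return a reason to abort, or None to let the controller continue."""
    if current_opm <= 0:
        return None  # stalls are handled by the controller's own guard
    users_per_opm = users / current_opm
    if users_per_opm > policy.max_users_per_opm:
        return "users-per-KPI ratio exceeds limit"
    if users_per_opm * target_opm > policy.max_users:
        return "required users exceed budget"
    return None
```

Running the check before each call to the load generator lets the test stop with an explicit reason instead of piling users onto a system that cannot reach the target.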
Seen in¶
- sources/2021-03-01-zalando-building-an-end-to-end-load-test-automation-system-on-top-of-kubernetes — Zalando Payments' end-to-end Cyber-Week load test. Target KPI = orders-per-minute. 60-second control interval. The Load Test Conductor pushes new hatch rates + user counts to the Locust controller API every cycle.
Related¶
- concepts/kpi-driven-load-ramp-up — the concept form.
- patterns/declarative-load-test-conductor — the orchestrator that hosts this algorithm.
- systems/locust — the traffic generator whose API the controller drives.
- patterns/feedback-control-loop-for-rollouts — structural kin: closed-loop control at a different altitude (fleet rollouts), with the same reset-to-safe-on-stall discipline.