PATTERN

KPI closed-loop load ramp-up

What this is

KPI closed-loop load ramp-up is the pattern of driving a load test's ramp schedule by a closed-loop feedback controller that targets a business KPI (orders-per-minute, checkouts-per-second), not a technical quantity (concurrent users, requests-per-second). Every control cycle the controller re-measures the actual KPI, recomputes the users-per-KPI ratio from observation, and adjusts the load generator's worker count + spawn rate to close the gap.

This is the opposite of the more common open-loop fixed-ramp approach ("spawn N users over T minutes, then hold"), which schedules concurrency but produces an unknown KPI shape.

The concrete algorithm (Zalando instantiation)

From sources/2021-03-01-zalando-building-an-end-to-end-load-test-automation-system-on-top-of-kubernetes, reproduced as the conductor's 60-second loop:

target_opm     = <configured>
interval       = 60s
time_left      = test_duration
while time_left > 0:
    status = locust.status()            # users, orders-seen
    current_opm = delta_orders(last 60s)
    if status.users == 0:
        # initialising
        locust.set(user_count=1, hatch_rate=1)
    elif current_opm == 0:
        # stalled
        locust.set(user_count=1, hatch_rate=1)
        alert("load test stalled")
    else:
        users_per_opm   = status.users / current_opm
        required_users  = users_per_opm * target_opm
        to_spawn        = required_users - status.users
        iters_left      = time_left / interval
        this_iter_users = to_spawn / iters_left
        hatch_rate      = this_iter_users / interval
        locust.set(user_count=required_users, hatch_rate=hatch_rate)
    sleep(interval)
    time_left -= interval
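The decision step of that loop can be sketched as a runnable pure function. The `Action` dataclass and parameter names below are illustrative, not Zalando's code or Locust's real client API:

```python
from dataclasses import dataclass

@dataclass
class Action:
    user_count: int
    hatch_rate: float
    stalled: bool = False

def control_step(users: int, current_opm: float, target_opm: float,
                 time_left_s: float, interval_s: float = 60.0) -> Action:
    """One conductor iteration: from the observed orders-per-minute,
    decide the next user count and hatch rate."""
    if users == 0:
        # initialising: seed a single user so a ratio can be observed
        return Action(user_count=1, hatch_rate=1.0)
    if current_opm == 0:
        # stalled: reset to a safe single user and flag for alerting
        return Action(user_count=1, hatch_rate=1.0, stalled=True)
    users_per_opm = users / current_opm            # re-measured every cycle
    required_users = users_per_opm * target_opm    # users needed at the current ratio
    to_spawn = required_users - users
    iters_left = max(time_left_s / interval_s, 1.0)
    this_iter_users = to_spawn / iters_left        # spread the ramp over remaining cycles
    hatch_rate = max(this_iter_users / interval_s, 0.0)
    return Action(user_count=round(required_users), hatch_rate=hatch_rate)
```

With 100 users currently producing 10 orders/minute against a 100 orders/minute target and 10 minutes left, this asks for 1,000 users spawned at 1.5 users/second.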

Why KPI, not users

The ratio of concurrent users to business-KPI throughput is not constant under load:

  • Latency grows → fewer orders per user per minute.
  • Retries multiply → more concurrent requests per order.
  • Queueing stretches tail latency → concurrent-user budgets blow out without additional throughput.

An open-loop ramp of "users over time" produces an arbitrary KPI shape and often under-loads the system when it needs to be at peak for the test to produce signal.

Closed-loop-on-KPI self-corrects: when the ratio worsens, the controller spawns more users to keep target KPI on schedule — which is exactly what the real peak event would demand.

Why closed-loop, not one-shot

A one-shot calculation ("10 users gives us 1 order/minute, so spawn 1000 users for 100 orders/minute") bakes in a single observation of the users-per-KPI ratio. The moment the system degrades (and it will; that's the point of load-testing), the baked-in ratio is wrong. Closed-loop control re-measures continuously and carries no fixed assumption about the ratio.
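The difference can be made concrete with a toy plant model in which the users-per-OPM ratio worsens linearly with load. The degradation curve and all constants here are invented for illustration, not measured numbers:

```python
def achieved_opm(users: float) -> float:
    # Toy plant: 10 users per order/minute at idle, with the ratio
    # worsening linearly as load grows (constants are invented).
    return users / (10.0 * (1.0 + users / 5000.0))

target = 100.0  # orders per minute

# One-shot: measure the ratio once at 10 users, then freeze it.
ratio_at_10 = 10 / achieved_opm(10)
one_shot_opm = achieved_opm(ratio_at_10 * target)

# Closed-loop: re-measure the ratio every cycle and re-aim.
users = 10.0
for _ in range(20):
    ratio = users / achieved_opm(users)
    users = ratio * target
closed_loop_opm = achieved_opm(users)
```

The one-shot estimate, priced at the near-idle ratio, lands around 83 orders/minute; the closed loop converges onto the 100 target because each cycle re-prices the ratio at the current load.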

Stall detection + reset-to-safe

Two explicit guards in Zalando's pseudocode:

  • user_count == 0 → "load test is being initialized": reset hatch_rate and user_count to 1.
  • current_opm == 0 while users > 0 → "load test stalled": reset hatch_rate and user_count to 1, and alert.

The pattern: on stuck states, reset rather than compound — the closed-loop controller without guards will happily spawn more and more users against a silent system.

Required inputs

  • A business KPI the system reports per interval. For Zalando, orders-per-minute is observable from Locust + the payment system's side. The interval must be short enough (1 minute at Zalando) that the feedback lag doesn't destabilise the ramp.
  • A load-generator API that accepts runtime hatch-rate + user-count updates. Locust supports this natively; custom harnesses need to expose the same.
  • Smoothing tolerance for noisy KPIs. Orders-per-minute is intrinsically noisy; the 60s interval at Zalando is the authors' chosen trade-off between responsiveness and oscillation.
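Zalando's 60-second interval is itself a smoothing choice. If the KPI is still too noisy at that interval, an exponentially weighted moving average is one generic way (not from the source) to damp it before it feeds the controller:

```python
def ewma(samples, alpha=0.5):
    """Exponentially weighted moving average: one simple way to damp a
    noisy per-interval KPI before it feeds the controller."""
    smoothed, out = None, []
    for x in samples:
        smoothed = x if smoothed is None else alpha * x + (1 - alpha) * smoothed
        out.append(smoothed)
    return out
```

A higher alpha reacts faster but passes more noise through; a lower alpha damps harder but adds feedback lag on top of the measurement interval.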

Limitations

  • Unreachable targets push to failure. If no number of users can produce the target KPI, the controller keeps spawning users forever. Target feasibility needs to be estimated separately, with abort criteria layered on top.
  • Cascading dependencies can mask KPI. If orders-per-minute is reported only post-payment-confirmation, and payment confirmation is async, the signal lags the load and the controller oscillates.
  • Doesn't exercise peak-above-target failure modes. A pure KPI-driven controller aims at target and no higher. Explicit over-shoot load tests need a different mode.
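An abort guard layered on top of the controller might look like the sketch below. Every threshold (user budget, baseline ratio, allowed ratio growth) is a hypothetical default that would need estimating per system:

```python
def should_abort(users: int, current_opm: float, target_opm: float,
                 max_users: int = 5000,
                 baseline_users_per_opm: float = 10.0,
                 max_ratio_growth: float = 5.0) -> bool:
    """Abort criteria layered on top of the controller. Fires when the
    user budget is exhausted, the users-per-OPM ratio has degraded far
    beyond its baseline, or hitting the target would blow the budget."""
    if users >= max_users:
        return True  # user budget exhausted
    if current_opm > 0:
        ratio = users / current_opm
        if ratio > baseline_users_per_opm * max_ratio_growth:
            return True  # ratio degraded far beyond the measured baseline
        if ratio * target_opm > max_users:
            return True  # target projects past the user budget: infeasible
    return False
```

Checked before each control cycle, this turns "keeps spawning users forever" into a bounded, alertable failure.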
