PATTERN

KPI closed-loop load ramp-up

What this is

KPI closed-loop load ramp-up is the pattern of driving a load test's ramp schedule by a closed-loop feedback controller that targets a business KPI (orders-per-minute, checkouts-per-second), not a technical quantity (concurrent users, requests-per-second). Every control cycle the controller re-measures the actual KPI, recomputes the users-per-KPI ratio from observation, and adjusts the load generator's worker count + spawn rate to close the gap.

This is the opposite of the more common open-loop fixed-ramp approach ("spawn N users over T minutes, then hold"), which schedules concurrency but produces an unknown KPI shape.

The concrete algorithm (Zalando instantiation)

From sources/2021-03-01-zalando-building-an-end-to-end-load-test-automation-system-on-top-of-kubernetes, reproduced as the conductor's 60-second loop:

target_opm     = <configured>
interval       = 60s
time_left      = test_duration
while time_left > 0:
    status = locust.status()            # users, orders-seen
    current_opm = delta_orders(last 60s)
    if status.users == 0:
        # initialising
        locust.set(user_count=1, hatch_rate=1)
    elif current_opm == 0:
        # stalled
        locust.set(user_count=1, hatch_rate=1)
        alert("load test stalled")
    else:
        users_per_opm   = status.users / current_opm
        required_users  = users_per_opm * target_opm
        to_spawn        = required_users - status.users
        iters_left      = time_left / interval
        this_iter_users = to_spawn / iters_left
        hatch_rate      = this_iter_users / interval
        locust.set(user_count=required_users, hatch_rate=hatch_rate)
    sleep(interval)
    time_left -= interval
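The decision step of that loop can be sketched as a runnable pure function. The `Action` dataclass and parameter names below are illustrative, not Zalando's code or Locust's real client API:

```python
from dataclasses import dataclass

@dataclass
class Action:
    user_count: int
    hatch_rate: float
    stalled: bool = False

def control_step(users: int, current_opm: float, target_opm: float,
                 time_left_s: float, interval_s: float = 60.0) -> Action:
    """One conductor iteration: from the observed orders-per-minute,
    decide the next user count and hatch rate."""
    if users == 0:
        # initialising: seed a single user so a ratio can be observed
        return Action(user_count=1, hatch_rate=1.0)
    if current_opm == 0:
        # stalled: reset to a safe single user and flag for alerting
        return Action(user_count=1, hatch_rate=1.0, stalled=True)
    users_per_opm = users / current_opm            # re-measured every cycle
    required_users = users_per_opm * target_opm    # users needed at the current ratio
    to_spawn = required_users - users
    iters_left = max(time_left_s / interval_s, 1.0)
    this_iter_users = to_spawn / iters_left        # spread the ramp over remaining cycles
    hatch_rate = max(this_iter_users / interval_s, 0.0)
    return Action(user_count=round(required_users), hatch_rate=hatch_rate)
```

With 100 users currently producing 10 orders/minute against a 100 orders/minute target and 10 minutes left, this asks for 1,000 users spawned at 1.5 users/second.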

Why KPI, not users

The ratio of concurrent users to business-KPI throughput is not constant under load:

  • Latency grows → fewer orders per user per minute.
  • Retries multiply → more concurrent requests per order.
  • Queueing stretches tail latency → concurrent-user budgets blow out without additional throughput.

An open-loop ramp of "users over time" produces an arbitrary KPI shape and often under-loads the system when it needs to be at peak for the test to produce signal.

Closed-loop-on-KPI self-corrects: when the ratio worsens, the controller spawns more users to keep target KPI on schedule — which is exactly what the real peak event would demand.

Why closed-loop, not one-shot

A one-shot calculation ("10 users gives us 1 order/minute, so spawn 1000 users for 100 orders/minute") bakes in a single observation of the users-per-KPI ratio. The moment the system degrades (and it will; that's the point of load-testing), the baked-in ratio is wrong. Closed-loop control re-measures continuously and carries no fixed assumption about the ratio.
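The difference can be made concrete with a toy plant model in which the users-per-OPM ratio worsens linearly with load. The degradation curve and all constants here are invented for illustration, not measured numbers:

```python
def achieved_opm(users: float) -> float:
    # Toy plant: 10 users per order/minute at idle, with the ratio
    # worsening linearly as load grows (constants are invented).
    return users / (10.0 * (1.0 + users / 5000.0))

target = 100.0  # orders per minute

# One-shot: measure the ratio once at 10 users, then freeze it.
ratio_at_10 = 10 / achieved_opm(10)
one_shot_opm = achieved_opm(ratio_at_10 * target)

# Closed-loop: re-measure the ratio every cycle and re-aim.
users = 10.0
for _ in range(20):
    ratio = users / achieved_opm(users)
    users = ratio * target
closed_loop_opm = achieved_opm(users)
```

The one-shot estimate, priced at the near-idle ratio, lands around 83 orders/minute; the closed loop converges onto the 100 target because each cycle re-prices the ratio at the current load.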

Stall detection + reset-to-safe

Two explicit guards in Zalando's pseudocode:

  • user_count == 0 → "load test is being initialized": reset hatch_rate and user_count to 1.
  • current_opm == 0 while users > 0 → "load test stalled": reset hatch_rate and user_count to 1, and alert.

The pattern: on stuck states, reset rather than compound — the closed-loop controller without guards will happily spawn more and more users against a silent system.

Required inputs

  • A business KPI the system reports per interval. For Zalando, orders-per-minute is observable from Locust + the payment system's side. The interval must be short enough (1 minute at Zalando) that the feedback lag doesn't destabilise the ramp.
  • A load-generator API that accepts runtime hatch-rate + user-count updates. Locust supports this natively; custom harnesses need to expose the same.
  • Smoothing tolerance for noisy KPIs. Orders-per-minute is intrinsically noisy; the 60s interval at Zalando is the authors' chosen trade-off between responsiveness and oscillation.
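Zalando's 60-second interval is itself a smoothing choice. If the KPI is still too noisy at that interval, an exponentially weighted moving average is one generic way (not from the source) to damp it before it feeds the controller:

```python
def ewma(samples, alpha=0.5):
    """Exponentially weighted moving average: one simple way to damp a
    noisy per-interval KPI before it feeds the controller."""
    smoothed, out = None, []
    for x in samples:
        smoothed = x if smoothed is None else alpha * x + (1 - alpha) * smoothed
        out.append(smoothed)
    return out
```

A higher alpha reacts faster but passes more noise through; a lower alpha damps harder but adds feedback lag on top of the measurement interval.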

Limitations

  • Unreachable targets push to failure. If no number of users can produce the target KPI, the controller keeps spawning users forever. Target feasibility needs to be estimated separately, with abort criteria layered on top.
  • Cascading dependencies can mask KPI. If orders-per-minute is reported only post-payment-confirmation, and payment confirmation is async, the signal lags the load and the controller oscillates.
  • Doesn't exercise peak-above-target failure modes. A pure KPI-driven controller aims at target and no higher. Explicit over-shoot load tests need a different mode.
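An abort guard layered on top of the controller might look like the sketch below. Every threshold (user budget, baseline ratio, allowed ratio growth) is a hypothetical default that would need estimating per system:

```python
def should_abort(users: int, current_opm: float, target_opm: float,
                 max_users: int = 5000,
                 baseline_users_per_opm: float = 10.0,
                 max_ratio_growth: float = 5.0) -> bool:
    """Abort criteria layered on top of the controller. Fires when the
    user budget is exhausted, the users-per-OPM ratio has degraded far
    beyond its baseline, or hitting the target would blow the budget."""
    if users >= max_users:
        return True  # user budget exhausted
    if current_opm > 0:
        ratio = users / current_opm
        if ratio > baseline_users_per_opm * max_ratio_growth:
            return True  # ratio degraded far beyond the measured baseline
        if ratio * target_opm > max_users:
            return True  # target projects past the user budget: infeasible
    return False
```

Checked before each control cycle, this turns "keeps spawning users forever" into a bounded, alertable failure.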
