
ZALANDO 2021-03-01


Zalando — Building an End to End load test automation system on top of Kubernetes

Summary

Zalando's Payments department (Alexander Yevsyukov & Murat Kilic, 2021-03-01) built an end-to-end load test automation system on top of Kubernetes to de-risk Cyber Week for the checkout + payments microservice landscape. The system has three named parts: a Locust-based distributed traffic generator packaged as a container, a Hoverfly mocking layer for external dependencies, and a Go microservice called the Load Test Conductor that owns the full load-test lifecycle via a declarative API. The conductor clones production's deployed versions and resource allocation into a dedicated test cluster, scales applications in Kubernetes and AWS ECS to match production, steers Locust workers via a closed-loop algorithm keyed to orders-per-minute, and scales everything back down after the run. Tests are invoked manually via the conductor's API or triggered on a schedule by a Kubernetes CronJob. Runs take ~2 hours for the Payment system. Evaluation is human-in-the-loop: a developer reads Grafana dashboards (latency / throughput / response-code rate) and SLO-breach alerts fire automatically.

The distinguishing architectural choices that anchor this post on the wiki are: (1) a declarative, Kubernetes-inspired API for load tests — the client describes the desired state of a load test (target orders-per-minute, ramp-up time, plateau time) rather than the imperative steps; (2) a closed-loop worker-count algorithm that recomputes Locust's hatch rate and user count every 60 seconds from the actually-observed orders-per-minute versus the target, instead of assuming a fixed users→orders ratio; (3) header-based routing via Skipper to let a single service instance dynamically switch between the real external dependency and its Hoverfly mock based on whether the incoming request carries a load-test tag — this is the substrate that lets load tests and real tests share the test cluster without cross-contamination; (4) the "test cluster where we can break things" framing — this load test runs in a non-production environment deliberately pushed to failure, and is described as complementary to in-production load tests, not a replacement — confirming Zalando's earlier live-load-test-in-production discipline while opening a sibling pattern for pre-prod capacity-discovery.

Operational numbers disclosed: ~2 hours per Payment load-test run; 60-second algorithm interval for the orders-per-minute closed loop; NodePool A for test infrastructure + NodePool B for Payment services + ECS for non-Kubernetes components (the three deployment substrates the conductor orchestrates simultaneously). The post does not disclose the target orders-per-minute, absolute throughput, or error-rate thresholds.

Operational gaps: no disclosure of worker count at peak, no p99 latency numbers, no breakdown of the Payment microservice count or call graph, no SLO numeric thresholds. Test evaluation is manual — the authors name this as an acceptable limitation "sufficient for us for the time being". External dependency mocking scope is enumerated qualitatively but without a full list.

Key takeaways

  • Declarative API for load tests. The Load Test Conductor exposes a simple API where the client declares the desired state of a load test — target orders-per-minute, ramp-up time, plateau time, which applications to include — and the conductor drives all phases (deploy, scale up, execute, scale down, clean up) to that state. "Executing a load test is now just one API call away!" The architectural inspiration is explicit: "Our service design was heavily influenced by what Kubernetes popularized for infrastructure management. We wanted our system to be a declarative system." (Source: this article.)
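
The post does not reproduce the API payload, but the declared desired state can be sketched as a request body like the following — every field name and value here is hypothetical, inferred only from the parameters the post names (target orders-per-minute, ramp-up time, plateau time, applications to include):

```json
{
  "targetOrdersPerMinute": 5000,
  "rampUpMinutes": 30,
  "plateauMinutes": 60,
  "deployProductionVersions": true,
  "applications": ["payment-gateway", "payment-ledger"]
}
```

The conductor then converges the test cluster to this state — the same desired-state / reconciliation idiom Kubernetes uses for Deployments.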

  • KPI-driven closed-loop ramp-up, not fixed users→orders mapping. The hatch-rate algorithm runs on a 60-second cadence: get current Locust status + current orders-per-minute → compute total users needed to hit the target orders-per-minute using the currently observed users-to-orders-per-minute ratio → compute users-to-spawn in this iteration → compute hatch rate → push to Locust API. This self-corrects for drift in the user-to-order conversion rate (driven by latency, retries, queueing) that a fixed ratio would miss. Stall detection is built in: if orders-per-minute is zero, reset hatch rate to 1 and user count to 1 and report "load test stalled". (Source: this article, "Load generation pseudo code".)

  • Header-based routing via Skipper switches real vs mock dependency per request. Services in the test cluster can be shared between the load test and other tests. To avoid cross-contamination, each service inspects the incoming request's headers (the load-test tag) and routes either to the real external dependency or to its Hoverfly mock. The Skipper ingress layer carries the routing rule — a single L7 routing primitive replaces a second deployment of the service. (Source: this article.)
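
Skipper routes are written in eskip; the per-request switch could look like the following sketch, where the header name, paths, and backends are all illustrative (the post does not name the load-test tag):

```
// load-test traffic goes to the Hoverfly mock
mock: Path("/psp/authorize") && Header("X-Load-Test", "true")
  -> "http://hoverfly.loadtest.svc.cluster.local:8500";

// everything else goes to the real external dependency
real: Path("/psp/authorize")
  -> "https://psp.example.com";
```

Skipper prefers the route whose predicates match more specifically, so one service deployment serves both traffic classes without redeployment.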

  • Production-version cloning: the Deployer + Scaler subcomponents. The conductor has a Deployer component that queries Zalando's Continuous Delivery Platform via the Kubernetes client to find the exact versions deployed in production, then triggers deployments of those versions in the test cluster. The Scaler component then scales the test cluster's applications to match production's resource allocation and replica count. Scaling supports both Kubernetes and AWS ECS environments. After the load test completes (or fails), the Scaler reverts to the pre-test state as a cost-saving measure. (Source: this article, "Deployment and Scaling".)

  • Locust chosen over Vegeta / JMeter on developer-familiarity, not technical superiority. The post names Locust, Vegeta, and JMeter as the initial shortlist; filters JMeter out because "JMeter not being popular internally", then chooses Locust over Vegeta because "it was more popular within our development teams, thus the test suite would be easier to maintain." This is a clean instance of developer-familiarity beating technical optimality in load-test tooling selection. (Source: this article.)

  • Hoverfly over Wiremock / MockServer on stateful behavior, record-and-replay, and language-agnosticism. The post includes a 5-tool comparison table (Mobtest, Wiremock, Mockserver, Mockoon, Hoverfly) across 10 dimensions. Hoverfly is picked for (a) record-and-replay support — only Mockserver and Hoverfly have it; (b) stateful behavior via key-value map; (c) language-agnosticism (Go binary with HTTP API). The config is a static JSON simulation file started via hoverfly -webserver -import simulation.json. (Source: this article, "Mock External Dependencies".)

  • Test cluster as "break things" environment; complementary to live-prod load testing, not replacement. Authors are explicit: "Since these load tests are conducted in a non-production environment, we could stress the services till they fail. In combination with load tests in production, this was essential for preparing our production services for higher load." This positions the conductor as a sibling discipline to the live-prod load-testing described in sources/2020-10-07-zalando-how-zalando-prepares-for-cyber-week, not a substitute. The break-things cluster surfaces failures that live-prod load-tests cannot produce (because customer impact is the abort condition in prod). (Source: this article.)

  • First quality gate + CronJob-scheduled + developer-API-triggered. Load tests run on a cron schedule via a Kubernetes CronJob that calls the conductor's API. Developers can also trigger runs manually via the same API. The combined workflow makes the load test "the first quality gate of the Payment system" — it happens regularly and repeatably, not only before Cyber Week. Because the deploy-production-versions step can be made optional, developers can also load-test individual feature branches. (Source: this article.)
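
The scheduled trigger is plain Kubernetes machinery; a minimal sketch follows, with every name, image, schedule, and endpoint being an assumption rather than something the post discloses:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: payment-load-test
spec:
  schedule: "0 2 * * 1-5"      # hypothetical: nightly on weekdays
  concurrencyPolicy: Forbid    # never overlap two load-test runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trigger
              image: curlimages/curl:8.7.1
              args:
                - "-XPOST"
                - "--data-binary"
                - "@/config/loadtest.json"
                - "http://load-test-conductor.loadtest.svc/loadtests"
```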

  • Operational friction called out honestly. Three pain points: (1) concurrent deployments during load tests caused services to point to under-resourced pods (a race between the conductor's scaler and unrelated CI deployments); (2) infrastructure-layer components like cluster node type, databases, and centrally managed event queues (Nakadi) had to be adjusted for production-parity, requiring cross-team alignment; (3) manual result evaluation via Grafana is acknowledged as "sufficient for the time being" — automation is explicitly deferred. (Source: this article, "Conclusions".)

Architecture in more detail

Load Test Conductor internals

Go microservice. API-driven. Owns the full lifecycle defined as:

  1. Deploy production versions into test cluster — Deployer component queries the CDP for the currently-deployed production artifacts, triggers deployments, waits for completion.
  2. Scale applications to production resource/replica configuration — Scaler component, Kubernetes client + AWS ECS API.
  3. Generate load via Locust — Conductor calls Locust API to set hatch rate + user count; runs the KPI closed-loop algorithm every 60s.
  4. Scale applications down to pre-test state — cost saving.
  5. Clean up test data / databases — remove orders created by the simulator so the state does not drift.
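
The five phases above can be sketched as a sequential pipeline. This is a minimal illustration, not the conductor's actual code — the component names follow the post, but all types and signatures are assumed:

```go
package main

import "fmt"

// phase is one step of the conductor's load-test lifecycle.
type phase struct {
	name string
	run  func() error
}

// runLifecycle executes phases in order and stops at the first failure.
func runLifecycle(phases []phase) error {
	for _, p := range phases {
		fmt.Println("phase:", p.name)
		if err := p.run(); err != nil {
			return fmt.Errorf("%s: %w", p.name, err)
		}
	}
	return nil
}

func main() {
	noop := func() error { return nil } // stand-in for real component calls
	err := runLifecycle([]phase{
		{"deploy production versions", noop},    // Deployer: query CDP, roll out
		{"scale up to production parity", noop}, // Scaler: Kubernetes + ECS
		{"generate load", noop},                 // Locust closed loop, 60s cadence
		{"scale down to pre-test state", noop},  // cost saving
		{"clean up test data", noop},            // remove simulator-created orders
	})
	if err != nil {
		fmt.Println("load test failed:", err)
	}
}
```

In the real system the scale-down and clean-up phases must also run on failure (the post notes the Scaler reverts state after the test completes or fails), which a deferred teardown would handle.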

Load generation algorithm (reproduced pseudocode)

set initial number of users to 1
set calculation interval to 60 seconds
while load test time has not exceeded
    get locust status
    calculate orders per defined calculation interval
    calculate orders per minute
    if user count in locust status is equal to zero
        print "load test is being initialized."
        set loadtest hatch rate to one
        set loadtest user count to initial number of users
        set loadtest orders per minute to 0
    else if orders per minute equal to zero
        print "load test stalled due to no orders getting generated."
        set loadtest hatch rate to one
        set loadtest user count to one
    else
        calculate total users needed to achieve target orders per minute
          using current locust users per minute rate and orders per minute rate.
        calculate users that need to be created.
        calculate time left for the load test.
        calculate iterations left for the load test.
        calculate users to spawn in this iteration.
        calculate hatch rate.
        set loadtest hatch rate to calculated hatch rate
        set loadtest user count to calculated users
    update locust with load test parameters, this triggers load generation.
    sleep for calculation interval time.
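
The pseudocode's core arithmetic can be expressed as one pure function. This sketch is an interpretation of the post's algorithm; the rounding and clamping choices are assumptions:

```go
package main

import "fmt"

// computeRamp performs one 60-second iteration of the closed-loop ramp-up:
// it derives the users needed from the *observed* orders-per-user ratio
// rather than a fixed users→orders mapping.
func computeRamp(currentUsers, ordersPerMin, targetOrdersPerMin, iterationsLeft int) (hatchRate, userCount int) {
	if currentUsers == 0 {
		return 1, 1 // initializing: start with a single user
	}
	if ordersPerMin == 0 {
		return 1, 1 // stalled: no orders generated, reset and report
	}
	// Observed conversion: orders each simulated user yields per minute.
	ordersPerUser := float64(ordersPerMin) / float64(currentUsers)
	totalUsersNeeded := int(float64(targetOrdersPerMin)/ordersPerUser + 0.5)
	usersToCreate := totalUsersNeeded - currentUsers
	if usersToCreate <= 0 {
		return 1, totalUsersNeeded // already at or above target
	}
	if iterationsLeft < 1 {
		iterationsLeft = 1
	}
	// Spread the remaining users over the iterations left in the ramp-up.
	spawn := usersToCreate / iterationsLeft
	if spawn < 1 {
		spawn = 1
	}
	// Hatch rate is users spawned per second across the 60s interval.
	hatch := spawn / 60
	if hatch < 1 {
		hatch = 1
	}
	return hatch, currentUsers + spawn
}

func main() {
	// 200 users currently produce 400 orders/min; target is 4000 with
	// 10 ramp-up iterations left → spawn 180 users at 3 users/second.
	h, u := computeRamp(200, 400, 4000, 10)
	fmt.Println(h, u) // 3 380
}
```

Because the ratio is re-sampled every iteration, rising latency (fewer orders per user) automatically pushes the user count higher — exactly the drift-correction the bullet above describes.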

Substrate layout

  • NodePool A — the load-testing system itself (Locust controller + workers, Load Test Conductor, Hoverfly mocks).
  • NodePool B — the Payments microservices under test.
  • AWS ECS — non-Kubernetes components of the Payment platform.

The conductor orchestrates all three simultaneously — a single declarative load-test run fans out scaling actions across two Kubernetes node pools and an ECS cluster.
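
The fan-out can be pictured as one interface with a backend per substrate. A hypothetical sketch — none of these type or method names come from the post:

```go
package main

import "fmt"

// Scaler abstracts one deployment substrate the conductor drives.
type Scaler interface {
	ScaleTo(app string, replicas int) error
}

// kubernetesScaler targets one node pool via the Kubernetes API.
type kubernetesScaler struct{ nodePool string }

func (k kubernetesScaler) ScaleTo(app string, replicas int) error {
	fmt.Printf("kubernetes[pool %s]: %s -> %d replicas\n", k.nodePool, app, replicas)
	return nil
}

// ecsScaler targets the non-Kubernetes components via the AWS ECS API.
type ecsScaler struct{}

func (ecsScaler) ScaleTo(app string, replicas int) error {
	fmt.Printf("ecs: %s -> %d tasks\n", app, replicas)
	return nil
}

func main() {
	// One declarative run fans the production-parity replica count
	// out to every substrate.
	substrates := []Scaler{
		kubernetesScaler{nodePool: "B"}, // Payment services under test
		ecsScaler{},                     // non-Kubernetes Payment components
	}
	for _, s := range substrates {
		if err := s.ScaleTo("payment-gateway", 12); err != nil {
			fmt.Println("scale failed:", err)
		}
	}
}
```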

Dependency mock switching

Header propagation: every simulator-generated HTTP request is tagged with a load-test header. Services in the test cluster have Skipper filter chains that read this header and route to either the real dependency or to a Hoverfly mock (hoverfly -webserver -import simulation.json). The mock simulation file (JSON) declares request matchers (path, method) and templated responses (status, body, headers). Hoverfly's template engine supports dynamic fields like current datetime.
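
A simulation file in this shape (following Hoverfly's v5 simulation schema; the endpoint, body, and values are invented for illustration) would be loaded with the hoverfly -webserver -import simulation.json invocation the post shows:

```json
{
  "data": {
    "pairs": [
      {
        "request": {
          "method": [{ "matcher": "exact", "value": "POST" }],
          "path": [{ "matcher": "exact", "value": "/psp/authorize" }]
        },
        "response": {
          "status": 200,
          "headers": { "Content-Type": ["application/json"] },
          "body": "{\"status\": \"AUTHORIZED\", \"reference\": \"loadtest-0001\"}"
        }
      }
    ]
  },
  "meta": { "schemaVersion": "v5" }
}
```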

Monitoring + evaluation

Grafana dashboards show latency, throughput, and response-code rates. A human operator reads the graphs to decide pass/fail; automated alerts fire on SLO breaches during test execution.

Caveats / limitations

  • Test evaluation is manual. No automated pass/fail gate on aggregated metrics. Authors say this is sufficient for now.
  • Post-test cleanup scope is not fully specified. The conductor deletes orders created by the simulator, but downstream state (payment records, audit logs, Nakadi events) cleanup is not described.
  • Test cluster ≠ production. Infrastructure-level parity (node types, databases, event queues) required cross-team effort to achieve. Some gaps may persist.
  • No numeric targets disclosed. Target orders-per-minute, p99 latency, error-rate thresholds are not in the post.
  • No disclosure of production throughput. The post establishes the pattern without quantifying the Cyber Week traffic it simulates.
