PATTERN Cited by 2 sources
Live load test in production¶
Live load test in production is the ongoing discipline of running simulated user load against the real production system, with explicit traffic-source tagging, abort heuristics, and capacity-planning outputs. Distinct from pre-migration load testing (which validates a new platform) and from synthetic canaries (which validate health): this pattern is capacity discovery — finding the scalability limits of the current system in its real form.
When this pattern applies¶
- Cloud-auto-scaling stacks where over-provisioning is not an option. Zalando's framing: "In a cloud-based system that relies heavily on auto-scaling for cost-optimization, proper testing and capacity planning is a must."
- Large, interconnected systems where staging doesn't reflect prod. At 4,000+ applications (Zalando's fleet), no staging environment reproduces production's dependency graph, data shape, auto-scaler tuning, or cache state. Load tests run against anything smaller than prod test the test, not the system.
- Annual / seasonal peak events where commercial goals require a stated peak-capacity commitment and the engineering org has to produce a capacity-planning number the business can bank.
The shape¶
Zalando's two simulator families (canonical instance):
1. Sales-order simulator¶
- Places test sales orders on clearly-distinguishable test products.
- Orders flow through inventory management + payment processing, get to fulfillment, then skip — the simulator deliberately stops before a physical pick / pack.
- Primary measurement: scalability limits of the order-processing pipeline and every dependency in its call graph.
2. User-journey simulator¶
- Drives key customer touchpoints (browse → detail → cart → checkout → confirmation) across all countries Zalando operates in.
- Traffic shape mirrors sales-event patterns observed in prior years.
- Primary measurement: end-to-end latency + error-rate envelope across the shop under sales-event load.
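The journey driver's core loop is simple enough to sketch. A minimal illustration, with assumed step names taken from the touchpoint list above (the real simulator's endpoints, countries, and traffic shaping are not shown in the source):

```python
# Illustrative user-journey driver skeleton. Step names mirror the
# touchpoints above; everything else (countries, how a step is executed)
# is an assumption for the sketch.
JOURNEY = ["browse", "detail", "cart", "checkout", "confirmation"]

def run_journey(execute_step, country: str) -> list:
    """Drive one simulated customer through the key touchpoints,
    recording (country, step, latency_ms) tuples for the
    end-to-end latency / error-rate envelope."""
    results = []
    for step in JOURNEY:
        latency_ms = execute_step(country, step)  # caller issues the actual request
        results.append((country, step, latency_ms))
    return results
```

A real driver would run many of these journeys concurrently per country, shaped to the prior-year sales-event traffic curve.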
The three required guardrails¶
Running this pattern in production is only safe with all three:
a) Traffic-source tagging¶
Every simulator-generated request carries a load-test tag propagated via the tracing substrate — see concepts/traffic-source-tagging-in-traces. Without this, test traffic pollutes capacity-planning dashboards, triggers customer-impact alerts, and corrupts conversion metrics.
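The mechanics can be sketched in a few lines. The header name and value below are illustrative assumptions, not Zalando's actual convention; in practice the tag would ride the tracing substrate (e.g. trace baggage) rather than a bare header:

```python
# Sketch of traffic-source tagging. Header name/value are assumptions.
LOAD_TEST_HEADER = "x-traffic-source"
LOAD_TEST_VALUE = "load-test"

def tag_request(headers: dict) -> dict:
    """Simulator side: attach the load-test marker to an outgoing request."""
    tagged = dict(headers)
    tagged[LOAD_TEST_HEADER] = LOAD_TEST_VALUE
    return tagged

def is_load_test(headers: dict) -> bool:
    """Downstream side: should this request be excluded from
    capacity dashboards, customer-impact alerts, and conversion metrics?"""
    return headers.get(LOAD_TEST_HEADER) == LOAD_TEST_VALUE

def record_request(headers: dict, latency_ms: float, real_user_latencies: list) -> None:
    """Only real-user traffic feeds the customer-impact metrics."""
    if not is_load_test(headers):
        real_user_latencies.append(latency_ms)
```

The key property is propagation: every downstream service in the call graph must see the same tag, which is why the tracing substrate (not per-service convention) carries it.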
b) Abort-on-customer-impact¶
Zalando states the constraint directly: "Mistakes become really costly as the customer experience is degraded and thus this approach requires the ability to quickly notice customer impact and react by aborting the test or mitigating the incident otherwise." Concretely: a kill switch the operator can hit, plus an auto-abort on customer-impact metrics (real-user conversion, real-user latency) crossing thresholds.
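A minimal controller shape for the two abort paths, assuming illustrative thresholds (the real thresholds and metric definitions are org-specific and not given in the source):

```python
import threading

# Assumed thresholds for the sketch; real values come from the org's SLOs.
CONVERSION_FLOOR = 0.95    # abort if real-user conversion falls below 95% of baseline
LATENCY_CEILING_MS = 800   # abort if real-user p99 latency exceeds this

class LoadTestController:
    """Combines the operator kill switch with auto-abort on customer impact."""

    def __init__(self):
        self._abort = threading.Event()

    def kill_switch(self) -> None:
        """Manual abort: the operator hits this the moment impact is noticed."""
        self._abort.set()

    def check_customer_impact(self, conversion_ratio: float, p99_latency_ms: float) -> None:
        """Auto-abort: real-user metrics crossing thresholds stop the test."""
        if conversion_ratio < CONVERSION_FLOOR or p99_latency_ms > LATENCY_CEILING_MS:
            self._abort.set()

    @property
    def aborted(self) -> bool:
        return self._abort.is_set()
```

The load generators poll `aborted` (or subscribe to the event) and drain immediately when it fires; note the check runs against real-user metrics only, which is why traffic-source tagging is a prerequisite.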
c) Test-product / test-account isolation¶
Test orders are not real SKUs. The downstream systems (payment, inventory, fulfillment) either (a) recognize the test-product class and dry-run, or (b) have a hard invariant that filters test data at the fulfillment boundary. Getting this invariant wrong ships someone a test product.
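The boundary invariant can be sketched as a routing decision. The SKU prefix and field names below are illustrative assumptions about how the test-product class is marked:

```python
# Sketch of the fulfillment-boundary invariant. The prefix and field
# names are assumptions; the real marker is whatever makes test products
# "clearly distinguishable" in the org's catalog.
TEST_SKU_PREFIX = "SIM-"

def is_test_order(order: dict) -> bool:
    """Recognize the test-product class."""
    return bool(order.get("is_test")) or str(order.get("sku", "")).startswith(TEST_SKU_PREFIX)

def route_at_fulfillment(order: dict) -> str:
    """Hard invariant at the fulfillment boundary: simulator orders are
    diverted to a dry-run path and never reach physical pick/pack."""
    return "dry-run" if is_test_order(order) else "fulfill"
```

Whether the downstream dry-runs (option a) or filters (option b), the decision must live at the boundary itself, not in the simulator, so that a tagging bug upstream still cannot ship a test product.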
The buy-vs-build transition¶
Zalando's own exit from in-house simulators is instructive:
"Having written and evolved the user journey simulator for two years we were not fully satisfied with its abilities to generate load at scale. There were too many rough edges and tuning the simulator to be able to generate the required load profiles and investing our development time was very time consuming. We decided that it's better to leverage an existing product that will do the job better. This paid off heavily as last year we were able to run the tests both on App and Web platforms simultaneously."
Two principles:
- Build when you need the domain-specific integration (placing test orders with skip-fulfillment semantics is Zalando-specific and has to be in-house).
- Buy when the problem is generic load generation (the user-journey driver against HTTP / WebSocket endpoints is a commodity; off-the-shelf products have had millions of hours of tuning).
Outputs¶
The output of this pattern is not "did it pass"; it's:
- Peak-minute throughput the system is verified to sustain (Zalando: "the platform was scaled to sustain a certain amount of incoming traffic and sales in the peak minute").
- Scalability bottlenecks surfaced with service-level ownership for each.
- Auto-scaling parameter validation — were the HPA / ASG settings tuned enough for the projected peak?
- Resilience-pattern validation — "verify if certain resilience patterns work properly" (circuit breakers, retries, timeouts under load).
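The first output, the capacity-planning number, falls out of the test data directly. A sketch under assumed envelope values (error budget and latency budget are illustrative):

```python
def verified_peak_minute(samples):
    """samples: list of (orders_per_minute, error_rate, p99_ms) tuples
    observed at successive load levels during the test.

    The capacity-planning number the business can bank is the highest
    load level the system sustained while staying inside the
    error-rate / latency envelope (budgets below are assumptions)."""
    ERROR_BUDGET = 0.001
    P99_BUDGET_MS = 500
    ok = [opm for opm, err, p99 in samples
          if err <= ERROR_BUDGET and p99 <= P99_BUDGET_MS]
    return max(ok, default=0)
```

Load levels above the returned number are exactly where the bottleneck and resilience-pattern findings come from.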
Contrast¶
vs patterns/load-test-at-scale¶
- load-test-at-scale is pre-migration — proves the new platform can hold the intended workload.
- live-load-test-in-production is an ongoing discipline — proves the current system's capacity repeatedly, as it evolves, with real dependencies and real data.
vs concepts/chaos-engineering¶
- Chaos engineering induces failures; this pattern induces load. Complementary but orthogonal. A mature org does both.
Seen in¶
- sources/2020-10-07-zalando-how-zalando-prepares-for-cyber-week — canonical multi-year account of the discipline: two simulator generations, Cyber Week as the forcing function, the build→buy decision for the generic driver.
- sources/2021-03-01-zalando-building-an-end-to-end-load-test-automation-system-on-top-of-kubernetes — Zalando Payments' 2021 follow-up (~5 months later). Names live-prod load testing as complementary, not replacement to their new pre-prod break-things cluster: "Since these load tests are conducted in a non-production environment, we could stress the services till they fail. In combination with load tests in production, this was essential for preparing our production services for higher load." The two disciplines become sibling workstreams in the same org's Cyber-Week-prep capability stack.
Contrast with patterns/declarative-load-test-conductor¶
Same org (Zalando), five months apart, paired patterns:
| | live-load-test-in-production | declarative-load-test-conductor |
|---|---|---|
| Environment | Real production | Test cluster (prod-mirrored) |
| Abort constraint | Customer impact | None — push to failure |
| Primary output | Verified peak-minute capacity | Bottleneck + fail-mode list |
| Run frequency | Pre-peak-event | Scheduled (CronJob) |
| Evaluation | Real-customer metrics | Grafana + SLO breach alerts |
The 2021-03 post uses the phrase "first quality gate" for the conductor's pre-prod discipline — signalling that live-prod load testing remains the final confidence gate, with the conductor upstream of it catching regressions earlier.
Related¶
- concepts/traffic-source-tagging-in-traces — the enabling primitive.
- patterns/load-test-at-scale — sibling at different altitude.
- patterns/annual-peak-event-as-capability-forcing-function — the org-level forcing function.
- patterns/situation-room-for-peak-event — the on-event observational posture for which this pattern produces the confidence signal.
- patterns/declarative-load-test-conductor — the sibling pre-prod discipline Zalando built to complement this pattern.
- concepts/test-cluster-as-break-things-environment — the environment-level concept behind the sibling discipline.
- concepts/observability
- companies/zalando