CONCEPT Cited by 1 source

Test cluster as break-things environment

Definition

A break-things test cluster is a pre-production environment that is deliberately scaled, versioned, and configured to match production, but whose explicit purpose is to let engineers push load until the system fails. Doing the same in production is unsafe because it would degrade service for real customers.

Zalando's framing, verbatim (Source: sources/2021-03-01-zalando-building-an-end-to-end-load-test-automation-system-on-top-of-kubernetes):

"In order to really push our services to the edge, we wanted to run the load testing system in our test cluster, as this enables us to break things when necessary without causing customer impact."

The key property: discovery past failure

This concept answers a specific question that live-load-test-in-production cannot answer:

"What happens to the system when load exceeds capacity?"

Live-prod load testing's governing constraint is abort on customer impact. You cannot observe post-saturation failure modes (retry storms, queue-overflow behaviour, degraded-dependency fallbacks) because the abort fires first.
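
The effect of the abort constraint can be illustrated with a toy model. The sketch below is not Zalando's tooling: the `ramp` function, the error model (every request beyond capacity fails), and all numbers are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Step:
    offered_rps: int
    error_rate: float  # fraction of requests that fail at this step


def error_rate(offered_rps: int, capacity_rps: int) -> float:
    """Toy model (assumption): every request beyond capacity fails."""
    if offered_rps <= capacity_rps:
        return 0.0
    return (offered_rps - capacity_rps) / offered_rps


def ramp(capacity_rps, start_rps, step_rps, max_rps, abort_threshold=None):
    """Step the offered load upward and record each step.

    With abort_threshold set (live-prod style), the ramp stops at the
    first step whose error rate exceeds the threshold. With None
    (break-things style), it continues to max_rps.
    """
    observed = []
    for offered in range(start_rps, max_rps + 1, step_rps):
        rate = error_rate(offered, capacity_rps)
        observed.append(Step(offered, rate))
        if abort_threshold is not None and rate > abort_threshold:
            break
    return observed


# Hypothetical service with 1000 rps capacity.
live_prod = ramp(1000, 500, 250, 2000, abort_threshold=0.01)
break_things = ramp(1000, 500, 250, 2000)

print(len(live_prod), live_prod[-1].offered_rps)        # stops just past capacity
print(len(break_things), break_things[-1].offered_rps)  # runs the full ramp
```

In this model the live-prod ramp records a single data point past saturation before aborting, while the unconstrained ramp records every step up to the maximum, which is where the post-saturation failure modes would show up in a real system.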

A break-things cluster has no abort constraint (no real customers), so the operator can:

  • Observe the first bottleneck, fix it, and re-run.
  • Run multiple saturation experiments per week.
  • Validate fail-safe behaviour (circuit breakers tripping, bulkheads isolating, degraded-mode fallbacks engaging).
  • Measure unsafe numerics (memory growth at 2x load, disk-spill behaviour, thread-pool exhaustion) that customer-facing monitoring would suppress.
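
The "observe the first bottleneck, fix it, and re-run" loop from the list above can be sketched as follows. This is a simplified model under stated assumptions: each component has a single capacity number, the first bottleneck is the component with the lowest capacity, and a "fix" multiplies that capacity by a fixed factor. The component names and figures are hypothetical.

```python
from __future__ import annotations


def first_bottleneck(capacities: dict[str, int]) -> tuple[str, int]:
    """Return the component that saturates first (lowest capacity)."""
    name = min(capacities, key=capacities.get)
    return name, capacities[name]


def explore(capacities, target_rps, fix_factor=2.0, max_runs=10):
    """Repeat: run to failure, note the bottleneck, fix it, re-run.

    Stops once the weakest component can carry target_rps, or after
    max_runs experiments. Returns the bottlenecks found, in order, and
    the capacities after all fixes.
    """
    caps = dict(capacities)  # do not mutate the caller's dict
    history = []
    for _ in range(max_runs):
        name, limit = first_bottleneck(caps)
        if limit >= target_rps:
            break
        history.append((name, limit))
        caps[name] = int(limit * fix_factor)  # hypothetical "fix"
    return history, caps


# Hypothetical component capacities in requests per second.
components = {"api": 1200, "database": 800, "event_queue": 1500}
history, final = explore(components, target_rps=1400)

for name, limit in history:
    print(f"bottleneck: {name} at {limit} rps")
```

Each iteration corresponds to one saturation experiment. Because a break-things cluster carries no customer risk, running several such iterations per week is feasible in a way it would not be in production.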

The required invariant: production parity

A break-things cluster only produces transferable insight if it is close enough to production that failure modes reproduce. Parity is expensive.

Zalando's post is explicit that the parity effort is ongoing and imperfect: "Several infrastructure components like cluster node type, databases, centrally managed event queues had to be adjusted for similarity with the production environment. This required a lot of communication effort and alignment with teams managing the services."
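
One way to keep parity drift visible is to diff a description of the test cluster against production. The sketch below is an illustration, not something the Zalando post describes: the `parity_gaps` helper, the keys (`node_type`, `database_instance`, `event_queue_partitions`), and all values are hypothetical stand-ins for the component categories named in the quote.

```python
def parity_gaps(prod: dict, test: dict) -> dict:
    """Return {key: (prod_value, test_value)} for every setting that differs.

    A key missing on one side is reported with None on that side.
    """
    gaps = {}
    for key in sorted(prod.keys() | test.keys()):
        prod_value, test_value = prod.get(key), test.get(key)
        if prod_value != test_value:
            gaps[key] = (prod_value, test_value)
    return gaps


# Hypothetical environment descriptions.
prod = {
    "node_type": "type-a",
    "database_instance": "db-large",
    "event_queue_partitions": 32,
}
test = {
    "node_type": "type-a",
    "database_instance": "db-small",
    "event_queue_partitions": 8,
}

for key, (prod_value, test_value) in parity_gaps(prod, test).items():
    print(f"{key}: prod={prod_value} test={test_value}")
```

A report like this does not close the gaps, which per the quote above requires coordination with the owning teams, but it makes explicit which results from the test cluster should be treated with caution.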

Complementary, not replacement

Zalando's conclusion is explicit:

"Since these load tests are conducted in a non-production environment, we could stress the services till they fail. In combination with load tests in production, this was essential for preparing our production services for higher load."

The two disciplines answer different questions:

|                  | Break-things test cluster    | Live-prod load test                  |
|------------------|------------------------------|--------------------------------------|
| Question         | Where does it fail?          | What's today's sustainable capacity? |
| Abort constraint | None                         | Customer impact                      |
| Fidelity         | Good (parity cost)           | Perfect (it's prod)                  |
| Run frequency    | High (no prod risk)          | Low (every run costs)                |
| Output           | Bottleneck list + fail modes | Confidence capacity number           |

A mature org runs both. The break-things cluster explores; live-prod load tests verify.

Anti-patterns

  • Scaling the test cluster smaller than prod: failure modes don't reproduce, and the test becomes a test of the test cluster, not of production.
  • Skipping infrastructure-layer parity: the application looks fine until the shared database or event bus becomes the bottleneck. If the test cluster's version of that dependency is undersized, it bottlenecks earlier than prod would and produces a false negative on "ready for peak".
  • Reading break-things results as capacity numbers: the purpose is failure-mode discovery, not peak-minute commitment. Capacity numbers come from live-prod load tests.
  • Treating the cluster as a permanent staging environment: it's a load-test substrate, and using it for routine QA dilutes its break-things charter.
