
CONCEPT Cited by 1 source

Test reliability through simplification

Definition

Test reliability through simplification is the discipline of improving test-suite reliability by reducing the number of scenarios and the number of interactions per scenario — before (or instead of) adding retry logic, harder selectors, or better timing heuristics. The working hypothesis is that the probability of a flaky run compounds with each interaction, so halving the interactions roughly halves the flake rate.

Zalando names this explicitly (sources/2024-07-18-zalando-end-to-end-test-probes-with-playwright):

"What we needed was higher resiliency and one of the ways to achieve this is often through simplification. We decided that for the end-to-end test probes we would run a cron job with scenarios covering critical customer journeys. [...] By focusing on a smaller number of features and interactions, we were able to reduce the likelihood of false positives."

The arithmetic

Flake probability compounds multiplicatively. If each atomic interaction (click, navigation, assertion) has probability $p$ of succeeding, a scenario with $N$ interactions passes with probability $p^N$. For $p = 0.998$ per interaction:

Scenario size Pass rate
5 interactions 99.0 %
10 interactions 98.0 %
20 interactions 96.1 %
50 interactions 90.5 %
100 interactions 81.9 %
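The table is reproducible in a few lines. A minimal sketch (the `passRate` helper is invented here for illustration; the p = 0.998 figure comes from the table above, not from Zalando):

```typescript
// Probability that a scenario of n interactions passes, assuming each
// atomic interaction succeeds independently with probability p.
function passRate(p: number, n: number): number {
  return Math.pow(p, n);
}

// Reproduce the table for p = 0.998 per interaction.
const p = 0.998;
for (const n of [5, 10, 20, 50, 100]) {
  console.log(`${n} interactions: ${(passRate(p, n) * 100).toFixed(1)} %`);
}
// 5 interactions: 99.0 %  ...  100 interactions: 81.9 %
```

The same helper confirms the "halving interactions roughly halves the flake rate" hypothesis from the definition: at p = 0.998, cutting a scenario from 100 to 50 interactions drops the flake rate from ~18.1 % to ~9.5 %.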

The CI-era Zalando Cypress suite with broad coverage and many interactions per test sat at ~95 % — exactly the regime the compound-flake curve predicts. Probe-grade reliability requires scenarios that stay in the left half of the table.

Why simplification beats retry

  • Retry masks signal. A test that passes on the third retry still has something wrong; retrying hides the underlying issue and normalises fragility.
  • Retry compounds runtime. Each retry replays the scenario's full wall-clock cost; a scenario allowed 5 retries runs up to 5× slower in the worst case.
  • Retry doesn't help structural flakes. Selector instability across renders, race conditions on DOM reflow, and third-party timing issues don't resolve by waiting longer.
  • Simplification actually removes the interactions that are flaky. You can't have a flake on a step you didn't execute.
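The first two bullets can be put in numbers. A rough model, assuming independent per-interaction failures and whole-scenario retries (both simplifying assumptions, not claims about Zalando's setup):

```typescript
// Probability that an n-interaction scenario passes on one attempt,
// given per-interaction success probability p.
function passRate(p: number, n: number): number {
  return Math.pow(p, n);
}

// Probability a run eventually goes green within (retries + 1) attempts.
function greenWithRetries(p: number, n: number, retries: number): number {
  return 1 - Math.pow(1 - passRate(p, n), retries + 1);
}

const p = 0.998;

// Retry: a 50-interaction scenario with 2 retries looks ~99.9 % green,
// but each individual attempt still fails ~9.5 % of the time -- the
// fragility is hidden, not removed, and the worst case runs 3x as long.
console.log(greenWithRetries(p, 50, 2).toFixed(4)); // apparent green rate
console.log((1 - passRate(p, 50)).toFixed(4));      // underlying flake rate, unchanged

// Simplification: halving the scenario to 25 interactions actually cuts
// the underlying flake rate (~9.5 % -> ~4.9 %), with no added runtime.
console.log((1 - passRate(p, 25)).toFixed(4));
```

The contrast is the point: retry improves the dashboard number while leaving the flake rate untouched; simplification improves the flake rate itself.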

What to simplify

  1. Scenario count. Zalando's probe suite at publication: three named scenarios, each mapping to a single CBO. The discipline: one scenario per CBO, not per code path.
  2. Interactions per scenario. Prune steps that don't contribute to the CBO's success criterion. Zalando's catalog-landing test is ~10 lines of code — navigate, open filter, apply filter, wait, click product, assert.
  3. Setup / teardown. Each DB seed, each cookie clear, each API precall is an interaction with its own flake probability. Collapse where possible.
  4. Assertions. Fewer, coarser assertions per scenario beat many fine-grained ones. One end-state assertion beats five intermediate-state assertions.
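The ~10-line shape described in point 2 might look like the following Playwright sketch. Every URL, locator, and test id here is invented for illustration (Zalando's actual probe code is not public); what the sketch shows is the discipline — one CBO, a handful of interactions, one end-state assertion:

```typescript
import { test, expect } from '@playwright/test';

// Hypothetical probe for a catalog-landing CBO. The selectors are
// illustrative; the shape (~5 interactions, 1 assertion) is the point.
test('catalog landing: filter and open a product', async ({ page }) => {
  await page.goto('https://shop.example.com/catalog');          // navigate
  await page.getByRole('button', { name: 'Filter' }).click();   // open filter
  await page.getByLabel('Size').selectOption('M');              // apply filter
  await page.getByTestId('product-card').first().click();       // click product
  await expect(page.getByTestId('product-detail')).toBeVisible(); // assert end state
});
```

Note that the explicit "wait" step from the prose collapses into Playwright's locator auto-wait (see concepts/playwright-locator-auto-wait) rather than appearing as a line of code.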

Relationship to CBO alignment

Simplification implicitly enforces the probe-CBO alignment discipline: if you constrain the probe tier to one scenario per CBO, simplification falls out naturally. The full CBO catalog is usually small (single-digit to low-hundreds); with one probe per CBO, the suite stays bounded.

When simplification does not work

  • When the CBO itself is genuinely multi-step. A checkout journey legitimately requires login → cart → address → payment → confirmation; you can't skip steps. Remediation here is framework-altitude (auto-wait / auto-retry, see concepts/playwright-locator-auto-wait) plus assertion-altitude (expect.toPass for flaky convergence points).
  • When CI coverage needs exhaustiveness. A CI regression suite legitimately needs to exercise many paths; probe-tier simplification doesn't apply there. The CI tier and probe tier have different reliability budgets.

Contrast: the Cypress-era Zalando approach

Before the probe tier, Zalando's response to flaky Cypress tests was harder selectors, better timing heuristics, and a hydration-detection mechanism — all higher-sophistication tooling inside the same broad-coverage scope. That got the suite from ~80 % to ~95 % over multi-year investment. The probe tier's simplification move is a different class of lever: reduce the problem instead of improving the solution.

Seen in

  • sources/2024-07-18-zalando-end-to-end-test-probes-with-playwright (canonical wiki instance). Zalando's probe tier: "By focusing on a smaller number of features and interactions, we were able to reduce the likelihood of false positives." Three named scenarios at publication; each scenario is ~10 lines of code. Combined with Playwright auto-wait and shadow-mode validation, this produced a 0 % false-positive rate after promotion to paging.