CONCEPT Cited by 1 source
Flaky test¶
Definition¶
A flaky test is a test that passes and fails non-deterministically against the same code, producing false positives (fails when the system is working) and false negatives (passes when the system is broken). Flakiness is the chronic failure mode of end-to-end browser-driving tests, where real-network / real-browser / real-async interactions produce a long tail of timing, race, and selector-instability failures that no amount of code-under-test fixing can eliminate.
Why flakiness matters operationally¶
For CI tests, flakiness wastes engineering time — retriggered builds, debugged-and-dismissed failures, loss of trust in the suite. For test probes running against live production, flakiness is much worse: every false positive is either a pager event (concepts/alert-fatigue) or a silently-ignored alert that masks real incidents. The arithmetic:
| Cadence | Reliability | False positives/day |
|---|---|---|
| Per-build CI, 120 builds/day | 80 % | ~24 |
| Per-build CI, 120 builds/day | 95 % | ~6 |
| 30-min cron probe | 95 % | ~2.4 |
| 30-min cron probe | 99 % | ~0.48 |
| 30-min cron probe | 99.9 % | ~0.05 |
Source for CI numbers: sources/2024-07-18-zalando-end-to-end-test-probes-with-playwright. Zalando's Cypress suite ran ≈120 builds/day, started at ~80 % success ("an average of 24 builds a day which were failing as false positives, causing unnecessary friction"), and took a multi-year investment to reach ~95 %. The cron rows assume 48 runs/day (one run every 30 minutes). Pager-grade probes require reliability beyond what CI infrastructure typically achieves.
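Each row of the table is one multiplication: runs per day times the failure share (one minus reliability). The sketch below reproduces that arithmetic under the assumption that the system is healthy, so every failed run counts as a false positive. The function name `falsePositivesPerDay` is illustrative and does not come from the source.

```typescript
// Expected false positives per day, assuming the system under test is healthy
// and every failed run is therefore a false positive.
function falsePositivesPerDay(runsPerDay: number, reliability: number): number {
  if (reliability < 0 || reliability > 1) {
    throw new RangeError("reliability must be a fraction between 0 and 1");
  }
  return runsPerDay * (1 - reliability);
}

const MINUTES_PER_DAY = 24 * 60;
const cronRunsPerDay = MINUTES_PER_DAY / 30; // one probe run every 30 minutes = 48 runs

// [cadence label, runs per day, reliability as a fraction]
const rows: Array<[string, number, number]> = [
  ["Per-build CI", 120, 0.8],
  ["Per-build CI", 120, 0.95],
  ["30-min cron probe", cronRunsPerDay, 0.95],
  ["30-min cron probe", cronRunsPerDay, 0.99],
  ["30-min cron probe", cronRunsPerDay, 0.999],
];

for (const [cadence, runs, reliability] of rows) {
  const perDay = falsePositivesPerDay(runs, reliability).toFixed(2);
  console.log(`${cadence}, ${(reliability * 100).toFixed(1)} %: ~${perDay} false positives/day`);
}
```

Read the other way round, the same formula gives the reliability a probe needs for a target alert budget: at 48 runs/day, staying under one false page every 20 days means 99.9 % or better.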
Named causes (Zalando's empirical list)¶
- Hydration timing under SSR — test scripts execute before the UI is interactive. See concepts/react-hydration. Zalando's Cypress era added a "mechanism to detect when our pages are hydrated with React after server-side rendering, as Cypress would fail eagerly executing test scripts on a non-interactive UI."
- Selector instability — CSS-structure selectors break on visual refactors even when behaviour is unchanged. Remediation: `data-testid` attributes, role-based locators.
- Dynamic content — Zalando's product pages are "highly contextual", sometimes with products not yet released; selectors that assume a specific product / stock state fail on other runs. Remediation: test setup context (seed a known-good product candidate).
- Network / third-party timing — real CDN cache misses, real CMS content loads, real API-gateway response times are non-deterministic. Remediation: auto-wait at framework altitude, `expect.toPass` retries at assertion altitude.
- Non-visible-content assertions — the element is in the DOM but not visible (hidden behind a modal, outside the viewport). Remediation: CSS pseudo-classes like `:visible` (Playwright augments standard CSS with visibility-aware matchers).
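The selector-instability cause can be shown with a toy model. The sketch below is not Playwright code: `ToyNode`, `findByPath`, `findByTestId`, and the `add-to-cart` test id are all hypothetical names, used to contrast a structural lookup with a test-id lookup when a visual refactor wraps a button in an extra element.

```typescript
// A toy DOM node: just enough structure to compare selector strategies.
interface ToyNode {
  tag: string;
  testId?: string; // stands in for a data-testid attribute
  children: ToyNode[];
}

// Structural lookup: follow a fixed chain of child tags from the root,
// the way a CSS-structure selector such as "main > div > button" does.
function findByPath(root: ToyNode, path: string[]): ToyNode | undefined {
  let current: ToyNode = root;
  for (const tag of path) {
    const next = current.children.find((child) => child.tag === tag);
    if (next === undefined) return undefined;
    current = next;
  }
  return current;
}

// Test-id lookup: depth-first search for the id, ignoring page structure.
function findByTestId(root: ToyNode, testId: string): ToyNode | undefined {
  if (root.testId === testId) return root;
  for (const child of root.children) {
    const hit = findByTestId(child, testId);
    if (hit !== undefined) return hit;
  }
  return undefined;
}

const button: ToyNode = { tag: "button", testId: "add-to-cart", children: [] };

const beforeRefactor: ToyNode = {
  tag: "main",
  children: [{ tag: "div", children: [button] }],
};

// A visual refactor wraps the button in a <section>; behaviour is unchanged.
const afterRefactor: ToyNode = {
  tag: "main",
  children: [{ tag: "div", children: [{ tag: "section", children: [button] }] }],
};

const structuralPath = ["div", "button"];
console.log(findByPath(beforeRefactor, structuralPath) !== undefined); // true
console.log(findByPath(afterRefactor, structuralPath) !== undefined); // false
console.log(findByTestId(afterRefactor, "add-to-cart") !== undefined); // true
```

In Playwright itself, the structure-independent lookups would be calls such as `page.getByTestId('add-to-cart')` or `page.getByRole('button', { name: ... })`, and a visibility-aware match would look like `page.locator('button:visible')`.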
Structural remediations¶
Framework-altitude (lowest effort, highest leverage)¶
- Auto-wait — Playwright's `Locator` auto-waits for attached / visible / stable / enabled before every interaction. Removes an entire class of timing bugs without test-level code.
- Auto-retry for web assertions — Playwright retries `expect(locator).toHaveText(...)` until it passes or times out. Covers slow updates.
- Rich tracing — Playwright captures a step-by-step trace with DOM snapshots; makes the "why did this fail exactly once last Tuesday" postmortem tractable.
Test-altitude (targeted retry)¶
- Local retry at assertion level — Playwright's `expect.toPass` wraps a block in a retry loop. Useful for flaky convergence assertions. Zalando adds these during shadow-mode iteration.
- Explicit `waitForLoadState` / `waitForURL` — for navigation boundaries the framework can't auto-wait on.
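The idea behind assertion-level retry is small enough to sketch without a browser. The helper below is a conceptual stand-in, not Playwright's implementation: `retryUntilPass`, `RetryOptions`, and `readStock` are hypothetical names. In a real suite the same effect comes from Playwright's own `expect(async () => { ... }).toPass()`.

```typescript
interface RetryOptions {
  timeoutMs: number; // stop retrying once this much wall-clock time has passed
  intervalMs: number; // pause between attempts
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Re-run `check` until it stops throwing or the deadline passes.
// Always makes at least one attempt; on timeout, rethrows the last failure
// so the real cause stays visible in the test report.
async function retryUntilPass<T>(
  check: () => Promise<T> | T,
  options: RetryOptions,
): Promise<T> {
  const deadline = Date.now() + options.timeoutMs;
  let lastError: unknown;
  do {
    try {
      return await check();
    } catch (error) {
      lastError = error;
    }
    await sleep(options.intervalMs);
  } while (Date.now() < deadline);
  throw lastError;
}

// Usage: a value that only converges on the third read.
let reads = 0;
const readStock = (): number => {
  reads += 1;
  if (reads < 3) throw new Error("stock not loaded yet");
  return 7;
};

retryUntilPass(readStock, { timeoutMs: 1000, intervalMs: 10 }).then((stock) =>
  console.log(`stock=${stock} after ${reads} reads`), // logs "stock=7 after 3 reads"
);
```

A retry loop like this hides slow convergence, and it will just as readily hide a real intermittent bug, which is one reason to keep it local to the few assertions that need it.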
Scope-altitude (highest leverage, discipline-heavy)¶
- Simplification — fewer scenarios, fewer interactions per scenario. Zalando's probe suite at publication: three named CBOs (Critical Business Operations) only. "By focusing on a smaller number of features and interactions, we were able to reduce the likelihood of false positives."
- Shadow-mode gating — new scenarios enter email-only mode, iterate until zero false positives, then promote. See patterns/shadow-mode-alert-before-paging.
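Shadow-mode gating amounts to a small state machine: a scenario's failures go to email until it has proven itself, and only then to the pager. The sketch below is one way to model that. The source gives no promotion threshold, so `CLEAN_RUNS_TO_PROMOTE` is an assumption (kept tiny for the demo), and `ScenarioState`, `routeAlert`, `recordRun`, and the `checkout` scenario are hypothetical names.

```typescript
type AlertMode = "shadow" | "paging";
type Channel = "email" | "pager";

interface ScenarioState {
  name: string;
  mode: AlertMode;
  cleanStreak: number; // consecutive runs without a false positive while in shadow mode
}

// Assumed threshold: the source says to iterate until zero false positives
// but names no number. A real probe would want a far longer streak.
const CLEAN_RUNS_TO_PROMOTE = 3;

// Where a failing run's alert goes: shadow-mode scenarios never page.
function routeAlert(state: ScenarioState): Channel {
  return state.mode === "shadow" ? "email" : "pager";
}

// Record the triaged outcome of one run; promote once the streak is long enough.
function recordRun(state: ScenarioState, wasFalsePositive: boolean): ScenarioState {
  if (state.mode === "paging") return state; // already promoted; nothing to track
  const cleanStreak = wasFalsePositive ? 0 : state.cleanStreak + 1;
  const mode: AlertMode = cleanStreak >= CLEAN_RUNS_TO_PROMOTE ? "paging" : "shadow";
  return { ...state, cleanStreak, mode };
}

// Usage: one false positive resets the streak, then three clean runs promote.
let checkout: ScenarioState = { name: "checkout", mode: "shadow", cleanStreak: 0 };
for (const falsePositive of [false, true, false, false, false]) {
  checkout = recordRun(checkout, falsePositive);
}
console.log(checkout.mode, routeAlert(checkout)); // paging pager
```

Resetting the streak on every false positive is the design choice that matters here: a scenario cannot reach the pager by being right most of the time.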
Relationship to test-tier altitude¶
| Tier | Flakiness tolerance | Remediation |
|---|---|---|
| Unit | ~0 % (deterministic by design) | Fix the code or the test |
| Integration | Low; a few retries OK | Test containers + seed data |
| CI e2e | Medium; 95 %+ achievable via multi-year investment | Retry flaky tests; selector discipline |
| Probe e2e | Must approach 99.9 % — pager-grade | Scope reduction + auto-wait framework + shadow-mode gating |
Seen in¶
- sources/2024-07-18-zalando-end-to-end-test-probes-with-playwright — canonical wiki instance of flakiness-as-pager-grade constraint. Zalando's full remediation arithmetic: 80 % → 95 % Cypress reliability required multi-year investment, but 95 % × 30-min cron = too many false positives for paging. Playwright's auto-wait / auto-retry primitives replaced the hand-rolled hydration-detection kludge, and shadow-mode iteration on selectors + `expect.toPass` retries eliminated remaining false positives over a few weeks. Post-promotion: 0 % false-positive rate (only pager firing was on a real incident).
Related¶
- concepts/end-to-end-test-probe — the primitive whose flakiness is the load-bearing constraint.
- concepts/test-reliability-through-simplification — primary scope-altitude lever.
- concepts/alert-fatigue — the consequence of under-remediated flakiness.
- concepts/playwright-locator-auto-wait — framework-altitude lever.
- concepts/react-hydration — one of the Zalando-named cause classes.
- systems/playwright — Zalando's chosen flakiness-resilient framework.
- systems/cypress — the pre-existing framework whose flakiness drove the Playwright adoption.
- patterns/e2e-test-as-synthetic-probe — the pattern that makes flakiness a pager problem, not a CI problem.
- patterns/shadow-mode-alert-before-paging — the flakiness-validation gate.