
CONCEPT Cited by 1 source

Flaky test

Definition

A flaky test is a test that passes and fails non-deterministically against the same code — producing false positives (fails when the system is working) and false negatives (passes when the system is broken). Flakiness is the chronic failure mode of end-to-end browser-driving tests, where real-network / real-browser / real-async interactions produce a long tail of timing, race, and selector-instability failures that no amount of fixing the code under test can eliminate.

Why flakiness matters operationally

For CI tests, flakiness wastes engineering time — retriggered builds, debugged-and-dismissed failures, loss of trust in the suite. For test probes running against live production, flakiness is much worse: every false positive is either a pager event (concepts/alert-fatigue) or a silently-ignored alert that masks real incidents. The arithmetic:

Cadence                          Reliability   False positives/day
Per-build CI, 120 builds/day     80 %          ~24
Per-build CI, 120 builds/day     95 %          ~6
30-min cron probe (48 runs/day)  95 %          ~2.4
30-min cron probe (48 runs/day)  99 %          ~0.48
30-min cron probe (48 runs/day)  99.9 %        ~0.05

Source for CI numbers: sources/2024-07-18-zalando-end-to-end-test-probes-with-playwright — Zalando's Cypress suite ran ≈120 builds/day and started at ~80 % success ("an average of 24 builds a day which were failing as false positives, causing unnecessary friction"); reaching ~95 % took a multi-year investment. Pager-grade probes require reliability beyond anything CI infrastructure typically sees.
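The table's arithmetic is simply runs/day × (1 − reliability). A throwaway sketch (function and constant names are ours, not from the source):

```typescript
// Expected false positives per day for a test suite or probe:
// runs/day × (1 − reliability). Mirrors the table above.
function falsePositivesPerDay(runsPerDay: number, reliability: number): number {
  return runsPerDay * (1 - reliability);
}

const cronRunsPerDay = 48; // a 30-minute cron fires 48 times a day

console.log(falsePositivesPerDay(120, 0.80));          // per-build CI at 80 %: ≈24
console.log(falsePositivesPerDay(cronRunsPerDay, 0.95));  // ≈2.4
console.log(falsePositivesPerDay(cronRunsPerDay, 0.999)); // ≈0.05
```

The asymmetry is the point: a probe fires far less often than per-build CI, yet still needs roughly an order of magnitude more reliability to stay pager-grade.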

Named causes (Zalando's empirical list)

  1. Hydration timing under SSR — test scripts execute before the UI is interactive. See concepts/react-hydration. Zalando's Cypress era added a "mechanism to detect when our pages are hydrated with React after server-side rendering, as Cypress would fail eagerly executing test scripts on a non-interactive UI."
  2. Selector instability — CSS-structure selectors break on visual refactors even when behaviour is unchanged. Remediation: data-testid attributes, role-based locators.
  3. Dynamic content — Zalando's product pages are "highly contextual", sometimes with products not yet released; selectors that assume a specific product / stock state fail on other runs. Remediation: test setup context (seed a known-good product candidate).
  4. Network / third-party timing — real CDN cache misses, real CMS content loads, real API-gateway response times are non-deterministic. Remediation: auto-wait at framework altitude, expect.toPass retries at assertion altitude.
  5. Non-visible-content assertions — the element is in the DOM but not visible (hidden behind a modal, outside the viewport). Remediation: CSS pseudo-classes like :visible (Playwright augments standard CSS with visibility-aware matchers).

Structural remediations

Framework-altitude (lowest effort, highest leverage)

  • Auto-wait — Playwright's Locator auto-waits for attached / visible / stable / enabled before every interaction. Removes an entire class of timing bugs without test-level code.
  • Auto-retry for web assertions — Playwright retries expect(locator).toHaveText(...) until it passes or times out. Covers slow updates.
  • Rich tracing — Playwright captures a step-by-step trace with DOM snapshots; makes the "why did this fail exactly once last Tuesday" postmortem tractable.
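The "stable" half of auto-wait can be pictured as: an element counts as stable once its bounding box stops changing between consecutive frames. This is a simulation over a recorded series of positions, not Playwright's implementation:

```typescript
// Simulation of a stability check: walk a recorded series of x-positions
// and report the first frame whose position matches the previous one.
function framesUntilStable(xPositions: number[]): number {
  for (let frame = 1; frame < xPositions.length; frame++) {
    if (xPositions[frame] === xPositions[frame - 1]) return frame; // two equal frames
  }
  throw new Error("element never stopped moving");
}

// A slide-in banner settling at x = 0 after three frames of animation.
console.log(framesUntilStable([120, 60, 20, 0, 0, 0])); // stable at frame 4
```

A test that clicks at frame 1 hits a moving target; auto-wait defers the click until the element has settled, which is exactly the class of timing bug it removes.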

Test-altitude (targeted retry)

  • Local retry at assertion level — Playwright's expect.toPass wraps a block in a retry loop. Useful for flaky convergence assertions. Zalando adds these during shadow-mode iteration.
  • Explicit waitForLoadState / waitForURL — for navigation boundaries the framework can't auto-wait on.
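What expect.toPass does can be sketched as: re-run an assertion block until it stops throwing or attempts run out. The real API is async and timeout-based; this synchronous stand-in (retryUntilPasses is a made-up name) only shows the semantics:

```typescript
// Re-run an assertion block until it passes or attempts are exhausted,
// surfacing the final failure — the shape of expect(...).toPass behaviour.
function retryUntilPasses(assertion: () => void, maxAttempts: number): number {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      assertion();
      return attempt; // converged
    } catch (err) {
      lastError = err; // flaky failure: try again
    }
  }
  throw lastError; // out of attempts: report the last failure
}

// A convergence assertion that only holds from the third attempt on.
let observedCount = 0;
const eventuallyThree = () => {
  observedCount++;
  if (observedCount < 3) throw new Error(`saw ${observedCount}, want 3`);
};
console.log(retryUntilPasses(eventuallyThree, 5)); // passes on attempt 3
```

The design point: retry lives at the assertion, not the whole test, so a genuine failure elsewhere in the scenario still fails fast.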

Scope-altitude (highest leverage, discipline-heavy)

  • Simplification — fewer scenarios, fewer interactions per scenario. Zalando's probe suite at publication: three named CBOs only. "By focusing on a smaller number of features and interactions, we were able to reduce the likelihood of false positives."
  • Shadow-mode gating — new scenarios enter email-only mode, iterate until zero false positives, then promote. See patterns/shadow-mode-alert-before-paging.
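The promotion rule can be sketched as a simple gate: a scenario moves from email-only to pager-grade only after a streak of runs with zero false positives. The threshold and names are illustrative, not Zalando's:

```typescript
// Hypothetical shadow-mode promotion gate: promote to paging only after
// requiredStreak consecutive runs with no false positive.
type Mode = "shadow" | "paging";

function promoteWhenClean(runsWereFalsePositive: boolean[], requiredStreak: number): Mode {
  let streak = 0;
  for (const wasFalsePositive of runsWereFalsePositive) {
    streak = wasFalsePositive ? 0 : streak + 1; // any false positive resets the streak
    if (streak >= requiredStreak) return "paging";
  }
  return "shadow"; // keep iterating in email-only mode
}

console.log(promoteWhenClean([true, false, false, false], 3)); // "paging"
console.log(promoteWhenClean([false, true, false], 3));        // "shadow"
```

Resetting the streak on every false positive is what makes the gate conservative: a scenario earns paging rights only by a sustained clean record, not by a lucky average.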

Relationship to test-tier altitude

Tier         Flakiness tolerance                                    Remediation
Unit         ~0 % (deterministic by design)                         Fix the code or the test
Integration  Low; a few retries OK                                  Test containers + seed data
CI e2e       Medium; 95 %+ achievable via multi-year investment     Retry flaky tests; selector discipline
Probe e2e    Must approach 99.9 % — pager-grade                     Scope reduction + auto-wait framework + shadow-mode gating

Seen in

  • sources/2024-07-18-zalando-end-to-end-test-probes-with-playwright — canonical wiki instance of flakiness as a pager-grade constraint. Zalando's full remediation arithmetic: 80 % → 95 % Cypress reliability required multi-year investment, yet 95 % on a 30-min cron still produced too many false positives for paging. Playwright's auto-wait / auto-retry primitives replaced the hand-rolled hydration-detection kludge, and shadow-mode iteration on selectors plus expect.toPass retries eliminated the remaining false positives over a few weeks. Post-promotion: 0 % false-positive rate (the only pager firing was for a real incident).