
CONCEPT Cited by 1 source

Flaky test

Definition

A flaky test is a test that passes and fails non-deterministically against the same code, producing false positives (fails when the system is working) and false negatives (passes when the system is broken). Flakiness is the chronic failure mode of end-to-end browser-driving tests, where real-network / real-browser / real-async interactions produce a long tail of timing, race, and selector-instability failures that no amount of fixing the code under test can eliminate.

Why flakiness matters operationally

For CI tests, flakiness wastes engineering time — retriggered builds, debugged-and-dismissed failures, loss of trust in the suite. For test probes running against live production, flakiness is much worse: every false positive is either a pager event (concepts/alert-fatigue) or a silently-ignored alert that masks real incidents. The arithmetic:

Cadence | Reliability | False positives/day
Per-build CI, 120 builds/day | 80 % | ~24
Per-build CI, 120 builds/day | 95 % | ~6
30-min cron probe | 95 % | ~2.4
30-min cron probe | 99 % | ~0.48
30-min cron probe | 99.9 % | ~0.05

Source for CI numbers: Zalando's Cypress suite ran ≈120 builds/day, started at ~80 % success ("an average of 24 builds a day which were failing as false positives, causing unnecessary friction"), and took a multi-year investment to reach ~95 %. Pager-grade probes require reliability beyond what CI infrastructure typically achieves.
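
The table's arithmetic can be reproduced with a short sketch. It assumes every failing run is a false positive (that is, the system under test is healthy), and the function names are illustrative rather than taken from any source.

```typescript
// Expected false positives per day = runs per day * (1 - reliability).
// Assumption: the system under test is healthy, so every failure is a false positive.
function falsePositivesPerDay(runsPerDay: number, reliability: number): number {
  return runsPerDay * (1 - reliability);
}

// A cron probe that fires every `intervalMinutes` runs (24 * 60) / intervalMinutes times a day.
function cronRunsPerDay(intervalMinutes: number): number {
  return (24 * 60) / intervalMinutes;
}

// Rows mirror the table: [cadence label, runs per day, reliability].
const rows: Array<[string, number, number]> = [
  ["Per-build CI, 120 builds/day", 120, 0.8],
  ["Per-build CI, 120 builds/day", 120, 0.95],
  ["30-min cron probe", cronRunsPerDay(30), 0.95],
  ["30-min cron probe", cronRunsPerDay(30), 0.99],
  ["30-min cron probe", cronRunsPerDay(30), 0.999],
];

for (const [label, runs, reliability] of rows) {
  const perDay = falsePositivesPerDay(runs, reliability);
  console.log(`${label} at ${(reliability * 100).toFixed(1)} %: ~${perDay.toFixed(2)} false positives/day`);
}
```

The 30-minute probe runs only 48 times a day, so it tolerates a lower absolute count than CI, but each of its false positives is a page rather than a retriggered build.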

Named causes (Zalando's empirical list)

  1. Hydration timing under SSR — test scripts execute before the UI is interactive. See concepts/react-hydration. Zalando's Cypress era added a "mechanism to detect when our pages are hydrated with React after server-side rendering, as Cypress would fail eagerly executing test scripts on a non-interactive UI."
  2. Selector instability — CSS-structure selectors break on visual refactors even when behaviour is unchanged. Remediation: data-testid attributes, role-based locators.
  3. Dynamic content — Zalando's product pages are "highly contextual", sometimes with products not yet released; selectors that assume a specific product / stock state fail on other runs. Remediation: test setup context (seed a known-good product candidate).
  4. Network / third-party timing — real CDN cache misses, real CMS content loads, real API-gateway response times are non-deterministic. Remediation: auto-wait at framework altitude, expect.toPass retries at assertion altitude.
  5. Non-visible-content assertions — the element is in the DOM but not visible (hidden behind a modal, outside the viewport). Remediation: CSS pseudo-classes like :visible (Playwright augments standard CSS with visibility-aware matchers).

Structural remediations

Framework-altitude (lowest effort, highest leverage)

  • Auto-wait — Playwright's Locator auto-waits for attached / visible / stable / enabled before every interaction. Removes an entire class of timing bugs without test-level code.
  • Auto-retry for web assertions — Playwright retries expect(locator).toHaveText(...) until it passes or times out. Covers slow updates.
  • Rich tracing — Playwright captures a step-by-step trace with DOM snapshots; makes the "why did this fail exactly once last Tuesday" postmortem tractable.

Test-altitude (targeted retry)

  • Local retry at assertion level: Playwright's expect.toPass wraps a block in a retry loop. Useful for flaky convergence assertions. Zalando adds these during shadow-mode iteration.
  • Explicit waitForLoadState / waitForURL — for navigation boundaries the framework can't auto-wait on.
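
To make the retry-loop idea concrete, here is a minimal hand-rolled sketch of what an assertion-level retry does. It is not Playwright's implementation; `retryUntilPasses` and its options are hypothetical names. In a real Playwright test the equivalent would be written as `await expect(async () => { ... }).toPass({ timeout })`.

```typescript
// Hypothetical helper (not Playwright's code): re-run `assertion` until it
// stops throwing, or rethrow the last failure once the time budget is spent.
async function retryUntilPasses(
  assertion: () => void | Promise<void>,
  options: { timeoutMs: number; intervalMs: number },
): Promise<number> {
  const deadline = Date.now() + options.timeoutMs;
  let attempts = 0;
  for (;;) {
    attempts += 1;
    try {
      await assertion();
      return attempts; // the assertion converged; report how many tries it took
    } catch (error) {
      // Give up if another sleep would overrun the deadline.
      if (Date.now() + options.intervalMs > deadline) {
        throw error;
      }
      await new Promise<void>((resolve) => setTimeout(resolve, options.intervalMs));
    }
  }
}
```

The point of the design is that the timeout bounds the total wait, so a genuinely broken page still fails the test instead of retrying forever.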

Scope-altitude (highest leverage, discipline-heavy)

  • Simplification — fewer scenarios, fewer interactions per scenario. Zalando's probe suite at publication: three named CBOs only. "By focusing on a smaller number of features and interactions, we were able to reduce the likelihood of false positives."
  • Shadow-mode gating — new scenarios enter email-only mode, iterate until zero false positives, then promote. See patterns/shadow-mode-alert-before-paging.
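
The shadow-mode gate can be sketched as a small routing rule. This is an illustration under stated assumptions, not Zalando's implementation: the `ScenarioHistory` shape, the `alertChannelFor` name, and the "N consecutive clean runs" threshold are all hypothetical stand-ins for "iterate until zero false positives, then promote".

```typescript
type AlertChannel = "email" | "pager";

// Hypothetical record of a scenario's shadow-mode runs, oldest first.
// `true` means that run raised an alert that turned out to be a false positive.
interface ScenarioHistory {
  name: string;
  falsePositiveRuns: boolean[];
}

// Hypothetical promotion rule: a scenario pages only after its most recent
// `requiredCleanRuns` runs contain no false positives; until then it stays
// in email-only shadow mode.
function alertChannelFor(history: ScenarioHistory, requiredCleanRuns: number): AlertChannel {
  if (!Number.isInteger(requiredCleanRuns) || requiredCleanRuns < 1) {
    throw new RangeError("requiredCleanRuns must be a positive integer");
  }
  const recent = history.falsePositiveRuns.slice(-requiredCleanRuns);
  const enoughRuns = recent.length === requiredCleanRuns;
  const allClean = recent.every((wasFalsePositive) => !wasFalsePositive);
  return enoughRuns && allClean ? "pager" : "email";
}
```

A new scenario with too little history stays on email, and so does one with a recent false positive, so a scenario cannot page anyone until it has a clean track record.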

Relationship to test-tier altitude

Tier | Flakiness tolerance | Remediation
Unit | ~0 % (deterministic by design) | Fix the code or the test
Integration | Low; a few retries OK | Test containers + seed data
CI e2e | Medium; 95 %+ achievable via multi-year investment | Retry flaky tests; selector discipline
Probe e2e | Must approach 99.9 % (pager-grade) | Scope reduction + auto-wait framework + shadow-mode gating

Flaky-test triage as KTLO automation target

A second, distinct framing of flakiness arrived with Atlassian's 2026-06-01 Jira-team post: flaky-test triage and fix is one of the canonical KTLO engineering chores that maps cleanly onto agentic automation. The argument:

  • Each flaky test takes a human ~2 hours (inspect the CI failure, reproduce it locally or under CI conditions, classify it as a test bug or a product bug, prepare a fix). At ~1 flaky test per working day, that's ≈10 h/week of drag.
  • The remediation is pattern-recognised — over years a team has seen the same root causes repeatedly (async timing, mocks, fake timers, browser-automation race, page state, snapshot drift) and knows the fix patterns.
  • Encoded into per-test-category specialist skills (patterns/test-category-classifier-then-specialist-skill), the agent does triage + diagnosis + draft PR; engineers review and merge.

"Our team previously spent two hours resolving a flaky test. We encountered roughly one flaky test per day, sometimes more. [...] Now that we've implemented agentic workflows with Jira, we save roughly one engineering week every month, which means we've reduced eng hours spent on flaky tests by up to 80%." (Source: sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in)

The agent's reproduction discipline is the CPU-throttled-loop Atlassian describes: "our agents can run the failing test repeatedly under slower or CPU-throttled conditions to mimic CI condition as closely as possible." This addresses the canonical laptop ≠ CI environment failure mode — without throttled-loop reproduction, the agent's "green on first run" signal is too weak to gate a fix.
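
Why a single green run is too weak a signal can be shown with a repeated-run gate. This is a sketch under assumptions: the source does not describe how Atlassian applies CPU throttling, so throttling is left outside the code, and `runRepeatedly`, `fixIsTrustworthy`, and the run-count threshold are hypothetical names and choices.

```typescript
interface ReproductionResult {
  runs: number;
  failures: number;
  failureRate: number;
}

// Run the test body `runs` times and count how often it throws.
// Assumption: CPU throttling or other slow-down is applied by the environment
// that invokes this loop; it is not modelled here.
function runRepeatedly(test: () => void, runs: number): ReproductionResult {
  let failures = 0;
  for (let i = 0; i < runs; i += 1) {
    try {
      test();
    } catch {
      failures += 1;
    }
  }
  return { runs, failures, failureRate: runs === 0 ? 0 : failures / runs };
}

// Hypothetical gate: accept a fix only after at least `minRuns` executions
// with zero failures, instead of trusting one green run.
function fixIsTrustworthy(result: ReproductionResult, minRuns: number): boolean {
  return result.runs >= minRuns && result.failures === 0;
}
```

A test that fails one run in five passes any single run 80 % of the time, so only a loop of many runs separates a real fix from luck.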

This second framing does not replace the Zalando flakiness-as-pager-grade-constraint framing — they sit at different altitudes. Zalando's framing is about which tests are even worth running at pager altitude (probe-grade scope reduction, auto-wait framework, shadow-mode gating). Atlassian's framing is about how to remediate the long tail of CI flakes that survive the framework defences. They compose: the framework prevents flakes from entering the suite; the agent processes the residual flakes that get through.

Seen in

  • canonical wiki instance of flakiness-as-pager-grade constraint. Zalando's full remediation arithmetic: moving Cypress reliability from 80 % to 95 % required multi-year investment, but 95 % on a 30-min cron still means ~2.4 false positives/day, too many for paging. Playwright's auto-wait / auto-retry primitives replaced the hand-rolled hydration-detection kludge, and shadow-mode iteration on selectors + expect.toPass retries eliminated remaining false positives over a few weeks. Post-promotion: 0 % false-positive rate (the only pager firing was on a real incident).

  • sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in: canonical wiki instance of flaky-test triage as agentic KTLO target. Atlassian's Jira team applied patterns/test-category-classifier-then-specialist-skill to the long tail of CI flakes: a unit / integration / visual-regression classifier dispatches to a category specialist skill bundling failure-mode taxonomy + fix patterns + CPU-throttled reproduction instructions. ~80% reduction in flaky-test eng hours; ~1 engineering week saved per month. Composes with the wider work-item-as-agent-prompt + Jira status-transition trigger substrate (patterns/jira-status-transition-triggers-agent-workflow).
