CONCEPT Cited by 1 source
Flaky test¶
Definition¶
A flaky test is a test that passes and fails non-deterministically against the same code, producing false positives (fails when the system is working) and false negatives (passes when the system is broken). Flakiness is the chronic failure mode of end-to-end browser-driving tests, where real-network / real-browser / real-async interactions produce a long tail of timing, race, and selector-instability failures that no amount of fixing the code under test can eliminate.
Why flakiness matters operationally¶
For CI tests, flakiness wastes engineering time — retriggered builds, debugged-and-dismissed failures, loss of trust in the suite. For test probes running against live production, flakiness is much worse: every false positive is either a pager event (concepts/alert-fatigue) or a silently-ignored alert that masks real incidents. The arithmetic:
| Cadence | Reliability | False positives/day |
|---|---|---|
| Per-build CI, 120 builds/day | 80 % | ~24 |
| Per-build CI, 120 builds/day | 95 % | ~6 |
| 30-min cron probe | 95 % | ~2.4 |
| 30-min cron probe | 99 % | ~0.48 |
| 30-min cron probe | 99.9 % | ~0.05 |
Source for CI numbers: Zalando's Cypress suite ran ≈120 builds/day and started at ~80 % success ("an average of 24 builds a day which were failing as false positives, causing unnecessary friction"); reaching ~95 % took a multi-year investment. The probe rows assume 48 runs/day (one run every 30 minutes). Pager-grade probes require reliability beyond what CI infrastructure typically achieves.
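The table rows follow from one multiplication: runs per day times the per-run failure probability of a healthy system. A minimal sketch of that arithmetic follows; the function and constant names are illustrative, not taken from either source.

```typescript
// Expected false positives per day for a check that runs `runsPerDay` times
// and passes with probability `reliability` when the system is healthy.
function falsePositivesPerDay(runsPerDay: number, reliability: number): number {
  return runsPerDay * (1 - reliability);
}

const CI_BUILDS_PER_DAY = 120;             // Zalando's reported build volume
const PROBE_RUNS_PER_DAY = (24 * 60) / 30; // 30-min cron => 48 runs/day

const rows: Array<[string, number, number]> = [
  ["Per-build CI", CI_BUILDS_PER_DAY, 0.8],
  ["Per-build CI", CI_BUILDS_PER_DAY, 0.95],
  ["30-min cron probe", PROBE_RUNS_PER_DAY, 0.95],
  ["30-min cron probe", PROBE_RUNS_PER_DAY, 0.99],
  ["30-min cron probe", PROBE_RUNS_PER_DAY, 0.999],
];

rows.forEach(([label, runs, reliability]) => {
  const perDay = falsePositivesPerDay(runs, reliability);
  console.log(`${label} @ ${(reliability * 100).toFixed(1)} %: ~${perDay.toFixed(2)}/day`);
});
```

Read the other way round, ~0.05 false positives/day at 99.9 % is roughly one spurious page every three weeks, which is why the probe tier needs reliability an order of magnitude beyond the CI tier.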
Named causes (Zalando's empirical list)¶
- Hydration timing under SSR: test scripts execute before the UI is interactive. See concepts/react-hydration. Zalando's Cypress era added a "mechanism to detect when our pages are hydrated with React after server-side rendering, as Cypress would fail eagerly executing test scripts on a non-interactive UI."
- Selector instability: CSS-structure selectors break on visual refactors even when behaviour is unchanged. Remediation: `data-testid` attributes, role-based locators.
- Dynamic content: Zalando's product pages are "highly contextual", sometimes with products not yet released; selectors that assume a specific product / stock state fail on other runs. Remediation: test setup context (seed a known-good product candidate).
- Network / third-party timing: real CDN cache misses, real CMS content loads, real API-gateway response times are non-deterministic. Remediation: auto-wait at framework altitude, `expect.toPass` retries at assertion altitude.
- Non-visible-content assertions: the element is in the DOM but not visible (hidden behind a modal, outside the viewport). Remediation: CSS pseudo-classes like `:visible` (Playwright augments standard CSS with visibility-aware matchers).
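The selector-instability cause can be illustrated without a browser. The sketch below uses a toy node tree (the `UiNode` type and both lookup helpers are hypothetical stand-ins, not a DOM or Playwright API) to show a structure-based selector breaking after a purely visual refactor while a test-id lookup keeps working. In Playwright itself the stable counterparts would be locators such as `page.getByTestId(...)` and `page.getByRole(...)`.

```typescript
// Minimal stand-in for a DOM node; not a real browser API.
interface UiNode {
  tag: string;
  testId?: string;
  children: UiNode[];
}

// Structure-based lookup: follows child indexes, like "main > :nth-child(2)".
function findByPath(root: UiNode, path: number[]): UiNode | undefined {
  let node: UiNode | undefined = root;
  for (const index of path) {
    node = node?.children[index];
  }
  return node;
}

// Attribute-based lookup: depth-first search for a stable test id.
function findByTestId(root: UiNode, testId: string): UiNode | undefined {
  if (root.testId === testId) return root;
  for (const child of root.children) {
    const hit = findByTestId(child, testId);
    if (hit) return hit;
  }
  return undefined;
}

const button: UiNode = { tag: "button", testId: "add-to-cart", children: [] };
const before: UiNode = {
  tag: "main",
  children: [{ tag: "h1", children: [] }, button],
};
// Visual refactor: the button is wrapped in a layout <div>; behaviour is unchanged.
const after: UiNode = {
  tag: "main",
  children: [{ tag: "h1", children: [] }, { tag: "div", children: [button] }],
};

const structuralPath = [1]; // "second child of <main>"
console.log(findByPath(before, structuralPath)?.tag); // button
console.log(findByPath(after, structuralPath)?.tag);  // div (the selector now hits the wrapper)
console.log(findByTestId(after, "add-to-cart")?.tag); // button
```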
Structural remediations¶
Framework-altitude (lowest effort, highest leverage)¶
- Auto-wait: Playwright's `Locator` auto-waits for attached / visible / stable / enabled before every interaction. Removes an entire class of timing bugs without test-level code.
- Auto-retry for web assertions: Playwright retries `expect(locator).toHaveText(...)` until it passes or times out. Covers slow updates.
- Rich tracing: Playwright captures a step-by-step trace with DOM snapshots; makes the "why did this fail exactly once last Tuesday" postmortem tractable.
Test-altitude (targeted retry)¶
- Local retry at assertion level: Playwright's `expect.toPass` wraps a block in a retry loop. Useful for flaky convergence assertions. Zalando adds these during shadow-mode iteration.
- Explicit `waitForLoadState` / `waitForURL`: for navigation boundaries the framework can't auto-wait on.
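To make the assertion-level retry concrete, here is a dependency-free sketch of what a "retry the block until it stops throwing" helper does. It is an illustration only, not Playwright's implementation; `retryUntilPass`, `RetryOptions`, and `readCartCount` are hypothetical names. In a real Playwright test the equivalent call is `await expect(async () => { ... }).toPass({ timeout, intervals })`.

```typescript
interface RetryOptions {
  timeoutMs: number;
  intervalMs: number;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Re-runs `block` until it stops throwing or the time budget is used up.
// Resolves with the number of attempts; rejects with the last error on timeout.
async function retryUntilPass(
  block: () => Promise<void> | void,
  { timeoutMs, intervalMs }: RetryOptions,
): Promise<number> {
  const deadline = Date.now() + timeoutMs;
  let attempts = 0;
  for (;;) {
    attempts += 1;
    try {
      await block();
      return attempts;
    } catch (error) {
      // Give up if another wait would overshoot the deadline.
      if (Date.now() + intervalMs > deadline) throw error;
      await sleep(intervalMs);
    }
  }
}

// Usage: a value that only converges on the third read.
let reads = 0;
const readCartCount = (): number => (++reads >= 3 ? 1 : 0);

retryUntilPass(
  () => {
    if (readCartCount() !== 1) throw new Error("cart not updated yet");
  },
  { timeoutMs: 2000, intervalMs: 10 },
).then((attempts) => console.log(`passed after ${attempts} attempts`));
```

The retry is scoped to one block, so a genuine failure still surfaces once the timeout is exhausted instead of being hidden by a whole-test rerun.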
Scope-altitude (highest leverage, discipline-heavy)¶
- Simplification — fewer scenarios, fewer interactions per scenario. Zalando's probe suite at publication: three named CBOs only. "By focusing on a smaller number of features and interactions, we were able to reduce the likelihood of false positives."
- Shadow-mode gating — new scenarios enter email-only mode, iterate until zero false positives, then promote. See patterns/shadow-mode-alert-before-paging.
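The shadow-mode gate can be sketched as a small routing rule. Everything below is an assumption made for illustration: the `ProbeRun` shape, the `alertChannel` function, and the idea of a fixed window of recent runs are hypothetical, since the source only states that a scenario stays in email-only mode until it shows zero false positives.

```typescript
type AlertChannel = "email" | "pager";

interface ProbeRun {
  failed: boolean;
  realIncident: boolean; // set by a human after reviewing the failure
}

// Hypothetical promotion rule: stay on email until the most recent
// `requiredCleanRuns` runs contain no false positive.
function alertChannel(history: ProbeRun[], requiredCleanRuns: number): AlertChannel {
  if (history.length < requiredCleanRuns) return "email";
  const recent = history.slice(-requiredCleanRuns);
  const hasFalsePositive = recent.some((run) => run.failed && !run.realIncident);
  return hasFalsePositive ? "email" : "pager";
}

const pass: ProbeRun = { failed: false, realIncident: false };
const falsePositive: ProbeRun = { failed: true, realIncident: false };
const truePositive: ProbeRun = { failed: true, realIncident: true };

console.log(alertChannel([pass, falsePositive, pass], 3)); // email
console.log(alertChannel([falsePositive, pass, truePositive, pass], 3)); // pager
```

A failure caused by a real incident does not count against promotion, because the gate measures false positives rather than failures.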
Relationship to test-tier altitude¶
| Tier | Flakiness tolerance | Remediation |
|---|---|---|
| Unit | ~0 % (deterministic by design) | Fix the code or the test |
| Integration | Low; a few retries OK | Test containers + seed data |
| CI e2e | Medium; 95 %+ achievable via multi-year investment | Retry flaky tests; selector discipline |
| Probe e2e | Must approach 99.9 % — pager-grade | Scope reduction + auto-wait framework + shadow-mode gating |
Flaky-test triage as KTLO automation target¶
A second, distinct framing of flakiness arrived with Atlassian's 2026-06-01 Jira-team post: flaky-test triage and fix is one of the canonical KTLO engineering chores that maps cleanly onto agentic automation. The argument:
- Each flaky test takes a human ~2 hours (inspect CI failure, reproduce locally / under CI conditions, classify test/product bug, prepare fix). At ~1 flaky test/day, that's ≈10 h/week of drag.
- The remediation is pattern-recognised — over years a team has seen the same root causes repeatedly (async timing, mocks, fake timers, browser-automation race, page state, snapshot drift) and knows the fix patterns.
- Encoded into per-test-category specialist skills (patterns/test-category-classifier-then-specialist-skill), the agent does triage + diagnosis + draft PR; engineers review and merge.
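The classifier-then-specialist dispatch can be pictured as a lookup from test category to a bundle of instructions. The sketch below is hypothetical throughout: the file-name heuristics in `classify`, the `SpecialistSkill` shape, and the assignment of failure modes to categories are assumptions for illustration, not Atlassian's implementation. Only the three category names and the failure-mode vocabulary come from this page.

```typescript
type TestCategory = "unit" | "integration" | "visual-regression";

interface SpecialistSkill {
  failureModes: string[]; // failure-mode taxonomy the agent checks first
  reproduction: string;   // how the agent should re-run the failing test
}

// Hypothetical file-name classifier; a real one would likely use more signals.
function classify(testPath: string): TestCategory {
  if (/\.visual\./.test(testPath)) return "visual-regression";
  if (/\.integration\./.test(testPath)) return "integration";
  return "unit";
}

// Assumed grouping of the named root causes by category.
const skills: Record<TestCategory, SpecialistSkill> = {
  unit: {
    failureModes: ["async timing", "mocks", "fake timers"],
    reproduction: "run the test in a loop under CPU throttling",
  },
  integration: {
    failureModes: ["browser-automation race", "page state"],
    reproduction: "run the test in a loop under CPU throttling",
  },
  "visual-regression": {
    failureModes: ["snapshot drift"],
    reproduction: "re-render and diff against the stored snapshot",
  },
};

function dispatch(testPath: string): SpecialistSkill {
  return skills[classify(testPath)];
}

console.log(classify("cart/checkout.integration.test.ts")); // integration
console.log(dispatch("cart/total.test.ts").failureModes);
```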
"Our team previously spent two hours resolving a flaky test. We encountered roughly one flaky test per day, sometimes more. [...] Now that we've implemented agentic workflows with Jira, we save roughly one engineering week every month, which means we've reduced eng hours spent on flaky tests by up to 80%." (Source: sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in)
The agent's reproduction discipline is the CPU-throttled-loop Atlassian describes: "our agents can run the failing test repeatedly under slower or CPU-throttled conditions to mimic CI condition as closely as possible." This addresses the canonical laptop ≠ CI environment failure mode — without throttled-loop reproduction, the agent's "green on first run" signal is too weak to gate a fix.
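A short calculation shows why a single green run is weak evidence. The sketch assumes each run fails independently with a fixed probability, which is a simplification (real flakes often correlate with load); the function names are illustrative.

```typescript
// Probability that a test failing with probability `failureRate` per run
// passes `runs` times in a row, assuming independent runs.
function probabilityAllGreen(failureRate: number, runs: number): number {
  return Math.pow(1 - failureRate, runs);
}

// Smallest number of runs after which a flake with the given failure rate
// would have shown up at least once with probability >= `confidence`.
function runsNeeded(failureRate: number, confidence: number): number {
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - failureRate));
}

// A test that flakes 5 % of the time still passes a single run 95 % of the time.
console.log(probabilityAllGreen(0.05, 1).toFixed(2)); // 0.95
// Seeing that flake at least once with 95 % confidence takes dozens of runs.
console.log(runsNeeded(0.05, 0.95)); // 59
```

Throttling the CPU is a way to raise the per-run failure rate during reproduction, which shrinks the number of loop iterations needed before a fix can be trusted.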
This second framing does not replace the Zalando flakiness-as-pager-grade-constraint framing — they sit at different altitudes. Zalando's framing is about which tests are even worth running at pager altitude (probe-grade scope reduction, auto-wait framework, shadow-mode gating). Atlassian's framing is about how to remediate the long tail of CI flakes that survive the framework defences. They compose: the framework prevents flakes from entering the suite; the agent processes the residual flakes that get through.
Seen in¶
- Canonical wiki instance of flakiness-as-pager-grade constraint. Zalando's full remediation arithmetic: raising Cypress reliability from 80 % to 95 % required multi-year investment, but 95 % on a 30-min cron still yields ~2.4 false positives/day, too many for paging. Playwright's auto-wait / auto-retry primitives replaced the hand-rolled hydration-detection kludge, and shadow-mode iteration on selectors plus `expect.toPass` retries eliminated the remaining false positives over a few weeks. Post-promotion: 0 % false-positive rate (the only pager firing was on a real incident).
- sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in: canonical wiki instance of flaky-test triage as agentic KTLO target. Atlassian's Jira team applied patterns/test-category-classifier-then-specialist-skill to the long tail of CI flakes: a unit / integration / visual-regression classifier dispatches to a category specialist skill bundling failure-mode taxonomy + fix patterns + CPU-throttled reproduction instructions. ~80 % reduction in flaky-test eng hours; ~1 engineering week saved per month. Composes with the wider work-item-as-agent-prompt + Jira status-transition trigger substrate (patterns/jira-status-transition-triggers-agent-workflow).
Related¶
- concepts/end-to-end-test-probe — the primitive whose flakiness is the load-bearing constraint.
- concepts/test-reliability-through-simplification — primary scope-altitude lever.
- concepts/alert-fatigue — the consequence of under-remediated flakiness.
- concepts/playwright-locator-auto-wait: framework-altitude lever.
- concepts/react-hydration — one of the Zalando-named cause classes.
- concepts/ktlo-engineering-chores — the work-category framing for the Atlassian agentic-remediation axis.
- concepts/work-item-as-agent-prompt — the substrate the agentic remediation runs through.
- concepts/agent-as-first-pass-investigator — the operational model for the agentic remediation.
- systems/playwright: Zalando's chosen flakiness-resilient framework.
- systems/cypress — the pre-existing framework whose flakiness drove the Playwright adoption.
- systems/jira — Atlassian's substrate for the agentic remediation.
- systems/rovo-dev — likely consuming agent.
- patterns/e2e-test-as-synthetic-probe — the pattern that makes flakiness a pager problem, not a CI problem.
- patterns/shadow-mode-alert-before-paging — the flakiness-validation gate.
- patterns/test-category-classifier-then-specialist-skill — Atlassian's agent-skill dispatch for flaky-test remediation.
- patterns/jira-status-transition-triggers-agent-workflow — Atlassian's trigger for the agent run.