CONCEPT Cited by 1 source

Flaky test¶

Definition¶

A flaky test is a test that passes and fails non-deterministically against the same code — producing false positives (fails when the system is working) and false negatives (passes when the system is broken). Flakiness is the chronic failure mode of end-to-end browser-driving tests, where real-network / real-browser / real-async interactions produce a long tail of timing, race, and selector-instability failures that no amount of code-under-test fixing can eliminate.

Why flakiness matters operationally¶

For CI tests, flakiness wastes engineering time — retriggered builds, debugged-and-dismissed failures, loss of trust in the suite. For test probes running against live production, flakiness is much worse: every false positive is either a pager event (concepts/alert-fatigue) or a silently-ignored alert that masks real incidents. The arithmetic:

Cadence	Reliability	False positives/day
Per-build CI, 120 builds/day	80 %	~24
Per-build CI, 120 builds/day	95 %	~6
30-min cron probe	95 %	~2.4
30-min cron probe	99 %	~0.48
30-min cron probe	99.9 %	~0.05

Source for CI numbers: Zalando's Cypress suite ran ≈120 builds/day, started at ~80 % success ("an average of 24 builds a day which were failing as false positives, causing unnecessary friction"), and needed a multi-year investment to reach ~95 %. Pager-grade probes require reliability beyond anything CI infrastructure typically achieves.
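The table rows follow from multiplying runs per day by the failure probability, treating every failure as a false positive. A minimal sketch of that arithmetic (the function name is illustrative, and 48 runs/day assumes one probe run every 30 minutes around the clock):

```python
def false_positives_per_day(runs_per_day: float, reliability: float) -> float:
    """Expected spurious failures per day if every failure is a false positive."""
    return runs_per_day * (1.0 - reliability)


CI_BUILDS_PER_DAY = 120            # figure quoted for Zalando's Cypress suite
CRON_RUNS_PER_DAY = 24 * 60 / 30   # assumption: one run every 30 minutes, all day

rows = [
    ("Per-build CI", CI_BUILDS_PER_DAY, 0.80),
    ("Per-build CI", CI_BUILDS_PER_DAY, 0.95),
    ("30-min cron probe", CRON_RUNS_PER_DAY, 0.95),
    ("30-min cron probe", CRON_RUNS_PER_DAY, 0.99),
    ("30-min cron probe", CRON_RUNS_PER_DAY, 0.999),
]

for name, runs, reliability in rows:
    expected = false_positives_per_day(runs, reliability)
    print(f"{name:<18} {reliability:>6.1%} {expected:6.2f}")
```

The same reliability figure is far more costly at probe cadence than the raw count suggests, because each probe failure is a candidate pager event rather than a retriggered build.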

Named causes (Zalando's empirical list)¶

Hydration timing under SSR — test scripts execute before the UI is interactive. See concepts/react-hydration. Zalando's Cypress era added a "mechanism to detect when our pages are hydrated with React after server-side rendering, as Cypress would fail eagerly executing test scripts on a non-interactive UI."
Selector instability — CSS-structure selectors break on visual refactors even when behaviour is unchanged. Remediation: data-testid attributes, role-based locators.
Dynamic content — Zalando's product pages are "highly contextual", sometimes with products not yet released; selectors that assume a specific product / stock state fail on other runs. Remediation: test setup context (seed a known-good product candidate).
Network / third-party timing — real CDN cache misses, real CMS content loads, real API-gateway response times are non-deterministic. Remediation: auto-wait at framework altitude, expect.toPass retries at assertion altitude.
Non-visible-content assertions — the element is in the DOM but not visible (hidden behind a modal, outside the viewport). Remediation: CSS pseudo-classes like :visible (Playwright augments standard CSS with visibility-aware matchers).
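Two of the causes above (hydration timing and selector instability) can be illustrated together. The sketch below is written against any object that behaves like a Playwright Page in the Python sync API, using the method names `wait_for_function`, `get_by_test_id`, and `click`. The `window.__APP_HYDRATED__` flag and the `add-to-cart` test id are hypothetical; the source does not describe how Zalando's hydration detector actually worked.

```python
# Hypothetical flag the application would set once React hydration finishes.
HYDRATION_CHECK = "() => window.__APP_HYDRATED__ === true"


def add_to_cart(page, timeout_ms: float = 10_000) -> None:
    """Click 'add to cart' only after the page reports it is interactive.

    `page` is duck-typed: anything exposing Playwright-style
    `wait_for_function` and `get_by_test_id` methods will do.
    """
    # Hydration timing: block until the (assumed) hydration flag is set.
    page.wait_for_function(HYDRATION_CHECK, timeout=timeout_ms)
    # Selector instability: target a data-testid rather than CSS structure,
    # so a visual refactor does not break the locator.
    page.get_by_test_id("add-to-cart").click()
```

With a real Playwright page, the click would additionally go through the framework's own auto-wait checks, so the explicit hydration wait is only needed when server-rendered markup looks ready before its event handlers are attached.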

Structural remediations¶

Framework-altitude (lowest effort, highest leverage)¶

Auto-wait — Playwright's Locator auto-waits for attached / visible / stable / enabled before every interaction. Removes an entire class of timing bugs without test-level code.
Auto-retry for web assertions — Playwright retries expect(locator).toHaveText(...) until it passes or times out. Covers slow updates.
Rich tracing — Playwright captures a step-by-step trace with DOM snapshots; makes the "why did this fail exactly once last Tuesday" postmortem tractable.

Test-altitude (targeted retry)¶

Local retry at assertion level — Playwright's expect.toPass wraps a block in a retry loop. Useful for flaky convergence assertions. Zalando adds these during shadow-mode iteration.
Explicit waitForLoadState / waitForURL — for navigation boundaries the framework can't auto-wait on.
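The idea behind a block-level retry can be sketched without the framework. The helper below is not Playwright's implementation; it is a minimal stand-in, with a hypothetical name and default values, that re-runs a block of assertions on a back-off schedule until it passes or a deadline expires.

```python
import time
from typing import Callable, Sequence


def retry_until_pass(
    block: Callable[[], None],
    timeout_s: float = 5.0,
    intervals_s: Sequence[float] = (0.1, 0.25, 0.5),
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-run `block` until it stops raising AssertionError or time runs out.

    Returns the number of attempts made. Once the interval schedule is
    exhausted, the last interval is reused.
    """
    deadline = clock() + timeout_s
    attempt = 0
    while True:
        attempt += 1
        try:
            block()
            return attempt
        except AssertionError:
            if clock() >= deadline:
                raise  # surface the most recent failure to the caller
            sleep(intervals_s[min(attempt - 1, len(intervals_s) - 1)])
```

Only `AssertionError` is retried here, so a genuine crash in the block still fails immediately. Retrying should stay scoped to assertions that are expected to converge; wrapping a whole scenario in such a loop would hide real regressions.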

Scope-altitude (highest leverage, discipline-heavy)¶

Simplification — fewer scenarios, fewer interactions per scenario. Zalando's probe suite at publication: three named CBOs only. "By focusing on a smaller number of features and interactions, we were able to reduce the likelihood of false positives."
Shadow-mode gating — new scenarios enter email-only mode, iterate until zero false positives, then promote. See patterns/shadow-mode-alert-before-paging.
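Shadow-mode gating can be pictured as a small state machine: alerts from a new scenario go to email until it has run clean for long enough, and only then to the pager. The class, field names, and the `required_clean_runs` threshold are illustrative assumptions; the source only says scenarios iterate until zero false positives before promotion.

```python
from dataclasses import dataclass


@dataclass
class ProbeScenario:
    name: str
    mode: str = "shadow"   # "shadow" routes to email, "paging" routes to the pager
    clean_streak: int = 0  # consecutive runs without a false positive

    def route_alert(self) -> str:
        """Where a failure of this scenario is sent in its current mode."""
        return "email" if self.mode == "shadow" else "pager"

    def record_run(self, failed: bool, real_incident: bool = False) -> None:
        """Track the streak; a failure during a real incident is not a false positive."""
        false_positive = failed and not real_incident
        self.clean_streak = 0 if false_positive else self.clean_streak + 1

    def maybe_promote(self, required_clean_runs: int) -> bool:
        """Promote to paging once the clean streak reaches the (assumed) threshold."""
        if self.mode == "shadow" and self.clean_streak >= required_clean_runs:
            self.mode = "paging"
        return self.mode == "paging"
```

Resetting the streak on every false positive forces each fix to be re-validated from zero, which is the point of the gate.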

Relationship to test-tier altitude¶

Tier	Flakiness tolerance	Remediation
Unit	~0 % (deterministic by design)	Fix the code or the test
Integration	Low; a few retries OK	Test containers + seed data
CI e2e	Medium; 95 %+ achievable via multi-year investment	Retry flaky tests; selector discipline
Probe e2e	Must approach 99.9 % — pager-grade	Scope reduction + auto-wait framework + shadow-mode gating

Flaky-test triage as KTLO automation target¶

A second, distinct framing of flakiness arrived with Atlassian's 2026-06-01 Jira-team post: flaky-test triage and fix is one of the canonical KTLO engineering chores that maps cleanly onto agentic automation. The argument:

Each flaky test takes a human ~2 hours (inspect the CI failure, reproduce it locally or under CI conditions, classify it as a test bug or a product bug, prepare a fix). At ~1 flaky test/day over a five-day week, that's ≈10 h/week of drag.
The remediation is pattern-recognised — over years a team has seen the same root causes repeatedly (async timing, mocks, fake timers, browser-automation race, page state, snapshot drift) and knows the fix patterns.
Encoded into per-test-category specialist skills (patterns/test-category-classifier-then-specialist-skill), the agent does triage + diagnosis + draft PR; engineers review and merge.

"Our team previously spent two hours resolving a flaky test. We encountered roughly one flaky test per day, sometimes more. [...] Now that we've implemented agentic workflows with Jira, we save roughly one engineering week every month, which means we've reduced eng hours spent on flaky tests by up to 80%." (Source: sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in)
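The classifier-then-specialist shape of that workflow can be sketched as a two-step dispatch. The three category names come from the Atlassian description (unit, integration, visual-regression); everything else here is an assumption for illustration, including the path-based classifier, the grouping of root causes under each category, and the reproduction hints.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SpecialistSkill:
    category: str
    known_causes: Tuple[str, ...]  # failure-mode taxonomy (grouping is assumed)
    reproduction_hint: str         # how the agent should try to reproduce


SKILLS: Dict[str, SpecialistSkill] = {
    "unit": SpecialistSkill(
        "unit", ("async timing", "mocks", "fake timers"),
        "re-run in a loop under CPU throttling",
    ),
    "integration": SpecialistSkill(
        "integration", ("browser-automation race", "page state"),
        "re-run in a loop under CPU throttling",
    ),
    "visual-regression": SpecialistSkill(
        "visual-regression", ("snapshot drift",),
        "re-render and compare snapshots repeatedly",
    ),
}


def classify(test_path: str) -> str:
    """Toy classifier: hypothetical path conventions stand in for the real one."""
    if "/visual/" in test_path or ".vr." in test_path:
        return "visual-regression"
    if "/integration/" in test_path:
        return "integration"
    return "unit"


def dispatch(test_path: str) -> SpecialistSkill:
    """Pick the specialist skill that should triage this failing test."""
    return SKILLS[classify(test_path)]
```

Keeping classification separate from the skills means a new category only needs one new entry plus a classifier rule, and each skill can carry its own fix patterns without the others growing.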

The agent's reproduction discipline is the CPU-throttled-loop Atlassian describes: "our agents can run the failing test repeatedly under slower or CPU-throttled conditions to mimic CI condition as closely as possible." This addresses the canonical laptop ≠ CI environment failure mode — without throttled-loop reproduction, the agent's "green on first run" signal is too weak to gate a fix.
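Why a single green run is weak evidence can be made concrete. Assuming runs fail independently at a fixed rate (a simplification), the chance of an unbroken green streak shrinks geometrically with the number of runs. The helper names below are illustrative, and how `run_once` applies throttling is deliberately left to the caller.

```python
from typing import Callable


def chance_all_green(flake_rate: float, runs: int) -> float:
    """Probability that a test failing at `flake_rate` passes `runs` times in a row.

    Assumes independent runs with a constant failure rate.
    """
    return (1.0 - flake_rate) ** runs


def reproduce(run_once: Callable[[], bool], attempts: int) -> int:
    """Run the test `attempts` times and return how many runs failed.

    `run_once` returns True on a pass; it is expected to execute the test
    under whatever slowed-down conditions the caller has arranged.
    """
    return sum(0 if run_once() else 1 for _ in range(attempts))
```

Under these assumptions, a test that flakes 5 % of the time still passes a single run 95 % of the time, so one green run says little about whether a fix worked; a loop of several dozen runs is needed before a green result carries weight.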

This second framing does not replace the Zalando flakiness-as-pager-grade-constraint framing — they sit at different altitudes. Zalando's framing is about which tests are even worth running at pager altitude (probe-grade scope reduction, auto-wait framework, shadow-mode gating). Atlassian's framing is about how to remediate the long tail of CI flakes that survive the framework defences. They compose: the framework prevents flakes from entering the suite; the agent processes the residual flakes that get through.

Seen in¶

— canonical wiki instance of flakiness-as-pager-grade constraint. Zalando's full remediation arithmetic: 80 % → 95 % Cypress reliability required multi-year investment, but 95 % × 30-min cron = too many false positives for paging. Playwright's auto-wait / auto-retry primitives replaced the hand-rolled hydration-detection kludge, and shadow-mode iteration on selectors + expect.toPass retries eliminated remaining false positives over a few weeks. Post-promotion: 0 % false-positive rate (only pager firing was on a real incident).
sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in — canonical wiki instance of flaky-test triage as agentic KTLO target. Atlassian's Jira team applied patterns/test-category-classifier-then-specialist-skill to the long tail of CI flakes: a unit / integration / visual-regression classifier dispatches to a category specialist skill bundling failure-mode taxonomy + fix patterns + CPU-throttled reproduction instructions. ~80% reduction in flaky-test eng hours; ~1 engineering week saved per month. Composes with the wider work-item-as-agent-prompt + Jira status-transition trigger substrate (patterns/jira-status-transition-triggers-agent-workflow).

Related¶

concepts/end-to-end-test-probe — the primitive whose flakiness is the load-bearing constraint.
concepts/test-reliability-through-simplification — primary scope-altitude lever.
concepts/alert-fatigue — the consequence of under-remediated flakiness.
concepts/playwright-locator-auto-wait — framework-altitude lever.
concepts/react-hydration — one of the Zalando-named cause classes.
concepts/ktlo-engineering-chores — the work-category framing for the Atlassian agentic-remediation axis.
concepts/work-item-as-agent-prompt — the substrate the agentic remediation runs through.
concepts/agent-as-first-pass-investigator — the operational model for the agentic remediation.
systems/playwright — Zalando's chosen flakiness-resilient framework.
systems/cypress — the pre-existing framework whose flakiness drove the Playwright adoption.
systems/jira — Atlassian's substrate for the agentic remediation.
systems/rovo-dev — likely consuming agent.
patterns/e2e-test-as-synthetic-probe — the pattern that makes flakiness a pager problem, not a CI problem.
patterns/shadow-mode-alert-before-paging — the flakiness-validation gate.
patterns/test-category-classifier-then-specialist-skill — Atlassian's agent-skill dispatch for flaky-test remediation.
patterns/jira-status-transition-triggers-agent-workflow — Atlassian's trigger for the agent run.