PATTERN Cited by 1 source

Test category classifier then specialist skill

Intent

Before dispatching an agent to fix a flaky test, classify the test type (unit / integration / visual regression / …) and load a category-specialist orchestration skill instead of a single mega-skill that contains guidance for every possible failure mode. Classifying first preserves the agent's context budget, because guidance for failure modes irrelevant to the current test is never loaded.

Canonical articulation

Atlassian's Jira-team flaky-test workflow:

*"Rather than using one generic workflow for every flaky test, the skill can look at the type of test involved and apply specialised instructions for that category.

For example:

- Unit test specialist: focuses on asynchronous timing issues, mocks, fake timers, and test isolation.
- Integration test specialist: focuses on browser automation issues, network races, page stability, and environment setup.
- Visual regression specialist: focuses on deterministic rendering, snapshot updates, image diffs, and visual test stability.

To ensure our agents can diagnose the correct issue, each skill also includes reproduction instructions. For example, our agents can run the failing test repeatedly under slower or CPU-throttled conditions to mimic CI condition as closely as possible. This helps the agent reproduce intermittent failures that may not show up during a single local test run."* (Source: sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in)

Shape

   Flaky-test work item arrives (test path + CI failure log)
   ┌────────────────────────────────────────┐
   │ CLASSIFIER stage                       │
   │  - Inspect test file path, framework,  │
   │    glob patterns                       │
   │  - Output: category ∈ {unit,           │
   │    integration, visual-regression, …}  │
   └────────────────────────────────────────┘
   Switch on category:
   ┌────────────┐  ┌────────────────┐  ┌────────────────────┐
   │ UNIT skill │  │ INTEGRATION    │  │ VISUAL REGRESSION  │
   │ - async    │  │ skill          │  │ skill              │
   │   timing   │  │ - browser auto │  │ - deterministic    │
   │ - mocks    │  │ - network race │  │   rendering        │
   │ - fake     │  │ - page state   │  │ - snapshot updates │
   │   timers   │  │ - env setup    │  │ - image diffs      │
   │ - test     │  │                │  │ - visual stability │
   │   isolation│  │                │  │                    │
   └────┬───────┘  └────┬───────────┘  └─────┬──────────────┘
        │               │                     │
        └────────────── reproduction ─────────┘
                  (CPU-throttled,
                   repeated runs)
                  Diagnose + fix + draft PR

Why classify-then-dispatch (not one mega-skill)

| Strategy | Pros | Cons | Atlassian's choice |
|---|---|---|---|
| Single mega-skill | One file to maintain | Context-window bloat; dilution of specialist guidance; agent has to read everything | Rejected |
| Classify + specialist | Each skill is small + focused; classifier is lightweight; easy to extend with new categories | Two-stage dispatch; classifier can be wrong | Chosen |
| Skill-per-test | Maximally specialised | Combinatorial explosion; impossible to maintain | Not viable |

The classify-then-dispatch shape is a standard divide-and-conquer move translated into agent skills: classifier picks the bucket; specialist solves within the bucket.
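
The context-budget argument can be illustrated with a minimal sketch. The registry name `SKILLS`, the two helper functions, and the one-line guidance strings are all hypothetical placeholders (the post does not show Atlassian's skill files); the point is only that the dispatched prompt carries one category's guidance while the mega-skill prompt carries all of them.

```python
# Hypothetical skill registry: category -> guidance text. In a real setup each
# entry would be a separate skill file; these strings are placeholders that
# only echo the focus areas listed in the post.
SKILLS = {
    "unit": "Unit specialist: async timing, mocks, fake timers, test isolation.",
    "integration": (
        "Integration specialist: browser automation, network races, "
        "page stability, environment setup."
    ),
    "visual-regression": (
        "Visual regression specialist: deterministic rendering, "
        "snapshot updates, image diffs, visual test stability."
    ),
}


def build_prompt(category: str, work_item: str) -> str:
    """Load only the specialist guidance for the classified category."""
    if category not in SKILLS:
        # Assumption: an unknown category is surfaced, not silently guessed.
        raise KeyError(f"no specialist skill for category {category!r}")
    return f"{SKILLS[category]}\n\nWork item:\n{work_item}"


def build_mega_prompt(work_item: str) -> str:
    """The rejected alternative: every category's guidance in one prompt."""
    return "\n".join(SKILLS.values()) + f"\n\nWork item:\n{work_item}"


if __name__ == "__main__":
    item = "tests/unit/cart.spec.ts failed intermittently in CI"
    focused = build_prompt("unit", item)
    mega = build_mega_prompt(item)
    # The focused prompt is strictly shorter than the mega prompt.
    print(len(focused), len(mega))
```

With real skill files the gap would be far larger than in this toy, since each specialist skill also carries fix patterns and reproduction instructions.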

Specialist skill contents

Each specialist skill bundles three things:

  1. Failure-mode taxonomy for that category — what kinds of flake show up here (e.g. for unit tests: async timing, mocks, fake timers, isolation).
  2. Fix patterns matched to failure modes — well-known remediations the team has used before (encoded experience).
  3. Reproduction instructions — how to reliably repro the flake locally before fixing. "Our agents can run the failing test repeatedly under slower or CPU-throttled conditions to mimic CI condition as closely as possible." The CPU-throttled loop is the agent's "is this still flaky after my fix?" verifier.
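
One way to picture the three-part bundle is as a small record type. This is a sketch under assumptions: the class name `SpecialistSkill`, its field names, and the example fix-pattern wording are hypothetical and are not taken from Atlassian's skills; only the failure-mode names and the reproduction idea come from the post.

```python
from dataclasses import dataclass


@dataclass
class SpecialistSkill:
    """Hypothetical container for one category-specialist skill."""

    category: str
    failure_modes: list[str]       # 1. failure-mode taxonomy
    fix_patterns: dict[str, str]   # 2. failure mode -> known remediation
    reproduction: str              # 3. how to reproduce before fixing

    def __post_init__(self) -> None:
        # Every fix pattern must be keyed by a failure mode in the taxonomy.
        unknown = set(self.fix_patterns) - set(self.failure_modes)
        if unknown:
            raise ValueError(f"fix patterns for unlisted modes: {sorted(unknown)}")

    def render(self) -> str:
        """Flatten the skill into the text an agent would be given."""
        lines = [f"Category: {self.category}", "Failure modes:"]
        lines += [f"- {mode}" for mode in self.failure_modes]
        lines.append("Fix patterns:")
        lines += [f"- {mode}: {fix}" for mode, fix in self.fix_patterns.items()]
        lines.append(f"Reproduction: {self.reproduction}")
        return "\n".join(lines)


# Illustrative instance. The fix-pattern text is a placeholder, not the team's.
UNIT_SKILL = SpecialistSkill(
    category="unit",
    failure_modes=["async timing", "mocks", "fake timers", "test isolation"],
    fix_patterns={
        "async timing": "wait on the observable condition, not a fixed delay",
        "test isolation": "reset shared state between tests",
    },
    reproduction="Run the failing test repeatedly under CPU-throttled conditions.",
)

if __name__ == "__main__":
    print(UNIT_SKILL.render())
```

The validation in `__post_init__` is a design choice of this sketch: it keeps the fix patterns from drifting away from the taxonomy they are supposed to be matched against.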

Why CPU-throttled reproduction is load-bearing

Flaky tests by definition do not fail on every run, so a single local run can easily pass; the difference between a developer laptop and CI is one of the canonical flaky-test root causes. Mimicking CI's constrained environment is what makes two steps possible:

  • Reproduction before fixing — agent confirms the flake exists on its machine, not just CI's.
  • Verification after fixing — agent confirms the flake is gone after applying the fix pattern.

Without throttled-loop reproduction, the agent's "green on first run" signal is too weak to gate the PR.
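
Why one green run is weak evidence can be made concrete. The sketch below assumes each run fails independently with a fixed probability, which real flakes only approximate, and all function names are hypothetical. It does not throttle anything itself: the assumption is that `command` is already wrapped in whatever CPU-limited environment the team uses.

```python
import math
import subprocess
import sys


def runs_needed(flake_rate: float, confidence: float) -> int:
    """Smallest n with (1 - flake_rate) ** n <= 1 - confidence.

    That is, the number of runs needed to see at least one failure with
    probability >= confidence, assuming independent runs.
    """
    if not 0 < flake_rate < 1 or not 0 < confidence < 1:
        raise ValueError("flake_rate and confidence must be in (0, 1)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - flake_rate))


def count_failures(command: list[str], runs: int) -> int:
    """Run `command` `runs` times and count non-zero exit codes."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(command, capture_output=True)
        if result.returncode != 0:
            failures += 1
    return failures


def fix_is_verified(failures_before: int, failures_after: int) -> bool:
    """Gate: the flake was reproduced before the fix and is gone after it."""
    return failures_before > 0 and failures_after == 0


if __name__ == "__main__":
    # A test that fails 10% of the time needs 29 runs for 95% confidence.
    print(runs_needed(0.10, 0.95))
    # Stand-in for a test command; a real one would invoke the test runner.
    always_green = [sys.executable, "-c", "import sys; sys.exit(0)"]
    print(count_failures(always_green, runs=3))
```

Under these assumptions a test that flakes on 10% of runs passes a single run 90% of the time, which is why the loop count, not the first result, is what should gate the PR.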

The classifier itself

The post does not describe the classifier in depth. Plausible implementations:

  • Path-pattern matching: tests/unit/** → unit; tests/integration/** → integration; tests/visual/** → visual. Simple and reliable when the repo has a clean layout.
  • Test framework / glob heuristics: *.spec.ts running under Jest with a jsdom environment → unit; *.e2e.ts under Playwright → integration; *.snap files present → visual regression.
  • LLM-based classification: read the failing test file and ask the agent to classify; flexible but adds a classifier round-trip.

Atlassian doesn't disclose which they use; combining path patterns with framework heuristics is likely the cheapest default, since it needs no extra model call and behaves deterministically.
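
The first two options can be combined in a few lines. This is a sketch of one plausible classifier, not Atlassian's: the rule tables, the `framework` hint parameter, and the "unknown" fallback are assumptions, and the directory layout is the illustrative one from the bullets above.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Path rules are checked first. Note that fnmatch's "*" also matches "/",
# so "tests/unit/*" covers nested directories, like "tests/unit/**" above.
PATH_RULES = [
    ("tests/unit/*", "unit"),
    ("tests/integration/*", "integration"),
    ("tests/visual/*", "visual-regression"),
]

# Fallback heuristics: (filename pattern, framework hint, category).
FRAMEWORK_RULES = [
    ("*.spec.ts", "jest", "unit"),
    ("*.e2e.ts", "playwright", "integration"),
]


def classify(test_path: str, framework: Optional[str] = None) -> str:
    """Return a test category, or "unknown" if no rule matches."""
    for pattern, category in PATH_RULES:
        if fnmatchcase(test_path, pattern):
            return category
    for pattern, required_framework, category in FRAMEWORK_RULES:
        if framework == required_framework and fnmatchcase(test_path, pattern):
            return category
    # Assumption: unmatched tests go to a slower path (LLM or human triage).
    return "unknown"


if __name__ == "__main__":
    print(classify("tests/unit/cart/total.spec.ts"))       # unit
    print(classify("src/login.e2e.ts", "playwright"))      # integration
    print(classify("src/login.e2e.ts"))                    # unknown
```

Returning "unknown" instead of guessing keeps a weak match from silently loading the wrong specialist skill, and it leaves room for the LLM-based option as a fallback.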

Operational outcome

Atlassian's Jira-team flaky-test workflow:

  • Pre-automation: ~1 flaky test/day × ~2 hours/test ≈ 10 hours/week (assuming a 5-day work week).
  • Post-automation: "~80% reduction in eng hours spent on flaky tests"; ~1 engineering week saved per month.

The category-specialist dispatch is one of several load-bearing components of this outcome (along with triage-vs-fix split, work-item-as-prompt, and human review gate).
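
As a rough consistency check on the figures above, the arithmetic can be written out. The first three constants are the post's approximate numbers; the 5-day work week, 52/12 weeks per month, and 40-hour engineering week are assumptions made here, not stated in the post.

```python
FLAKY_TESTS_PER_DAY = 1      # from the post (approximate)
HOURS_PER_TEST = 2           # from the post (approximate)
REDUCTION = 0.80             # from the post (approximate)
WORKDAYS_PER_WEEK = 5        # assumption
WEEKS_PER_MONTH = 52 / 12    # assumption
HOURS_PER_ENG_WEEK = 40      # assumption

hours_per_week_before = FLAKY_TESTS_PER_DAY * HOURS_PER_TEST * WORKDAYS_PER_WEEK
hours_saved_per_month = hours_per_week_before * REDUCTION * WEEKS_PER_MONTH
eng_weeks_saved_per_month = hours_saved_per_month / HOURS_PER_ENG_WEEK

print(hours_per_week_before)                # 10
print(round(hours_saved_per_month, 1))      # 34.7
print(round(eng_weeks_saved_per_month, 2))  # 0.87
```

Under these assumptions the saving is a little under one 40-hour week per month, which is consistent with the post's "~1 engineering week".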

Caveats

  • Misclassification cost. If an integration test is classified as "unit", the wrong skill is loaded: the agent applies unit-test fix patterns, fails to reproduce the real flake, and either gives up or proposes a wrong fix. Classifier precision is therefore a load-bearing input.
  • Category drift over time. As frameworks evolve (e.g. a new test type — contract tests, fuzz tests, performance regression tests), the classifier and the specialist skill set must be extended. Governance not described.
  • Specialist skill content can go stale. A unit-test specialist skill written for the team's mocking library three years ago may not match current conventions. No freshness contract described.
  • Reproduction doesn't always succeed. Some flakes depend on real-network / real-CDN / real-browser conditions that local CPU throttling can't simulate; the post acknowledges this implicitly ("mimic CI condition as closely as possible") but doesn't say how often flakes escape reproduction.

Sibling on adjacent KTLO categories

The category-classifier-then-specialist pattern can be applied to:

  • Vulnerability remediation: classify by vulnerability class (deserialization, XSS, SSRF, vulnerable dependency) → specialist fix pattern.
  • Accessibility fixes: classify by WCAG criterion → specialist remediation skill.
  • Bug long-tail: classify by failure-mode (null-pointer, off-by-one, race) → specialist debugging skill.

Atlassian only documents the flaky-test instance; the pattern plausibly generalises to other keep-the-lights-on (KTLO) categories that fit the same divide-and-conquer dispatch shape.
