PATTERN Cited by 1 source

Test category classifier then specialist skill¶

Intent¶

Before dispatching an agent to fix a flaky test, classify the test type (unit / integration / visual regression / …) and load a category-specialist orchestration skill rather than a single mega-skill that carries guidance for every possible failure mode. Classifying first preserves the agent's context budget: guidance for failure modes irrelevant to the current test is never loaded.

Canonical articulation¶

Atlassian's Jira-team flaky-test workflow:

*"Rather than using one generic workflow for every flaky test, the skill can look at the type of test involved and apply specialised instructions for that category.

For example:

- Unit test specialist: focuses on asynchronous timing issues, mocks, fake timers, and test isolation.
- Integration test specialist: focuses on browser automation issues, network races, page stability, and environment setup.
- Visual regression specialist: focuses on deterministic rendering, snapshot updates, image diffs, and visual test stability.

To ensure our agents can diagnose the correct issue, each skill also includes reproduction instructions. For example, our agents can run the failing test repeatedly under slower or CPU-throttled conditions to mimic CI condition as closely as possible. This helps the agent reproduce intermittent failures that may not show up during a single local test run."* (Source: sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in)

Shape¶

   Flaky-test work item arrives (test path + CI failure log)
              │
              ▼
   ┌────────────────────────────────────────┐
   │ CLASSIFIER stage                       │
   │  - Inspect test file path, framework,  │
   │    glob patterns                       │
   │  - Output: category ∈ {unit,           │
   │    integration, visual-regression, …}  │
   └────────────────────────────────────────┘
              │
              ▼
   Switch on category:
   ┌────────────┐  ┌────────────────┐  ┌────────────────────┐
   │ UNIT skill │  │ INTEGRATION    │  │ VISUAL REGRESSION  │
   │ - async    │  │ skill          │  │ skill              │
   │   timing   │  │ - browser auto │  │ - deterministic    │
   │ - mocks    │  │ - network race │  │   rendering        │
   │ - fake     │  │ - page state   │  │ - snapshot updates │
   │   timers   │  │ - env setup    │  │ - image diffs      │
   │ - test     │  │                │  │ - visual stability │
   │   isolation│  │                │  │                    │
   └────┬───────┘  └────┬───────────┘  └─────┬──────────────┘
        │               │                     │
        └────────────── reproduction ─────────┘
                  (CPU-throttled,
                   repeated runs)
                         │
                         ▼
                  Diagnose + fix + draft PR

Why classify-then-dispatch (not one mega-skill)¶

Strategy	Pros	Cons	Atlassian's choice
Single mega-skill	One file to maintain	Context-window-bloat; dilution of specialist guidance; agent has to read everything	Rejected
Classify + specialist	Each skill is small + focused; classifier is lightweight; easy to extend with new categories	Two-stage dispatch; classifier can be wrong	Chosen
Skill-per-test	Maximally specialised	Combinatorial explosion; impossible to maintain	Not viable

The classify-then-dispatch shape is a standard divide-and-conquer move translated into agent skills: classifier picks the bucket; specialist solves within the bucket.
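
The dispatch half of this shape can be sketched in a few lines. This is a minimal illustration, not Atlassian's implementation: the one-file-per-category layout, the `SKILL_FILES` mapping, and the `select_skill` helper are all hypothetical names introduced here.

```python
from pathlib import PurePosixPath

# Hypothetical layout (an assumption, not from the post): one skill file
# per test category.
SKILL_FILES = {
    "unit": PurePosixPath("skills/flaky-test/unit.md"),
    "integration": PurePosixPath("skills/flaky-test/integration.md"),
    "visual-regression": PurePosixPath("skills/flaky-test/visual-regression.md"),
}


def select_skill(category: str) -> PurePosixPath:
    """Return the single skill file to load for the classified category.

    Only this one file would enter the agent's context; the other
    specialists are never read.
    """
    try:
        return SKILL_FILES[category]
    except KeyError:
        # Unknown category: surface it rather than guessing a specialist.
        raise ValueError(f"no specialist skill for category {category!r}") from None
```

Raising on an unknown category, instead of silently defaulting to one specialist, keeps a classifier gap visible to whoever maintains the skill set.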

Specialist skill contents¶

Each specialist skill bundles three things:

Failure-mode taxonomy for that category — what kinds of flake show up here (e.g. for unit tests: async timing, mocks, fake timers, isolation).
Fix patterns matched to failure modes — well-known remediations the team has used before (encoded experience).
Reproduction instructions — how to reliably repro the flake locally before fixing. "Our agents can run the failing test repeatedly under slower or CPU-throttled conditions to mimic CI condition as closely as possible." The CPU-throttled loop is the agent's "is this still flaky after my fix?" verifier.
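
One way to picture the three-part bundle is as a small record. The sketch below is an assumption about structure only: the `SpecialistSkill` class, its field names, and the two example fix patterns are hypothetical, while the four failure modes are the ones the post lists for unit tests.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SpecialistSkill:
    category: str
    failure_modes: Tuple[str, ...]  # taxonomy for this category
    fix_patterns: Dict[str, str]    # failure mode -> known remediation
    reproduction: str               # how to repro before fixing

    def modes_without_fix(self) -> List[str]:
        """List failure modes that have no fix pattern recorded yet."""
        return [m for m in self.failure_modes if m not in self.fix_patterns]


# Failure modes come from the post; the fix patterns are illustrative only.
UNIT_SKILL = SpecialistSkill(
    category="unit",
    failure_modes=("async timing", "mocks", "fake timers", "test isolation"),
    fix_patterns={
        "async timing": "await the observable condition instead of a fixed sleep",
        "test isolation": "reset shared state in setup/teardown",
    },
    reproduction="run the failing test repeatedly under CPU-throttled conditions",
)
```

A helper like `modes_without_fix` is one cheap way to spot gaps between the taxonomy and the encoded remediations.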

Why CPU-throttled reproduction is load-bearing¶

Flaky tests by definition do not fail on every run, so a single local run rarely shows the failure; the difference between a developer laptop and CI is one of the canonical flaky-test root causes. Mimicking CI's constrained environment is what makes two steps possible:

Reproduction before fixing — agent confirms the flake exists on its machine, not just CI's.
Verification after fixing — agent confirms the flake is gone after applying the fix pattern.

Without throttled-loop reproduction, the agent's "green on first run" signal is too weak to gate the PR.
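
A repeated-run loop of this kind might look like the following sketch. The post does not say how Atlassian throttles or how many runs it uses, so `failure_rate`, `gate`, the default of 20 runs, and the `throttle_prefix` hook are all assumptions. The prefix is simply prepended to the test command; on POSIX it could be something like `["nice", "-n", "19"]`, which lowers scheduling priority rather than truly capping CPU.

```python
import subprocess
from typing import Sequence


def failure_rate(test_cmd: Sequence[str], runs: int = 20,
                 throttle_prefix: Sequence[str] = ()) -> float:
    """Run test_cmd `runs` times and return the fraction of failing runs.

    throttle_prefix is a hypothetical hook: a command prefix that slows
    the run down, supplied by the caller.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = 0
    for _ in range(runs):
        result = subprocess.run([*throttle_prefix, *test_cmd],
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
    return failures / runs


def gate(rate_before_fix: float, rate_after_fix: float) -> bool:
    """Allow the PR only if the flake was reproduced and then disappeared."""
    return rate_before_fix > 0.0 and rate_after_fix == 0.0
```

Zero failures across a finite number of runs is evidence, not proof, that the flake is gone; more runs tighten the signal at the cost of agent time.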

The classifier itself¶

The post does not describe the classifier in depth. Plausible implementations:

Path-pattern matching — tests/unit/** → unit; tests/integration/** → integration; tests/visual/** → visual. Simple and reliable when the repo has a clean layout.
Test framework / glob heuristics — *.spec.ts running under Jest with a jsdom environment → unit; *.e2e.ts under Playwright → integration; *.snap files present → visual regression.
LLM-based classification — read the failing test file and ask the agent to classify; flexible but adds a classifier round-trip.

Atlassian doesn't disclose which approach they use; combining path patterns with framework heuristics is probably the cheapest and most reliable default.
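
A minimal sketch of the path-pattern option, assuming the example layout and file suffixes listed above. The `RULES` table and `classify` function are hypothetical, and a filename suffix is only a rough proxy for the framework check (it cannot see whether Jest runs with a jsdom environment).

```python
from fnmatch import fnmatchcase
from typing import Optional

# Ordered rules: first match wins, so directory layout outranks suffixes.
# In fnmatch patterns "*" also matches "/", so these cover nested paths.
RULES = [
    ("tests/unit/*", "unit"),
    ("tests/integration/*", "integration"),
    ("tests/visual/*", "visual-regression"),
    ("*.e2e.ts", "integration"),
    ("*.spec.ts", "unit"),
]


def classify(test_path: str) -> Optional[str]:
    """Return the test category for a path, or None if no rule matches."""
    for pattern, category in RULES:
        if fnmatchcase(test_path, pattern):
            return category
    return None
```

Returning `None` for an unmatched path leaves room for a slower fallback, such as the LLM-based classification mentioned above, instead of forcing a guess.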

Composes with¶

patterns/jira-status-transition-triggers-agent-workflow — the upstream trigger that fires the agent on the flaky-test work item.
patterns/agent-skill-with-fallback-chain — sibling skill-dispatch pattern on a different selector axis (per-codebase vs. per-test-category). The two compose: a flaky-test agent could classify on category, then within that, fall back from repo-specific specialist skill → flag for skill creation → generic specialist skill.
concepts/agent-orchestration-skill — the unit being dispatched.
concepts/agent-as-first-pass-investigator — the operational model around the dispatched skill.
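
The composition with the fallback chain described above could be sketched as follows. The skill naming scheme (`repo/category`, `generic/category`), the `Resolution` type, and `resolve_skill` are hypothetical; neither pattern's source specifies them.

```python
from typing import NamedTuple, Set


class Resolution(NamedTuple):
    skill: str
    flag_for_skill_creation: bool


def resolve_skill(repo: str, category: str, available: Set[str]) -> Resolution:
    """Pick a skill for an already-classified test category.

    Prefer the repo-specific specialist; otherwise flag the gap and fall
    back to the generic specialist for the same category.
    """
    repo_specific = f"{repo}/{category}"
    if repo_specific in available:
        return Resolution(repo_specific, False)
    generic = f"generic/{category}"
    if generic in available:
        return Resolution(generic, True)
    raise LookupError(f"no skill available for category {category!r}")
```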

Operational outcome¶

Atlassian's Jira-team flaky-test workflow:

Pre-automation: ~1 flaky test/day × ~2 hours/test ≈ 10 hours/week.
Post-automation: "~80% reduction in eng hours spent on flaky tests"; ~1 engineering week saved per month.
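
As a rough consistency check on these figures, the arithmetic can be worked through under three assumptions the post does not state: a 5-day working week, a 40-hour engineering week, and 52/12 weeks per month.

```python
WORKDAYS_PER_WEEK = 5      # assumption
HOURS_PER_ENG_WEEK = 40    # assumption
WEEKS_PER_MONTH = 52 / 12  # assumption, about 4.33

# ~1 flaky test/day x ~2 hours/test
baseline_hours_per_week = 1 * 2 * WORKDAYS_PER_WEEK

# ~80% reduction in hours spent
saved_hours_per_week = 0.8 * baseline_hours_per_week

saved_eng_weeks_per_month = (
    saved_hours_per_week * WEEKS_PER_MONTH / HOURS_PER_ENG_WEEK
)
```

Under those assumptions the saving comes to about 8 hours per week, or roughly 0.87 engineering weeks per month, which is in line with the quoted "~1 engineering week saved per month".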

The category-specialist dispatch is one of several load-bearing components of this outcome (along with triage-vs-fix split, work-item-as-prompt, and human review gate).

Caveats¶

Misclassification cost. If an integration test is classified as "unit", the wrong skill is loaded: the agent applies unit-test fix patterns, fails to reproduce the real flake, and either gives up or proposes a wrong fix. Classifier precision is the load-bearing input.
Category drift over time. As new test types appear (e.g. contract tests, fuzz tests, performance regression tests), the classifier and the specialist skill set must be extended. The post does not describe governance for this.
Specialist skill content can go stale. A unit-test specialist skill written for the team's mocking library three years ago may not match current conventions. No freshness contract described.
Flakes aren't always reproducible locally. Some depend on real-network / real-CDN / real-browser conditions that the agent's local CPU throttling can't simulate; the post acknowledges this implicitly ("mimic CI condition as closely as possible") but doesn't disclose the escape rate.

Siblings in adjacent KTLO categories¶

The category-classifier-then-specialist pattern can be applied to:

Vulnerability remediation: classify by CVE category (deserialization, XSS, SSRF, dependency-CVE) → specialist fix pattern.
Accessibility fixes: classify by WCAG criterion → specialist remediation skill.
Bug long-tail: classify by failure-mode (null-pointer, off-by-one, race) → specialist debugging skill.

Atlassian documents only the flaky-test instance; the pattern should generalise to other KTLO categories that share the same divide-and-conquer dispatch shape.

Seen in¶

sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in — canonical wiki source. Three named specialist skills (unit / integration / visual regression) with bullet-list failure-mode taxonomies; CPU-throttled-reproduction discipline; load-bearing for the ~80% flaky-test eng-hour reduction.

Related¶

concepts/flaky-test — substrate.
concepts/agent-orchestration-skill — the unit being dispatched.
concepts/work-item-as-agent-prompt — the substrate framing the trigger arrives through.
concepts/agent-as-first-pass-investigator — the operational model.
concepts/ktlo-engineering-chores — the work category.
concepts/agent-context-window — the budget the classify-then-dispatch shape is optimising against.
patterns/jira-status-transition-triggers-agent-workflow — upstream trigger.
patterns/agent-skill-with-fallback-chain — sibling dispatch pattern on a different selector axis.
patterns/agent-orchestration-meta-skill — sibling greenfield-axis orchestration skill.
systems/rovo-dev — likely consuming agent.