ATLASSIAN 2026-06-01 Tier 3

Atlassian — How We Cut up to 80% of Engineering "Chores" Using AI Agents in Jira

Summary

A first-party engineering post from the Jira team describing how they cut up to 80% of the engineering time spent on KTLO ("keeping the lights on") chores by using AI agents inside Jira workflows. The architectural argument is that Jira itself (work items, custom fields, workflow state machines, and workflow transition automations) is the substrate for orchestrating agent work: each work item is a structured prompt; each workflow status change can trigger an agent with a custom system prompt; each PR opened by an agent is a draft pending human review. The post gives two production examples on Atlassian's own Jira repo. (1) Flaky-test triage and fix: a test-category classifier dispatches to a unit / integration / visual-regression specialist skill, which reproduces the failure under CPU-throttled conditions and opens a draft PR with the proposed fix. Against a pre-automation baseline of roughly one flaky test per day at ~2 hours per test, this saves roughly one engineering week per month (up to an 80% reduction in flaky-test eng hours). (2) Stale feature flag cleanup: a daily cron job creates Jira work items per stale flag with flag name + flag type + repo + file paths + line numbers + desired final state; engineers then transition the status to delegate to an agent, which executes a three-tier fallback skill chain (repository-specific cleanup skill → flag the repo as needing one + provide owner instructions → generic cleanup skill), producing 500+ merged PRs in 70 days. The shared thesis is the KTLO-as-pattern-recognition framing: "That pattern recognition is what makes delegation to agents possible. We know what good cleanup looks like, so we can define clear parameters, build review checkpoints, and design a human-in-the-loop system that produces code meeting our standards."
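
The "work item as structured prompt" idea can be illustrated with a minimal sketch. The fields mirror the ones the post lists (flag name, flag type, repo, file paths, line numbers, desired final state). The names `FlagWorkItem`, `CodeReference`, and `render_prompt`, the example values, and the prompt wording are all hypothetical; this is not Atlassian's schema or a Jira API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CodeReference:
    """One place where the flag appears in the repository."""
    path: str
    line: int


@dataclass(frozen=True)
class FlagWorkItem:
    """Hypothetical stand-in for a Jira work item describing one stale flag."""
    key: str
    flag_name: str
    flag_type: str            # e.g. "rollout-gate" or "experiment"
    repo: str
    references: Tuple[CodeReference, ...]
    desired_final_state: str  # e.g. "keep the 'on' branch"


def render_prompt(item: FlagWorkItem, system_prompt: str) -> str:
    """Combine the transition's system prompt with the work item fields."""
    refs = "\n".join(f"- {r.path}:{r.line}" for r in item.references)
    return (
        f"{system_prompt}\n\n"
        f"Work item: {item.key}\n"
        f"Flag: {item.flag_name} ({item.flag_type})\n"
        f"Repository: {item.repo}\n"
        f"Code references:\n{refs}\n"
        f"Desired final state: {item.desired_final_state}\n"
    )


# Illustrative values only.
example = FlagWorkItem(
    key="DEMO-1",
    flag_name="new_sidebar",
    flag_type="rollout-gate",
    repo="example/web-app",
    references=(CodeReference("src/sidebar.ts", 42),),
    desired_final_state="keep the 'on' branch",
)
prompt = render_prompt(example, "Remove the stale flag and open a draft PR.")
```

Keeping the system prompt as a separate argument reflects the split described in the post: the work item carries the task, while the workflow transition carries the instructions.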

Key takeaways

  • KTLO is the right delegation target. Atlassian frames KTLO ("keeping the lights on") explicitly as the category of work that is "small, but important maintenance tasks nobody wants to spend time on. This includes work like cleaning up old feature flags, chasing flaky tests, fixing identified vulnerabilities, addressing accessibility issues, and chipping away at a long tail of bugs." The selection criterion for delegating to agents is pattern recognition over years of doing the work: "Our team has spent years fixing these exact categories of issues. That pattern recognition is what makes delegation to agents possible." The inverse claim is implicit: delegating greenfield design work the team has never done is much harder. Canonicalised here as concepts/ktlo-engineering-chores. (Source: sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in)

  • Jira is the orchestration substrate, not just the tracker. "Jira is the heart of our strategy. Each work item is a record of the tasks we need to complete and acts as a prompt for agents. All the context an agent needs is shared using the work item, Atlassian's Teamwork Graph, and the explicit instructions we include in our workflow automations." The work item carries structured fields (flag name, flag type, repo, file paths, line numbers, desired final state) that become the agent's prompt; the workflow state machine carries the human decision ("approve this for an agent to start"); and the workflow transition carries the custom system prompt that tells the agent how to do the work. Canonicalised as concepts/work-item-as-agent-prompt and patterns/jira-status-transition-triggers-agent-workflow. (Source: sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in)

  • Agent does the first pass; engineer does the merge gate. The load-bearing operational model is that agents do investigation, diagnosis, and proposed fix as a draft PR, then a human reviews before merge. "The agent handles the repetitive first pass: investigation, diagnosis, and a proposed fix. Engineers validate the change before it is merged. What used to require hours of manual investigation can now become minutes of review." When the agent's diagnosis is "this is a false positive", it "can stop and summarise that outcome, commenting on the original Jira work item" — the agent is allowed to short-circuit the workflow with a bounded action (comment, no PR). Canonicalised as concepts/agent-as-first-pass-investigator; sibling to the greenfield-work patterns/agentic-pr-triage pattern, where the agent picks issues from a queue and drafts PRs against the team's product code. (Source: sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in)

  • Specialised skills per failure category, not one mega-skill. For flaky tests, Atlassian first classifies the test type (unit / integration / visual regression) and dispatches to a category specialist skill: "Unit test specialist: focuses on asynchronous timing issues, mocks, fake timers, and test isolation. Integration test specialist: focuses on browser automation issues, network races, page stability, and environment setup. Visual regression specialist: focuses on deterministic rendering, snapshot updates, image diffs, and visual test stability." The classification step preserves agent context budget — a single mega-skill with all failure modes would either not fit in context or dilute the specialist guidance. Each skill also includes reproduction instructions that use slower / CPU-throttled conditions to mimic CI's worst case: "our agents can run the failing test repeatedly under slower or CPU-throttled conditions to mimic CI condition as closely as possible." Canonicalised as patterns/test-category-classifier-then-specialist-skill. (Source: sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in)

  • Heuristic cron job emits agent-ready work items. For stale feature flag cleanup, Atlassian "run[s] a daily cron job that creates and updates Jira work items for every stale flag including: Flag name and type … Repository and code references: the exact repo, file paths, and line numbers where the flag appears. Desired final state: what the code should look like once the flag is removed. For example, for a rollout gate, this is typically the 'on' or 'off' branch to preserve. For experiments, it may be the winning cohort, a specific variant's behavior, or a custom path defined by the experiment owner." The cron job's heuristic is the embodiment of years of human pattern recognition; the work item is the agent's complete brief. Canonicalised as patterns/heuristic-cron-emits-agent-work-items. (Source: sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in)

  • Three-tier skill fallback chain handles per-codebase variability. "Atlassian hosts thousands of repositories owned by hundreds of teams, each with their own codebases and conventions so a 'one-size-fits-all' doesn't work. We've encoded our cleanup experience into repository-specific agent skills, and the system prompt gives each agent a clear fallback path: (1) If available, use the repository's existing cleanup procedure that gives the agent purpose-built guidance for that codebase. (2) Flag repositories that could benefit from a dedicated skill, and provide the repo owners with instructions to generate a cleanup procedure. (3) Fallback to a generic cleanup skill that works across most codebases." Tier 2 is the interesting move — the agent's failure mode of "no purpose-built skill exists" is itself converted into operational signal that prompts the team to author one. Canonicalised as patterns/agent-skill-with-fallback-chain; deepens concepts/agent-orchestration-skill with the per-codebase variant. (Source: sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in)

  • Triage step gets its own custom prompt, distinct from the fix step. For flaky tests, "when a ticket is created, our workflow starts by delegating triage to an agent to verify whether the issue is real using a custom prompt. If it looks like a false positive, the agent can stop and summarise that outcome, commenting on the original Jira work item." The triage step's exit conditions are: (a) false positive → comment-only, (b) reproducible → enter fix workflow. Splitting triage from fix is what makes the agent's behaviour bounded — a triage agent that opens a PR for a false-positive flaky test is a worse failure than one that doesn't. Generalises patterns/agentic-pr-triage to maintenance work and adds the triage-as-bounded-stage discipline. (Source: sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in)

  • Operational outcome — flaky tests. "Our team previously spent two hours resolving a flaky test. We encountered roughly one flaky test per day, sometimes more. Each required an engineer to inspect the CI failure, reproduce the issue locally or in CI-like conditions, determine if the problem was in the test or product code, and prepare a fix. Now that we've implemented agentic workflows with Jira, we save roughly one engineering week every month, which means we've reduced eng hours spent on flaky tests by up to 80%." That implies a pre-automation cost of ≈ 1 test/day × 2 h ≈ 10 h/week ≈ 40 h/month. An 80% reduction saves up to ~32 h/month, which is roughly the one engineering week the post cites, leaving ~8 h/month of residual effort on agent review and escapes. (Source: sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in)

  • Operational outcome — feature flag cleanup. "We built a system in Jira to automate the bulk of this work. So far, it's responsible for more than 500 merged PRs in the past 70 days." ≈ 7 merged PRs/day on this single category of KTLO chore on the Jira repo alone. PRs are not throwaway: each is a real flag-removal commit going through the standard review and merge-queue path. (Source: sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in)

  • The flag-state-across-systems problem motivates the heuristic cron. "On a large, multi-product codebase, cleanup is harder than 'just removing the code.' A flag might be fully rolled out for some customers but still active for others due to compliance requirements, release tracks, or experiment holdouts. Piecing together a flag's true state across systems was manual and error-prone. The work was tedious, time-consuming, and repetitive — perfect for agents." The heuristic cron's real value is encoding the "piece together state across systems" lookup so the agent receives a fully resolved brief — agent doesn't need to re-discover the flag's true state. (Source: sources/2026-06-01-atlassian-how-we-cut-up-to-80-of-engineering-chores-using-ai-agents-in)
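
The daily cron that "creates and updates" work items for stale flags could be sketched as an idempotent scan. The post does not disclose the staleness heuristic, so the rule used here (rollout settled at 0% or 100% and unchanged for 30 days) is purely an assumption, as are the names `FlagState`, `is_stale`, and `daily_scan` and the in-memory `tracker` dict standing in for Jira.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List


@dataclass
class FlagState:
    name: str
    flag_type: str
    rollout_percent: int   # assumed to be already resolved across systems
    last_changed: date


def is_stale(flag: FlagState, today: date, min_age_days: int = 30) -> bool:
    """ASSUMED heuristic: fully on or off, and untouched for min_age_days."""
    settled = flag.rollout_percent in (0, 100)
    old_enough = today - flag.last_changed >= timedelta(days=min_age_days)
    return settled and old_enough


def daily_scan(flags: List[FlagState], tracker: Dict[str, dict], today: date) -> List[str]:
    """Create or update one work item per stale flag; return touched flag names."""
    touched = []
    for flag in flags:
        if not is_stale(flag, today):
            continue
        keep = "on" if flag.rollout_percent == 100 else "off"
        # Upsert keyed by flag name so reruns update rather than duplicate.
        tracker[flag.name] = {
            "flag_name": flag.name,
            "flag_type": flag.flag_type,
            "desired_final_state": f"keep the '{keep}' branch",
            "last_scanned": today.isoformat(),
        }
        touched.append(flag.name)
    return touched


today = date(2026, 6, 1)
flags = [
    FlagState("new_sidebar", "rollout-gate", 100, date(2026, 3, 1)),  # stale
    FlagState("beta_search", "rollout-gate", 50, date(2026, 1, 1)),   # still rolling out
    FlagState("fast_login", "rollout-gate", 0, date(2026, 5, 25)),    # changed too recently
]
tracker: Dict[str, dict] = {}
created = daily_scan(flags, tracker, today)
rerun = daily_scan(flags, tracker, today)
```

Running the scan twice leaves a single entry for `new_sidebar`, which is the property that lets a daily job both create and refresh work items without producing duplicates.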

Architecture sketch

                     KTLO chore detected
                  ┌─────────┴──────────┐
                  ▼                    ▼
           Daily cron scan        Test failure / signal
           (stale flag heuristic)
                  │                    │
                  ▼                    ▼
        Jira work item created (Flag + type + repo + paths + lines + desired final state)
        Workflow state: Open / Awaiting review
                  │  ── human decides this is real KTLO work, transitions status
        Workflow state: In Progress / Delegated to Agent
                  │  ── workflow transition triggers agent with CUSTOM SYSTEM PROMPT
                ┌───────────────── Agent run ─────────────────┐
                │                                              │
                │   Stage 1: TRIAGE (custom triage prompt)      │
                │     • Verify issue is real                    │
                │     • If false positive: comment, STOP        │
                │                                              │
                │   Stage 2: SKILL DISPATCH                     │
                │     For flaky tests:                          │
                │       classify(unit | integration | visual)   │
                │       → category specialist skill             │
                │     For flag cleanup:                         │
                │       FALLBACK CHAIN:                         │
                │         1) repo-specific skill                │
                │         2) flag repo for dedicated skill      │
                │         3) generic cleanup skill              │
                │                                              │
                │   Stage 3: REPRODUCE + FIX                    │
                │     • Run failing test under CPU throttling   │
                │     • Apply fix pattern                       │
                │     • Open DRAFT PR + comment on Jira         │
                │                                              │
                └───────────────────┬──────────────────────────┘
                       Engineer reviews draft PR
                       (minutes of review, not hours
                        of investigation)
                       Merge → Jira work item closes
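
The flaky-test path through the agent run can be sketched as a small function with a bounded triage exit. The specialist focus areas are abbreviated from the post; the filename-based `classify` rule, the `reproduces` callback, and the returned dict shape are assumptions for illustration, not the post's implementation.

```python
from enum import Enum
from typing import Callable, Dict


class FlakyCategory(Enum):
    UNIT = "unit"
    INTEGRATION = "integration"
    VISUAL = "visual"


# Focus areas abbreviated from the post's specialist descriptions.
SPECIALIST_FOCUS: Dict[FlakyCategory, str] = {
    FlakyCategory.UNIT: "async timing, mocks, fake timers, test isolation",
    FlakyCategory.INTEGRATION: "browser automation, network races, page stability",
    FlakyCategory.VISUAL: "deterministic rendering, snapshots, image diffs",
}


def classify(test_path: str) -> FlakyCategory:
    """ASSUMED classifier: a naming convention stands in for the real one."""
    if ".vr." in test_path:
        return FlakyCategory.VISUAL
    if ".integration." in test_path:
        return FlakyCategory.INTEGRATION
    return FlakyCategory.UNIT


def run_agent(test_path: str, reproduces: Callable[[str], bool]) -> dict:
    """Stage 1 triage, stage 2 dispatch, stage 3 draft PR (fix itself omitted)."""
    # Stage 1: bounded exit. A false positive yields a comment and no PR.
    if not reproduces(test_path):
        return {"action": "comment", "body": "Could not reproduce; likely false positive."}
    # Stage 2: pick exactly one specialist so only its guidance is loaded.
    category = classify(test_path)
    # Stage 3: record what would be opened for human review.
    return {
        "action": "draft_pr",
        "category": category.value,
        "skill_focus": SPECIALIST_FOCUS[category],
    }


flaky = run_agent("cart.integration.test.ts", reproduces=lambda _: True)
false_positive = run_agent("cart.test.ts", reproduces=lambda _: False)
```

The early return in stage 1 is the point of the split: the triage step can only comment, so a false positive never reaches the code path that opens a PR.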

Operational numbers

Domain                      | Pre-automation                             | Post-automation                                            | Source
----------------------------+--------------------------------------------+------------------------------------------------------------+------------
Flaky test resolution       | ~2 hours per flaky test, ~1 flaky test/day | up to 80% reduction in eng hours; ~1 eng-week/month saved  | Source post
Feature flag cleanup PRs    | manual, "tedious, time-consuming"          | 500+ merged PRs in 70 days                                 | Source post
Per-flag cleanup difficulty | manual cross-system state piecing          | Cron-resolved brief in Jira; agent picks up                | Source post
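
The figures above can be cross-checked with a few lines of arithmetic. The 5-day working week and 4-week month are rounding assumptions made here, not numbers from the post, and 500 is used as a lower bound for "more than 500" merged PRs.

```python
HOURS_PER_FLAKY_TEST = 2
FLAKY_TESTS_PER_WORKDAY = 1
WORKDAYS_PER_WEEK = 5    # assumption
WEEKS_PER_MONTH = 4      # assumption (rounded)
MAX_REDUCTION = 0.80     # the post says "up to 80%"

baseline_hours_per_month = (
    HOURS_PER_FLAKY_TEST
    * FLAKY_TESTS_PER_WORKDAY
    * WORKDAYS_PER_WEEK
    * WEEKS_PER_MONTH
)
saved_hours_per_month = baseline_hours_per_month * MAX_REDUCTION
residual_hours_per_month = baseline_hours_per_month - saved_hours_per_month

MERGED_PRS = 500         # lower bound
DAYS = 70
merged_prs_per_day = MERGED_PRS / DAYS

print(baseline_hours_per_month)        # 40
print(saved_hours_per_month)           # 32.0
print(residual_hours_per_month)        # 8.0
print(round(merged_prs_per_day, 1))    # 7.1
```

Under these assumptions the saving is about 32 hours per month, slightly under a 40-hour week, which fits the post's "roughly one engineering week" and "up to 80%" phrasing.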

Architectural decisions

  • Jira work item as agent prompt. Structured fields (flag name, type, repo, paths, line numbers, desired final state) carry the agent's full context; agent doesn't query other systems for the brief. (concepts/work-item-as-agent-prompt)
  • Workflow status transition as agent trigger. Status change by an engineer transitions the work item from human-review-queue into the agent-execution lane; the transition automation passes a custom system prompt to the agent. (patterns/jira-status-transition-triggers-agent-workflow)
  • Triage agent and fix agent are separate stages with separate prompts. Triage exits with a comment (false positive) or with a transition into fix; fix opens a draft PR. Splitting prevents the "agent opens a PR for a false positive" failure mode.
  • Specialised skills per failure category, not one mega-skill. Test-type classifier dispatches to unit / integration / visual specialists. (patterns/test-category-classifier-then-specialist-skill)
  • Three-tier skill fallback chain. Repo-specific skill → flag for dedicated skill creation → generic skill. Per-codebase variability handled at the skill layer, not the agent layer. (patterns/agent-skill-with-fallback-chain)
  • Daily cron emits work items with full pre-resolved context. Cron does the cross-system state-piecing; agent receives a ready-to-execute brief. (patterns/heuristic-cron-emits-agent-work-items)
  • Human-in-the-loop at merge. Agent prepares draft PR; engineer reviews + merges. CI / merge queue (systems/bitbucket-merge-queues on Jira repo) acts as the automated quality gate. (concepts/agent-as-first-pass-investigator)
  • Reproduction under CPU throttling. Skill includes instructions to mimic CI's slower-than-laptop conditions, so intermittent failures reproduce.
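
One possible reading of the three-tier fallback chain is sketched below: use the repo-specific skill when it exists; otherwise record a note for the repo owners and proceed with the generic skill. Whether tiers 2 and 3 run together in Atlassian's system is not stated in the post, so that ordering is an assumption, as are `resolve_cleanup_skill`, `SkillResolution`, and all skill and repo names.

```python
from dataclasses import dataclass, field
from typing import Dict, List

GENERIC_SKILL = "generic-flag-cleanup"   # hypothetical skill name


@dataclass
class SkillResolution:
    skill: str
    notes: List[str] = field(default_factory=list)


def resolve_cleanup_skill(repo: str, repo_skills: Dict[str, str]) -> SkillResolution:
    """Pick a cleanup skill for a repo following the three-tier fallback."""
    # Tier 1: prefer the repository's own purpose-built procedure.
    if repo in repo_skills:
        return SkillResolution(skill=repo_skills[repo])
    # Tier 2: surface the gap to the repo owners instead of failing silently.
    note = f"{repo}: no dedicated cleanup skill; owners should generate one."
    # Tier 3: still make progress with the generic skill.
    return SkillResolution(skill=GENERIC_SKILL, notes=[note])


repo_skills = {"example/web-app": "web-app-flag-cleanup"}
with_skill = resolve_cleanup_skill("example/web-app", repo_skills)
without_skill = resolve_cleanup_skill("example/billing", repo_skills)
```

Returning the note alongside the chosen skill keeps the "missing skill" signal as data the workflow can route to owners, rather than an error that stops the cleanup.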

Caveats

  • Agent identity not named. The post says "agent" and "AI agents" throughout but doesn't explicitly name Rovo Dev as the agent. The link to Atlassian's docs on collaborate-on-work-items-with-ai-agents points to the Jira-Cloud-native agent surface; whether this is Rovo Dev, a different Atlassian agent, or a generic third-party agent integration is not disclosed.
  • No model / token / cost disclosure. The post claims a time reduction of up to 80% but does not disclose LLM cost-per-PR, model used, token budget, or end-to-end latency from work-item-created to draft-PR-opened. Real KTLO economics depend on this: Dependabot is free, an LLM agent is not.
  • No false-positive escape rate. Flaky-test triage is the cheapest stage (comment-only) but the post does not disclose what fraction of triage decisions are wrong (real flake classified as false positive, or vice versa).
  • No "agent gets stuck" recovery story. When the agent loops unproductively, what's the timeout? When the fallback chain exhausts (no repo-specific skill, generic skill doesn't apply), what's the failure-mode contract?
  • No before/after PR-quality comparison. The 500-PRs-in-70-days number is throughput, not quality; revert rate, post-merge defect rate, and review-comment density vs. human-authored PRs are not disclosed.
  • Atlassian's own scale only. The post is internal Jira team experience; whether this generalises to teams with different KTLO categories (e.g. dependency upgrades, vulnerability patches) is asserted ("if your team is spending engineering hours on work that follows a repeatable pattern …") but not demonstrated.
  • Implicit dependency on merge queues. The 500-PRs-in-70-days throughput on the Jira repo only makes sense because Jira already has Bitbucket Merge Queues in place (sources/2026-04-29-atlassian-inside-atlassians-merge-queues) — without semantic-merge-conflict prevention, 7 agent-PRs/day would compound the very flakiness the agents are trying to eliminate. The merge queue substrate is load-bearing for the pattern's economics; the post doesn't call this out.
  • Doesn't disclose how Teamwork Graph is actually used. The post names "Atlassian's Teamwork Graph" as one of the context substrates ("All the context an agent needs is shared using the work item, Atlassian's Teamwork Graph, and the explicit instructions we include in our workflow automations") but doesn't describe what specifically the agent retrieves from it or how it's accessed (MCP server? API? Jira-side fetch?).

Sibling pattern cluster on this wiki

This source extends the wiki's existing agentic-development cluster (concepts/agentic-development-loop, systems/rovo-dev, systems/atlassian-fireworks) along a second axis, KTLO maintenance work rather than greenfield development:

  • Greenfield axis (sources/2026-04-24-atlassian-rovo-dev-driven-development; Fireworks built in 4 weeks by LLMs): inner loop is dev shard + AI-written e2e tests; outer loop is three parallel workspaces; pre-human review is adversarial-sub-agent.
  • Maintenance axis (this source): inner loop is triage-then-fix with category-specialised skills; outer loop is daily cron + Jira workflow status transitions; pre-human review is draft-PR-with-CI / merge-queue gate.

Both axes share the "agent does first pass, human does merge gate" contract (concepts/agent-as-first-pass-investigator) and the skill as the unit of procedural knowledge (concepts/agent-orchestration-skill).

The maintenance axis is also a sibling to:

Source
