
ATLASSIAN 2026-04-24


Atlassian — Rovo Dev Driven Development: How we built a platform in 4 weeks

Summary

A first-person engineering post from the builder of Fireworks — Atlassian's Firecracker-microVM orchestrator on Kubernetes, the "secure execution engine behind Atlassian's AI agent infrastructure." The system accepts an OCI container image plus a command, boots a hardware-isolated Firecracker VM, and runs the workload, with 100ms warm starts, live migration between hosts, eBPF network-policy enforcement, shared volumes, snapshot filesystem restore, and sidecar sandboxes. The implementation included a scheduler, autoscaler, node agents, Envoy ingress layers, and Raft persistence, "built in four weeks, entirely by LLMs." The post's real contribution is the engineering workflow that made that timeline possible: three parallel git workspaces each running an agent on a different branch; AI writing its own e2e tests and looping against a real isolated dev shard on the shared AWS scms Kubernetes cluster until they pass; an adversarial !review-pr sub-agent spun up before any human looks at the diff; and a workflow meta-skill that encodes the Fireworks golden-path end-to-end loop. The core architectural claim: with no hand-written code, validation is everything ("test outputs, not read code"), so the safety net shifts from manual review to CI/CD pipelines, sharding, blast-radius control via RBAC + JIT access, canary deploys across multiple clusters, and AI-written e2e tests that are "the primary validation harness."
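
The submit-and-boot contract the summary describes is compact enough to model. Everything below is an illustrative sketch, not Atlassian's actual API: the class and method names are invented, and the cold-boot figure is an assumption; only the 100ms warm-start number comes from the post.

```python
from dataclasses import dataclass


@dataclass
class JobSpec:
    image: str          # OCI container image reference
    command: list[str]  # command to run inside the guest


@dataclass
class MicroVM:
    job: JobSpec
    warm: bool          # True if restored from a snapshot (fast path)
    boot_ms: int


class FireworksClient:
    """Toy model of the described contract; not the real Fireworks API."""

    def __init__(self) -> None:
        self._snapshots: set[str] = set()

    def submit(self, spec: JobSpec) -> MicroVM:
        # Warm start: restore a snapshot for a previously seen image
        # (the post's 100ms figure); otherwise cold-boot (figure assumed).
        warm = spec.image in self._snapshots
        vm = MicroVM(job=spec, warm=warm, boot_ms=100 if warm else 2000)
        self._snapshots.add(spec.image)  # later submits take the warm path
        return vm
```

The first submit of an image cold-boots; every later submit of the same image restores from a snapshot and hits the 100ms warm path.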

Key takeaways

  • Fireworks is Firecracker-on-Kubernetes, with a full production surface. "You submit an OCI container image and a command, and it boots a hardware-isolated Firecracker VM, runs your workload. Features include 100ms warm starts, live migration between hosts, eBPF network policy enforcement, shared volumes, and snapshot filesystem restore, sidecar sandboxes. To do this we had to build a scheduler, autoscaler, node agents, envoy ingress layers, raft persistence, and much more." This is a named production instance of the micro-VM substrate-on-Kubernetes shape — the same architectural move systems/fly-kubernetes makes, applied to Atlassian's internal agent-execution substrate rather than a public cloud IaaS offering. (Source: sources/2026-04-24-atlassian-rovo-dev-driven-development)
  • Four weeks, by LLMs, end-to-end. "Even two months ago, I wouldn't have believed we'd have a Firecracker-based microVM platform with 100ms warm starts and live migration between hosts, built in four weeks, entirely by LLMs." This is the headline throughput claim the rest of the post justifies. The claim is not that the LLM designed the system in a vacuum — the engineer describes themselves as "more of an architect and builder" who "explores architecture options" with the agent, then "let[s] it implement." The shift is from writing lines of code to specifying, reviewing, and validating them.
  • Three parallel workspaces, three agents, one human. "Three workspaces, each checked out on a different branch, each with an agent working. Split terminal: agent on one side, a shell on the other so I can poke at things while it works. Always have something running. If your agents are idle, you're leaving productivity on the table." Canonicalised as patterns/three-workspace-parallel-agent-workflow. The dispatcher model: the human queues prompts, reads along to the thinking, anticipates when each agent will return, and has the next task ready. Productivity is measured in agents-busy ratio, not keystrokes-per-minute.
  • AI writes the e2e tests; e2e tests are the primary validation harness. "With no hand-written code, validation is everything. [...] AI writes the e2e tests too. The agent writes tests, deploys to a dev shard, runs them, and loops on failures until they pass. The test suite is the primary proof that things work." Canonicalised as patterns/ai-writes-own-e2e-tests. The architectural inversion: if you're not reading code, you can't rely on code review as your safety net, so the test suite has to carry the correctness guarantee alone. The LLM writing the tests is acceptable because "if you're reading any code, read the tests." Human code review shifts up the stack: architecture, design intent, risk — "rather than nitpicking details."
  • Dev shards are the iteration substrate. "Dev shard loop: Every feature gets deployed to an isolated dev shard on a real cluster. The agent deploys, tests e2e, fixes issues, redeploys. This catches integration issues that unit tests miss." Each developer has "real independent shards ... that won't break anyone else." Canonicalised as patterns/dev-shard-iteration-loop — a named instance of the general agentic development loop where the execution environment is a full production-like Kubernetes cluster shard, not a container or laptop emulator. "Just like you wouldn't expect a human to ship working code without access to a real environment, your AI needs end-to-end access too."
  • Adversarial !review-pr sub-agent before any human. "For review, have an adversarial persona subagent that spins up and reviews what the main agent has written. I have this one tied to a !review-pr prompt shortcut that spins it up as an independent subagent." Canonicalised as patterns/adversarial-review-subagent — a specialised concepts/adversarial-review-persona that runs before the human is in the loop at all, so by the time a human reviews the PR, the obvious issues are gone. "For bigger, scarier PRs: spin up an independent agent to review before a human even looks at it." This is the pre-human tier in patterns/pre-human-agent-review.
  • Agent skills encode the golden-path loop. "Skills are useful for specific domains or common actions within your repo. Internally we've built lots of skills! Skills for PRs, using CLI, specific domains like Raft, gRPC. We've built a meta-workflow/orchestration skill for Fireworks development. It doesn't do one narrow technical thing, instead it gives the agent a set of 'golden path' loops for how to work on Fireworks changes end-to-end." Canonicalised as concepts/agent-orchestration-skill and patterns/agent-orchestration-meta-skill — the meta-skill is deliberately not a narrow tool-binding; it's a procedural runbook the agent consults for multi-step workflows. A second named example: a skill that "automates deploying, operating, and tearing down isolated Fireworks dev shards on the shared AWS scms Kubernetes cluster."
  • Black-box validation over code reading. "If I need to verify, I test outputs, not read code. Submit a job, check it boots in 100ms, verify migration preserves state, confirm network policy blocks what it should. Black box validation." Canonicalised as concepts/black-box-validation. The load-bearing reframing: when you stop hand-writing code, reading the code is no longer the cheapest validation path — "Treat code as a black box. If you can comprehensively validate via inputs and outputs, you often don't need to read the code and what it's doing." The human's job becomes specifying observable invariants (boot in 100ms, network policy blocks X, migration preserves state) and writing tests that assert them.
  • CI/CD is the automated quality gate. "CI pipeline as quality gate: Every PR runs lint, vet, tests, and Helm validation. The agent reads pipeline output and addresses failures before requesting review." Canonicalised as patterns/ci-as-agent-quality-gate — the agent is explicitly inside the CI/CD loop, reading output and addressing failures autonomously before a human review is requested. This composes with the dev-shard loop (the first tier of testing) and the adversarial review sub-agent (the pre-human correctness tier).
  • Safety net shifts from review to architecture. "If you're not hand-writing code, your safety net shifts: CI/CD pipelines (automated quality gate), Sharding (limit the blast radius of any single change), RBAC / JIT access (control who — and what — can write), Progressive rollouts & canary deploys across multiple clusters, AI-written e2e tests (primary validation harness)." The five-lever safety net, canonicalised as patterns/rbac-jit-as-agent-safety-net (access control lever) and reinforcing concepts/blast-radius framing. "Main deploys to dev without PRGB, so we can validate internally fast. Production gets canary deploys across multiple clusters."
  • Agentic teams, not just agentic individuals. "If you're blocked on human review, your throughput is gated by the slowest reviewer. Teams need to embrace AI-assisted reviews and shift their attention to the high level [...] rather than nitpicking details. The agents can handle the details." The author explicitly frames this as a team-level constraint: the individual's agent throughput is capped by the team's human PR-review latency, so the team has to move to "a prod branching model" where "main deploy[s] to dev without PRGB (Peer Review / Green Build)", with Production behind canary guards.
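
The loop behind patterns/ai-writes-own-e2e-tests and patterns/dev-shard-iteration-loop can be sketched in a few lines. `agent` and `shard` below are hypothetical stand-ins, not Rovo Dev's real interface:

```python
def dev_shard_loop(agent, shard, max_iters: int = 10) -> int:
    """Agent writes the e2e suite, deploys to an isolated dev shard,
    runs the tests, and loops on failures until green. Returns the
    number of iterations the loop took."""
    tests = agent.write_e2e_tests()        # the AI writes the tests too
    for i in range(max_iters):
        shard.deploy(agent.current_code())
        failures = shard.run(tests)
        if not failures:
            return i + 1                   # green: the suite is the proof
        agent.fix(failures)                # agent reads failures, patches
    raise RuntimeError("loop did not converge; escalate to a human")
```

The shard is a real, isolated slice of a production-like cluster, which is what lets this loop catch the integration issues unit tests miss.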

Systems named

System Role
systems/atlassian-fireworks The Firecracker-microVM orchestrator on Kubernetes; Atlassian's secure AI-agent execution engine.
systems/firecracker The micro-VM monitor Fireworks uses; provides the hardware-isolation boundary per workload.
systems/kubernetes The substrate Fireworks runs on; the "shared AWS scms Kubernetes cluster" is where dev shards live.
systems/envoy Fireworks' ingress-layer proxy — named verbatim as "envoy ingress layers."
systems/bitbucket-pipelines The CI pipeline Rovo Dev reads output from when reviewing PRs.
systems/rovo-dev Atlassian's AI development agent surface — runs the agentic loop; built-in Bitbucket + Pipelines integration is called out as "genuinely great."

Concepts extracted

Concept One-liner
concepts/hardware-isolated-microvm-on-kubernetes Compose Kubernetes (scheduling / networking / lifecycle) with Firecracker (VM-grade isolation) to run mutually distrusting workloads on shared K8s nodes.
concepts/black-box-validation Validate by observable inputs and outputs, not by reading the code — the primary validation path when LLMs write the code.
concepts/agentic-development-loop The LLM → execution-environment → feedback → LLM closed loop. (Existing page, extended.)
concepts/ai-writes-own-tests The agent writes the e2e tests, not just the production code; if you're reading any code, read the tests.
concepts/agent-orchestration-skill A skill that encodes a multi-step golden-path workflow rather than a single narrow tool-binding.
concepts/adversarial-review-persona A sub-agent prompted as an adversarial reviewer that critiques the main agent's output before any human is in the loop.
concepts/blast-radius Designing so a single failure / change doesn't take everything with it. (Existing page, extended.)

Patterns extracted

Pattern One-liner
patterns/ai-writes-own-e2e-tests The agent writes the end-to-end test suite, deploys to a real dev shard, runs tests, and loops on failures until green — the test suite becomes the primary correctness proof.
patterns/dev-shard-iteration-loop Every feature ships to an isolated, real-cluster dev shard for e2e iteration, catching integration bugs that unit tests miss.
patterns/adversarial-review-subagent A !review-pr prompt spawns an independent sub-agent with an adversarial prompt that reviews the main agent's PR before a human does.
patterns/agent-orchestration-meta-skill A skill that encodes the golden-path loops for working on a specific codebase end-to-end, not a single narrow tool binding.
patterns/pre-human-agent-review For bigger / scarier PRs, spin up an independent reviewer agent before a human even looks at it, so human review time is spent on architecture, not nitpicks.
patterns/three-workspace-parallel-agent-workflow Three parallel checkouts, three branches, three agents; the human is a dispatcher reading thinking and queueing prompts.
patterns/ci-as-agent-quality-gate The agent is inside the CI loop: reads lint / vet / test / Helm-validation output and addresses failures autonomously before requesting human review.
patterns/rbac-jit-as-agent-safety-net Shift the safety net from manual review to RBAC + JIT access control over who and what can write to production.
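
As a sketch of patterns/adversarial-review-subagent: a !review-pr shortcut only needs to spawn an independent sub-agent with an adversarial persona and surface its findings before any human opens the diff. `spawn_agent` and the prompt text are illustrative stand-ins, not the post's actual tooling:

```python
ADVERSARIAL_PROMPT = (
    "You are a hostile reviewer. Hunt for correctness bugs, race "
    "conditions, missing tests, and risky rollout behaviour. Do not praise."
)


def review_pr(diff: str, spawn_agent) -> list[str]:
    """Pre-human review tier: an independent sub-agent (fresh context,
    adversarial persona) critiques the main agent's diff first."""
    reviewer = spawn_agent(system_prompt=ADVERSARIAL_PROMPT)
    # Findings are addressed before a human ever sees the PR, so human
    # review time goes to architecture and risk, not nitpicks.
    return reviewer.run(f"Review this diff:\n{diff}")
```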

Architectural sketch — Fireworks (as described)

                  Developer / LLM agent
                         │ submit(OCI image, command)
             ┌───────────────────────┐
             │      Fireworks API    │  (Envoy ingress layers)
             └───────────┬───────────┘
                  ┌──────┴──────┐
                  │  Scheduler  │◄────── Autoscaler
                  │ (Raft-backed│
                   │ persistence)│
                  └──────┬──────┘
                         │ place(VM)
         Kubernetes cluster (AWS scms shared)
                   ┌─────┴─────┐
                   │ Node agent│ ─── runs per node
                   └─────┬─────┘
                         │ boot / snapshot-restore / migrate
               ┌───────────────────┐
               │ Firecracker µVM   │  hardware isolation
               │  (guest workload) │  100ms warm start
               └───────────────────┘
                 eBPF network policy
                 (enforcement in-kernel)

Features surfaced at this layer:
  • 100ms warm starts (snapshot-based fast restore)
  • live migration between hosts (VM-level state transfer)
  • eBPF network policy enforcement (in-kernel ingress/egress)
  • shared volumes (cross-VM data plane)
  • snapshot filesystem restore (fast clone)
  • sidecar sandboxes (multi-VM co-location per workload)
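
Each feature in this list is an observable invariant, which is what makes concepts/black-box-validation workable: the checks can be written against inputs and outputs alone. A sketch, where `client` and all of its methods are a hypothetical test fixture rather than the Fireworks API:

```python
def check_black_box_invariants(client) -> None:
    """Black-box validation: assert the invariants the post names
    (boot latency, migration preserves state, policy blocks egress)
    without reading any implementation code."""
    vm = client.submit(image="app:latest", command=["serve"])
    assert vm.boot_ms <= 100                          # warm-start budget
    vm.write_state("key", "value")
    moved = client.migrate(vm, to="other-host")
    assert moved.read_state("key") == "value"         # migration keeps state
    assert not client.egress_allowed(moved, "evil.example")  # policy holds
```

If these pass, the human never needed to read the scheduler or node-agent code; the invariants carry the correctness claim.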

Operational numbers

Metric Value Source phrasing
Warm start latency 100 ms "100ms warm starts"
Time to build the platform 4 weeks "built in four weeks, entirely by LLMs"
Parallel agents per developer 3 "Three workspaces, each checked out on a different branch, each with an agent working"
Deploy target for main (pre-prod) dev "main deploy[s] to dev without PRGB"
Production rollout shape canary across multiple clusters "Production gets canary deploys across multiple clusters"
CI validation scope lint, vet, tests, Helm validation "Every PR runs lint, vet, tests, and Helm validation"
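
The CI row above is the gate the agent sits inside (patterns/ci-as-agent-quality-gate). A sketch of that loop, with `agent` and `pipeline` as illustrative stand-ins; only the stage names come from the post:

```python
def ci_gate_loop(agent, pipeline, max_attempts: int = 5) -> bool:
    """The agent reads pipeline output and fixes failures autonomously,
    requesting human review only once every stage is green."""
    stages = ["lint", "vet", "test", "helm-validate"]
    for _ in range(max_attempts):
        failures = [s for s in stages if not pipeline.run(s)]
        if not failures:
            agent.request_human_review()    # humans see an already-green PR
            return True
        agent.fix(pipeline.logs(failures))  # agent addresses the output
    return False                            # gate stays closed
```

This composes with the dev-shard loop (first tier) and the adversarial reviewer (pre-human tier): by the time a human is asked to look, both machine tiers have already passed.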

Caveats

  • The claim "entirely by LLMs" is builder-level, not audit-level. The post doesn't disclose lines-of-code metrics, what fraction of final code was human-edited, commit-author proportions, or third-party / open-source code composition. A Firecracker orchestrator inherits substantial upstream code (systems/firecracker itself, k8s client libraries, systems/envoy config, Raft libraries) that nobody hand-writes. The headline is best read as "the first-party integration & orchestration code in Fireworks was generated by LLMs under the builder's supervision, in four weeks."
  • The post is product PR as well as architecture. This is an Atlassian Rovo Dev marketing post. It passes the Tier-3 scope filter because it names a real production system (Fireworks) with concrete features (100ms warm starts, live migration, eBPF, Raft) and a real workflow (dev shards on the shared AWS scms K8s cluster, !review-pr sub-agents, meta-workflow skills) — the architectural content crosses the 20% threshold AGENTS.md uses for product launches.
  • No numbers on agent success rate, loop iterations per PR, or false-positive rate on adversarial review. Throughput metrics ("agents busy") and platform-level metrics (100ms warm start) are disclosed; the agent-quality / loop-convergence metrics that would let an outside reader reproduce the "4 weeks" headline are not.
  • Team shape is implicit, not disclosed. "Your team needs to be agentic too" names a constraint; it doesn't say how large the team is, or what fraction of the Fireworks codebase came from the single author versus teammates working the same workflow. The three-workspaces pattern describes one developer's setup, not a fleet-level organisational norm.
  • Tier-3 reliability note. The Atlassian Engineering blog is Tier-3 on the sysdesign-wiki; treat architectural claims as what the builder says they shipped rather than independently verified production behavior. The five-year-longitudinal track record of Tier-1 sources (Netflix, AWS, Cloudflare) is not yet established here.
