Summary
A first-person engineering post from the builder of Fireworks —
Atlassian's Firecracker-microVM orchestrator on Kubernetes, the
"secure execution engine behind Atlassian's AI agent infrastructure."
The system accepts an OCI container image plus a command, boots a
hardware-isolated Firecracker VM, and runs the workload, with
100ms warm starts, live migration between hosts, eBPF
network-policy enforcement, shared volumes, snapshot
filesystem restore, and sidecar sandboxes. The implementation
included a scheduler, autoscaler, node agents, Envoy ingress
layers, and Raft persistence, "built in four weeks, entirely by
LLMs." The post's real contribution is the engineering workflow
that made that timeline possible: three parallel git workspaces each
running an agent on a different branch; AI writing its own e2e
tests and looping against a real isolated dev shard on the
shared AWS scms Kubernetes cluster until they pass; an adversarial
!review-pr sub-agent spun up before any human looks at the diff;
and a workflow meta-skill
that encodes the Fireworks golden-path end-to-end loop. The core
architectural claim: with no hand-written code, validation is
everything — "test outputs, not read code" — so the safety net
shifts from manual review to CI/CD pipelines, sharding,
blast-radius control via RBAC + JIT access,
canary deploys across multiple clusters, and AI-written e2e tests
that are "the primary validation harness."
Key takeaways
- Fireworks is Firecracker-on-Kubernetes, with a full production
surface. "You submit an OCI container image and a command, and
it boots a hardware-isolated Firecracker VM, runs your workload.
Features include 100ms warm starts, live migration between hosts,
eBPF network policy enforcement, shared volumes, and snapshot
filesystem restore, sidecar sandboxes. To do this we had to build
a scheduler, autoscaler, node agents, envoy ingress layers, raft
persistence, and much more." This is a named production instance
of the micro-VM
substrate-on-Kubernetes shape — the same architectural move
systems/fly-kubernetes makes, applied to Atlassian's internal
agent-execution substrate rather than a public cloud IaaS offering.
(Source: sources/2026-04-24-atlassian-rovo-dev-driven-development)
- Four weeks, by LLMs, end-to-end. "Even two months ago, I
wouldn't have believed we'd have a Firecracker-based microVM
platform with 100ms warm starts and live migration between hosts,
built in four weeks, entirely by LLMs." This is the headline
throughput claim the rest of the post justifies. The claim is not
that the LLM designed the system in a vacuum — the engineer
describes themselves as "more of an architect and builder" who
"explores architecture options" with the agent, then "let[s]
it implement." The shift is from writing lines of code to
specifying, reviewing, and validating them.
- Three parallel workspaces, three agents, one human. "Three
workspaces, each checked out on a different branch, each with an
agent working. Split terminal: agent on one side, a shell on the
other so I can poke at things while it works. Always have
something running. If your agents are idle, you're leaving
productivity on the table." Canonicalised as
patterns/three-workspace-parallel-agent-workflow. The
dispatcher model: the human queues prompts, reads along with the
agents' thinking, anticipates when each agent will return, and has the
next task ready. Productivity is measured in agents-busy ratio,
not keystrokes-per-minute.
- AI writes the e2e tests; e2e tests are the primary validation
harness. "With no hand-written code, validation is everything.
[...] AI writes the e2e tests too. The agent writes tests, deploys
to a dev shard, runs them, and loops on failures until they pass.
The test suite is the primary proof that things work."
Canonicalised as patterns/ai-writes-own-e2e-tests. The
architectural inversion: if you're not reading code, you can't
rely on code review as your safety net, so the test suite has to
carry the correctness guarantee alone. The LLM writing the tests
is acceptable because "if you're reading any code, read the
tests." Human code review shifts up the stack: architecture,
design intent, risk — "rather than nitpicking details."
- Dev shards are the iteration substrate. "Dev shard loop:
Every feature gets deployed to an isolated dev shard on a real
cluster. The agent deploys, tests e2e, fixes issues, redeploys.
This catches integration issues that unit tests miss." Each
developer has "real independent shards ... that won't break
anyone else." Canonicalised as
patterns/dev-shard-iteration-loop — a named instance of the
general agentic development
loop where the execution environment is a full production-like
Kubernetes cluster shard, not a container or laptop emulator.
"Just like you wouldn't expect a human to ship working code
without access to a real environment, your AI needs end-to-end
access too."
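The shape of that loop can be sketched in Go. Everything below is illustrative: `deployToShard`, `runE2E`, and the fix counter are stand-ins for "push the branch to an isolated dev shard", "run the e2e suite against it", and "hand failures back to the agent" — none of these names come from the post.

```go
// Sketch of the dev-shard iteration loop: deploy, test end-to-end,
// fix, redeploy, until green or an iteration budget is exhausted.
package main

import "fmt"

type shard struct {
	name    string
	version int
}

// deployToShard stands in for deploying the current branch to an
// isolated dev shard on a real cluster.
func deployToShard(s *shard) { s.version++ }

// runE2E stands in for the e2e suite; here it "passes" once enough
// fixes have been applied, so the loop terminates deterministically.
func runE2E(s *shard, fixesApplied int) (failures []string) {
	if fixesApplied < 2 {
		return []string{"migration does not preserve state"}
	}
	return nil
}

// iterate is the loop itself; the agent loops on failures until they pass.
func iterate(s *shard, maxIters int) (green bool, iters int) {
	fixes := 0
	for i := 1; i <= maxIters; i++ {
		deployToShard(s)
		if failures := runE2E(s, fixes); len(failures) == 0 {
			return true, i
		}
		fixes++
	}
	return false, maxIters
}

func main() {
	s := &shard{name: "dev-shard-example"}
	green, iters := iterate(s, 10)
	fmt.Println(green, iters) // converges once the stubbed failures are fixed
}
```

The point of the structure is the budget: the agent iterates autonomously against a real environment, and only an exhausted budget (or green tests) hands control back.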
- Adversarial
!review-pr sub-agent before any human. "For
review, have an adversarial persona subagent that spins up and
reviews what the main agent has written. I have this one tied to
a !review-pr prompt shortcut that spins it up as an independent
subagent." Canonicalised as
patterns/adversarial-review-subagent — a specialised
concepts/adversarial-review-persona that runs before the
human is in the loop at all, so by the time a human reviews the
PR, the obvious issues are gone. "For bigger, scarier PRs: spin
up an independent agent to review before a human even looks at
it." This is the pre-human tier in
patterns/pre-human-agent-review.
- Agent skills encode the golden-path loop. "Skills are useful
for specific domains or common actions within your repo.
Internally we've built lots of skills! Skills for PRs, using CLI,
specific domains like Raft, gRPC. We've built a
meta-workflow/orchestration skill for Fireworks development. It
doesn't do one narrow technical thing, instead it gives the agent
a set of 'golden path' loops for how to work on Fireworks changes
end-to-end." Canonicalised as
concepts/agent-orchestration-skill and
patterns/agent-orchestration-meta-skill — the meta-skill is
deliberately not a narrow tool-binding; it's a procedural
runbook the agent consults for multi-step workflows. A second
named example: a skill that "automates deploying, operating,
and tearing down isolated Fireworks dev shards on the shared AWS
scms Kubernetes cluster."
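The post doesn't show the skill file itself. As a purely hypothetical sketch of what a "golden path" meta-skill of this shape could contain (structure and wording assumed, not Atlassian's actual format), a procedural runbook the agent consults might read:

```markdown
# Fireworks development meta-skill (hypothetical sketch)

## Golden path: shipping a change
1. Branch; implement against the architecture notes.
2. Deploy to your isolated dev shard on the shared AWS scms cluster.
3. Write/extend e2e tests; run them against the shard; loop on failures.
4. Run !review-pr (adversarial sub-agent) over the diff.
5. Open the PR; read CI output (lint, vet, tests, Helm validation) and
   fix failures before requesting human review.

## Golden path: dev shard lifecycle
- Deploy, operate, and tear down an isolated shard (see the shard skill).
```

The design choice this illustrates is the one the post names: the skill is a set of end-to-end loops, not a binding to a single narrow tool.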
- Black-box validation over code reading. "If I need to verify,
I test outputs, not read code. Submit a job, check it boots in
100ms, verify migration preserves state, confirm network policy
blocks what it should. Black box validation." Canonicalised as
concepts/black-box-validation. The load-bearing reframing:
when you stop hand-writing code, reading the code is no longer
the cheapest validation path — "Treat code as a black box. If
you can comprehensively validate via inputs and outputs, you
often don't need to read the code and what it's doing." The
human's job becomes specifying observable invariants (boot
in 100ms, network policy blocks X, migration preserves state)
and writing tests that assert them.
- CI/CD is the automated quality gate. "CI pipeline as quality
gate: Every PR runs lint, vet, tests, and Helm validation. The
agent reads pipeline output and addresses failures before
requesting review." Canonicalised as
patterns/ci-as-agent-quality-gate — the agent is explicitly
inside the CI/CD loop, reading output and addressing failures
autonomously before a human review is requested. This composes
with the dev-shard loop (the first tier of testing) and the
adversarial review sub-agent (the pre-human correctness tier).
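The pipeline definition isn't shown in the post. A minimal Bitbucket Pipelines shape matching the quoted scope (lint, vet, tests, Helm validation) might look like the sketch below — step names, images, and chart path are assumptions:

```yaml
# Hypothetical bitbucket-pipelines.yml sketch; only the validation
# scope (lint, vet, tests, Helm) comes from the post.
pipelines:
  pull-requests:
    '**':
      - step:
          name: Lint & vet
          image: golang:1.22
          script:
            - test -z "$(gofmt -l .)"
            - go vet ./...
      - step:
          name: Tests
          image: golang:1.22
          script:
            - go test ./...
      - step:
          name: Helm validation
          image: alpine/helm:3
          script:
            - helm lint charts/fireworks
```

The agent's job, per the post, is to read this pipeline's output and address failures before a human review is ever requested.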
- Safety net shifts from review to architecture. "If you're
not hand-writing code, your safety net shifts: CI/CD pipelines
(automated quality gate), Sharding (limit the blast radius of
any single change), RBAC / JIT access (control who — and what —
can write), Progressive rollouts & canary deploys across
multiple clusters, AI-written e2e tests (primary validation
harness)." The five-lever safety net, canonicalised as
patterns/rbac-jit-as-agent-safety-net (access control lever)
and reinforcing concepts/blast-radius framing. "Main
deploys to dev without PRGB, so we can validate internally fast.
Production gets canary deploys across multiple clusters."
- Agentic teams, not just agentic individuals. "If you're
blocked on human review, your throughput is gated by the
slowest reviewer. Teams need to embrace AI-assisted reviews and
shift their attention to the high level [...] rather than
nitpicking details. The agents can handle the details." The
author explicitly frames this as a team-level constraint:
the individual's agent throughput is capped by the team's human
PR-review latency, so the team has to move to "a prod branching
model" where "main deploy[s] to dev without PRGB (Peer Review /
Green Build)", with Production behind canary guards.
Systems named
| System | Role |
| --- | --- |
| systems/atlassian-fireworks | The Firecracker-microVM orchestrator on Kubernetes; Atlassian's secure AI-agent execution engine. |
| systems/firecracker | The micro-VM monitor Fireworks uses; provides the hardware-isolation boundary per workload. |
| systems/kubernetes | The substrate Fireworks runs on; the "shared AWS scms Kubernetes cluster" is where dev shards live. |
| systems/envoy | Fireworks' ingress-layer proxy — named verbatim as "envoy ingress layers." |
| systems/bitbucket-pipelines | The CI pipeline Rovo Dev reads output from when reviewing PRs. |
| systems/rovo-dev | Atlassian's AI development agent surface — runs the agentic loop; built-in Bitbucket + Pipelines integration is called out as "genuinely great." |
| Pattern | One-liner |
| --- | --- |
| patterns/ai-writes-own-e2e-tests | The agent writes the end-to-end test suite, deploys to a real dev shard, runs tests, and loops on failures until green — the test suite becomes the primary correctness proof. |
| patterns/dev-shard-iteration-loop | Every feature ships to an isolated, real-cluster dev shard for e2e iteration, catching integration bugs that unit tests miss. |
| patterns/adversarial-review-subagent | A !review-pr prompt spawns an independent sub-agent with an adversarial prompt that reviews the main agent's PR before a human does. |
| patterns/agent-orchestration-meta-skill | A skill that encodes the golden-path loops for working on a specific codebase end-to-end, not a single narrow tool binding. |
| patterns/pre-human-agent-review | For bigger / scarier PRs, spin up an independent reviewer agent before a human even looks at it, so human review time is spent on architecture, not nitpicks. |
| patterns/three-workspace-parallel-agent-workflow | Three parallel checkouts, three branches, three agents; the human is a dispatcher reading thinking and queueing prompts. |
| patterns/ci-as-agent-quality-gate | The agent is inside the CI loop: it reads lint / vet / test / Helm-validation output and addresses failures autonomously before requesting human review. |
| patterns/rbac-jit-as-agent-safety-net | Shift the safety net from manual review to RBAC + JIT access control over who and what can write to production. |
Architectural sketch — Fireworks (as described)
Developer / LLM agent
│
│ submit(OCI image, command)
▼
┌───────────────────────┐
│ Fireworks API │ (Envoy ingress layers)
└───────────┬───────────┘
│
┌──────┴──────┐
│ Scheduler │◄────── Autoscaler
│ (Raft-backed│
│ persistence)│
└──────┬──────┘
│ place(VM)
▼
Kubernetes cluster (AWS scms shared)
┌─────┴─────┐
│ Node agent│ ─── runs per node
└─────┬─────┘
│ boot / snapshot-restore / migrate
▼
┌───────────────────┐
│ Firecracker µVM │ hardware isolation
│ (guest workload) │ 100ms warm start
└───────────────────┘
│
▼
eBPF network policy
(enforcement in-kernel)
Features surfaced at this layer:
• 100ms warm starts (snapshot-based fast restore)
• live migration between hosts (VM-level state transfer)
• eBPF network policy enforcement (in-kernel ingress/egress)
• shared volumes (cross-VM data plane)
• snapshot filesystem restore (fast clone)
• sidecar sandboxes (multi-VM co-location per workload)
Operational numbers
| Metric | Value | Source phrasing |
| --- | --- | --- |
| Warm start latency | 100 ms | "100ms warm starts" |
| Time to build the platform | 4 weeks | "built in four weeks, entirely by LLMs" |
| Parallel agents per developer | 3 | "Three workspaces, each checked out on a different branch, each with an agent working" |
| Deploy target for main (pre-prod) | dev | "main deploy[s] to dev without PRGB" |
| Production rollout shape | Canary across multiple clusters | "Production gets canary deploys across multiple clusters" |
| CI validation scope | lint, vet, tests, Helm validation | "Every PR runs lint, vet, tests, and Helm validation" |
Caveats
- The claim "entirely by LLMs" is builder-level, not audit-level.
The post doesn't disclose lines-of-code metrics, what fraction of
final code was human-edited, commit-author proportions, or
third-party / open-source code composition. A Firecracker
orchestrator inherits substantial upstream code
(systems/firecracker itself, k8s
client libraries, systems/envoy config, Raft libraries) that
nobody hand-writes. The headline is best read as "the first-party
integration & orchestration code in Fireworks was generated by
LLMs under the builder's supervision, in four weeks."
- Post is product PR and architecture. This is an Atlassian
Rovo Dev marketing post. It passes the
Tier-3 scope filter because it names a real production system
(Fireworks) with concrete features (100ms warm starts, live
migration, eBPF, Raft) and a real workflow (dev shards on the
shared AWS scms K8s cluster,
!review-pr sub-agents,
meta-workflow skills) — the architectural content crosses the 20%
threshold AGENTS.md uses for product launches.
- No numbers on agent success rate, loop iterations per PR, or
false-positive rate on adversarial review. Throughput metrics
("agents busy") and platform-level metrics (100ms warm start)
are disclosed; the agent-quality / loop-convergence metrics that
would let an outside reader reproduce the "4 weeks" headline
are not.
- Team shape is implicit, not disclosed. "Your team needs to
be agentic too" names a constraint; it doesn't say how large
the team is, or what fraction of the Fireworks codebase came
from the single author versus teammates working the same
workflow. The three-workspaces pattern describes one
developer's setup, not a fleet-level organisational norm.
- Tier-3 reliability note. The
Atlassian Engineering blog is Tier-3 on
the sysdesign-wiki; treat architectural claims as what the
builder says they shipped rather than independently verified
production behavior. The multi-year longitudinal track record of
Tier-1 sources (Netflix, AWS, Cloudflare) is not yet established
here.
Source