PATTERN
AI writes own e2e tests¶
Intent¶
Let the AI agent write the e2e test suite (not just the production code), deploy to a real test environment, run the tests, and loop on failures until green. Use the resulting test suite as the primary correctness proof; the human does not line-read production code.
Canonical articulation — Atlassian Fireworks, 2026-04-24:
"AI writes the e2e tests too. The agent writes tests, deploys to a dev shard, runs them, and loops on failures until they pass. The test suite is the primary proof that things work." (Source: sources/2026-04-24-atlassian-rovo-dev-driven-development)
Shape¶
```
[specify invariants] ──► [agent writes code + e2e tests]
                                    │
                                    ▼
                         [deploy to dev shard] ◄─────────────────┐
                                    │                            │
                                    ▼                            │
                            [run e2e tests]                      │
                           │              │                      │
                         pass           fail                     │
                           │              │                      │
                           │              ▼                      │
                           │    [agent reads failure,            │
                           │     patches code + test] ── loop ───┘
                           ▼
                [adversarial review sub-agent]
                           │
                           ▼
              [CI quality gate: lint/vet/Helm]
                           │
                           ▼
               [human: architecture review]
```
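The deploy-run-patch loop above can be sketched as a small harness. This is a minimal sketch, not the Atlassian implementation: `deploy`, `run_tests`, and `patch_from_failure` are hypothetical stand-ins for the real tooling, and the iteration budget is an assumption.

```python
MAX_ITERATIONS = 5  # assumed budget before escalating to a human


def agent_loop(deploy, run_tests, patch_from_failure):
    """Deploy, run the e2e suite, and let the agent patch on failure until green.

    deploy()                -> pushes the current build to the dev shard
    run_tests()             -> (passed: bool, failure_log: str)
    patch_from_failure(log) -> applies an agent-generated patch to code + tests

    All three callables are hypothetical stand-ins for real tooling.
    Returns the number of attempts it took to go green.
    """
    for attempt in range(1, MAX_ITERATIONS + 1):
        deploy()
        passed, log = run_tests()
        if passed:
            return attempt  # green: hand off to adversarial review
        patch_from_failure(log)
    raise RuntimeError("iteration budget exhausted; escalate to a human")
```

The budget cap matters: without it, an agent chasing a flaky test can loop indefinitely, which is also why the pattern breaks down when each e2e run is slow or expensive (see "When it doesn't fit").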
When it fits¶
- LLM-generated production code. If the human isn't reading the production code, the tests have to carry the correctness guarantee, and you're going to need more tests than a human would hand-write.
- Real dev-shard environment is available. The agent needs a real environment to deploy against — unit tests alone won't catch integration bugs. See patterns/dev-shard-iteration-loop.
- Observable invariants are specifiable. The invariants the tests assert should be externally observable, not internal implementation details — see concepts/black-box-validation.
When it doesn't fit¶
- Test infrastructure is expensive / slow. The loop depends on cheap re-runs; if each e2e test costs minutes and dollars, the agent's iterate-on-failures rhythm breaks down.
- Test oracle is unclear. If the invariant the tests should assert is not specifiable ("does this UI look right", "is this recommendation good"), the agent can't write a trustworthy test until a human articulates the invariant first.
- Safety-critical domains where test-suite-as-sole-correctness-proof is not acceptable. Aviation, medical devices, etc., where regulatory regimes require human audit of production code.
Composition with other patterns¶
| With | Effect |
|---|---|
| patterns/dev-shard-iteration-loop | Provides the real-cluster substrate the e2e tests deploy against |
| patterns/adversarial-review-subagent | Catches tautological / missing-coverage tests the main agent would not challenge itself on |
| patterns/ci-as-agent-quality-gate | Automated lint / vet / Helm gate between agent output and human review |
| patterns/pre-human-agent-review | The overarching three-tier review model; this pattern is its validation-harness corner |
Guardrails¶
- Human reads the tests, not the code. "If you're reading any code, read the tests." The test suite is the specification artifact the human holds the agent accountable to.
- Invariants are spec-level, not implementation-level. Tests assert "boots in 100 ms", "network policy blocks X", "state is preserved across migration" — not "calls function Y with argument Z".
- Tests run in real environments, not just mocks. A test that only exercises mocked collaborators can be made to pass by an agent that hallucinates both the code and the mocks consistently. e2e + dev shard + real cluster breaks that failure mode.
Seen in¶
- sources/2026-04-24-atlassian-rovo-dev-driven-development — canonical instance. The Fireworks team shipped a Firecracker-on-K8s platform in four weeks with AI-written e2e tests as the primary correctness proof.