PATTERN Cited by 1 source

AI writes own e2e tests

Intent

Let the AI agent write the e2e test suite (not just the production code), deploy to a real test environment, run the tests, and loop on failures until green. Use the resulting test suite as the primary correctness proof; the human does not line-read production code.

Canonical articulation — Atlassian Fireworks, 2026-04-24:

"AI writes the e2e tests too. The agent writes tests, deploys to a dev shard, runs them, and loops on failures until they pass. The test suite is the primary proof that things work." (Source: sources/2026-04-24-atlassian-rovo-dev-driven-development)

Shape

  [specify invariants] ──► [agent writes code + e2e tests]
                          [deploy to dev shard]
                          [run e2e tests]
                             │        │
                          pass       fail
                             │        │
                             │        ▼
                             │   [agent reads failure, patches code + test]
                             │        │
                             │        └── loop ──┐
                             ▼                   │
                   [adversarial review sub-agent]
                   [CI quality gate: lint/vet/Helm]
                   [human: architecture review]

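The loop in the diagram can be sketched as control flow. This is a minimal illustration under stated assumptions, not any vendor's implementation: `StubAgent`, `StubShard`, and every method name here are hypothetical stand-ins for real agent and infrastructure calls.

```python
# Hypothetical sketch of the write -> deploy -> test -> patch loop.
# The stub classes stand in for a real agent and a real dev shard.

class StubAgent:
    """Toy agent: 'fixes' one bug per patch round."""
    def write_code_and_tests(self, spec):
        return {"spec": spec, "bugs": 2}              # code + e2e tests, with bugs
    def patch(self, change, failure_output):
        return {**change, "bugs": change["bugs"] - 1}

class StubShard:
    """Toy dev shard: tests pass only when no bugs remain."""
    def deploy(self, change):
        pass                                          # real version: helm upgrade, etc.
    def run_e2e(self, change):
        ok = change["bugs"] == 0
        return ok, "" if ok else f"{change['bugs']} failing e2e tests"

def agent_loop(spec, agent, shard, max_iterations=10):
    change = agent.write_code_and_tests(spec)         # agent writes the tests too
    for attempt in range(1, max_iterations + 1):
        shard.deploy(change)                          # real environment, not mocks
        passed, failure = shard.run_e2e(change)
        if passed:
            return change, attempt                    # green: hand off to review tiers
        change = agent.patch(change, failure)         # loop on the concrete failure
    raise RuntimeError("e2e suite still red after max iterations")

change, attempts = agent_loop("state preserved across migration", StubAgent(), StubShard())
print(attempts)  # → 3
```

The point of the sketch is the termination condition: the loop exits only on a green suite or an iteration budget, never on the agent's own claim that the code works.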
When it fits

  • LLM-generated production code. If the human isn't reading the production code, the tests have to carry the correctness guarantee, and you're going to need more tests than a human would hand-write.
  • Real dev-shard environment is available. The agent needs a real environment to deploy against — unit tests alone won't catch integration bugs. See patterns/dev-shard-iteration-loop.
  • Observable invariants are specifiable. The invariants the tests assert should be externally observable, not internal implementation details — see concepts/black-box-validation.

When it doesn't fit

  • Test infrastructure is expensive / slow. The loop depends on cheap re-runs; if each e2e test costs minutes and dollars, the agent's iterate-on-failures rhythm breaks down.
  • Test oracle is unclear. If the invariant the tests should assert is not specifiable ("does this UI look right", "is this recommendation good"), the agent cannot write a meaningful test until a human articulates the invariant first.
  • Safety-critical domains where test-suite-as-sole-correctness-proof is not acceptable. Aviation, medical devices, etc., where regulatory regimes require human audit of production code.

Composition with other patterns

With                                      Effect
patterns/dev-shard-iteration-loop         Provides the real-cluster substrate the e2e tests deploy against
patterns/adversarial-review-subagent      Catches tautological or missing-coverage tests the main agent would not challenge itself on
patterns/ci-as-agent-quality-gate         Automated lint / vet / Helm gate between agent output and human review
patterns/pre-human-agent-review           The overarching three-tier review model; this pattern is its validation-harness corner

Guardrails

  1. Human reads the tests, not the code. "If you're reading any code, read the tests." The test suite is the specification artifact the human holds the agent accountable to.
  2. Invariants are spec-level, not implementation-level. Tests assert "boots in 100 ms", "network policy blocks X", "state is preserved across migration" — not "calls function Y with argument Z".
  3. Tests run in real environments, not just mocks. A test that only exercises mocked collaborators can be made to pass by an agent that hallucinates both the code and the mocks consistently. e2e + dev shard + real cluster breaks that failure mode.

Seen in

sources/2026-04-24-atlassian-rovo-dev-driven-development
