
CONCEPT Cited by 1 source

AI writes own tests

Definition

The AI agent writes both the production code and the test suite that validates it — the agent produces the e2e tests, deploys to a real environment, runs the tests, and loops on failures until green. The test suite is the primary correctness proof, not a separate human-authored oracle.

Atlassian's Rovo Dev / Fireworks post is the canonical wiki articulation:

"AI writes the e2e tests too. The agent writes tests, deploys to a dev shard, runs them, and loops on failures until they pass. The test suite is the primary proof that things work." (Source: sources/2026-04-24-atlassian-rovo-dev-driven-development)

Why this is counterintuitive but works

The intuitive objection: if the LLM writes the tests, it will write tests that pass for the wrong reason — it will hallucinate both the code and the tests in a self-consistent but wrong way. This is a real risk. The post's stance on it:

"If you're reading any code, read the tests."

Three properties make it work in practice:

  1. The tests are the specification that gets human-reviewed. The human is not reviewing the production code line-by-line; they are reviewing the tests. The test suite is the specification the human is holding the agent accountable to.
  2. The tests run in a real environment. e2e tests that deploy to a real dev shard and exercise the real integration surface can't pass on a fabricated integration (the integration doesn't exist to mock). See patterns/dev-shard-iteration-loop.
  3. The loop converges on passing tests against observable invariants. See concepts/black-box-validation — if the invariant is "boots in 100 ms" or "network policy blocks X", the test can't pass without actually observing that outcome.
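Property 3 can be made concrete. The following is a sketch of a black-box invariant check for the "boots in 100 ms" example; `boot_service` is a hypothetical callable that starts the service and returns once it is ready. Because the check only observes wall-clock time, it cannot pass on a fabricated boot.

```python
import time

def check_boot_invariant(boot_service, budget_ms=100.0):
    """Black-box check of the 'boots in 100 ms' invariant.

    Measures only the externally observable outcome (elapsed wall-clock
    time), so there is nothing internal for a generated test to assert
    against itself."""
    start = time.perf_counter()
    boot_service()                       # hypothetical: blocks until ready
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms <= budget_ms
```

A test built on this check passes only if the deployed service actually boots within budget.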

The division of labour

| Artifact | Who writes it | Who reviews it |
| --- | --- | --- |
| Production code | AI agent | AI adversarial sub-agent, CI quality gate, architecture review |
| e2e test suite | AI agent | Human (primary), AI adversarial sub-agent |
| Observable invariants ("spec") | Human | — |

The human writes nothing the agent could write; the agent writes everything the human specified. The handoff is at the invariant level, not the code level.
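One way to picture the invariant-level handoff is as data: the human authors a list of observable invariants, and the agent expands each one into tests. `HUMAN_SPEC` and `generate_tests` below are illustrative stand-ins, not anything from the source.

```python
# Hypothetical invariant spec: the only artifact the human authors.
# Each entry names an externally observable outcome, never code structure.
HUMAN_SPEC = [
    {"id": "boot-latency", "invariant": "service boots in under 100 ms"},
    {"id": "netpol",       "invariant": "network policy blocks egress to X"},
]

def handoff(spec, generate_tests):
    """The agent side of the handoff: generate_tests (a stand-in for an
    LLM call) expands each human-written invariant into e2e tests."""
    return {item["id"]: generate_tests(item["invariant"]) for item in spec}
```

The human never touches the generated tests' internals; review happens against the invariant each test claims to cover.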

Relation to automated test generation from production traffic

A Zalando instance ([[concepts/automated-test-generation-from-production-traffic]]) generates tests by capturing real production inputs and replaying them. The Rovo Dev instance generates tests by prompting an LLM with the spec and asking it to propose tests. Both are instances of "don't hand-write tests," but the input substrate differs:

| Source | Input substrate | Tradeoff |
| --- | --- | --- |
| Production-traffic replay | Real observed inputs | High realism, but only for code paths that exist in prod |
| AI-written e2e tests | LLM proposals from the spec | Covers new features before any prod traffic, but inputs must be validated against the spec |

The two are complementary, not competing — a mature agentic pipeline can use both, with replay covering regression and AI-written tests covering greenfield features.

Failure modes

  • Tautological tests. The agent writes a test that is effectively assert(function() == function()), i.e. asserts the code against itself. Human review of the tests catches this. Spec-level invariants ("boots in 100 ms") don't admit this failure mode: a test can't be tautological if the invariant is externally observable.
  • Missing-coverage gaps. Agent writes tests for the happy path but misses error cases. Adversarial review sub-agent (concepts/adversarial-review-persona) is the designed mitigation — "have an adversarial persona subagent that ... reviews what the main agent has written."
  • Integration gaps. Unit-level tests only. Mitigated by the e2e / dev-shard requirement — every feature must have tests that run against a real cluster shard, not just a mocked integration.
