Black-box validation

Definition

Validate a system by its observable inputs and outputs, not by reading its implementation. Submit a known input, check the output against expected behaviour, and treat agreement / disagreement as your sole correctness signal. Internals are deliberately opaque.

Atlassian's Rovo Dev / Fireworks post is the canonical wiki articulation:

"If I need to verify, I test outputs, not read code. Submit a job, check it boots in 100ms, verify migration preserves state, confirm network policy blocks what it should. Black box validation." (Source: sources/2026-04-24-atlassian-rovo-dev-driven-development)

And, as a design principle:

"Treat code as a black box. If you can comprehensively validate via inputs and outputs, you often don't need to read the code and what it's doing."

Why it matters now

Black-box validation is not new — it is the foundation of integration testing and e2e testing as practiced for decades. What is new is the reframing as the primary validation path when LLMs write the code. If no human is reading the generated code line-by-line, then code review is no longer the cheap correctness proof it used to be. The test suite has to carry that load alone.

The architectural consequence: the work that used to go into reading and reviewing code has to move to specifying observable invariants the system must satisfy, and writing tests that assert them.

The specification shifts up

The human's responsibility moves from "does this code look right" to "what are the invariants this system must hold":

Invariant category                 | Fireworks example
Latency / performance              | "boots in 100 ms"
State correctness under disruption | "migration preserves state"
Security boundary                  | "network policy blocks what it should"

Each invariant is expressed as a testable proposition. The agent writes the test; the test asserts the invariant; the test passing is the correctness proof.

Three properties of a well-specified black-box invariant

  1. Observable externally. If asserting the invariant requires inspecting internal state, it fails the black-box criterion. Invariants should be checkable by a test harness that sees only the same surface a client of the system would see.
  2. Binary signal. The invariant either holds or does not. "Migration preserves state" passes or fails — no interpretation step required of the human reviewer.
  3. Named in specification, not implementation. "100 ms warm start" is a spec-level commitment. "Calls getBootTime() and subtracts" is an implementation detail. The invariant is in the former language.

Relation to the agentic development loop

Black-box validation is the correctness tier of the agentic development loop. The loop needs unambiguous feedback to ground hallucination correction — black-box validation provides it: "I ran the test, the output was X, the expected was Y, therefore the current code is wrong." Reading code for an LLM-written system is both slower (the human has to load it into working memory) and weaker (the LLM may have written code that looks right but does the wrong thing). The observed output cannot lie.

The counter-argument the post doesn't dodge

"If you're reading any code, read the tests."

The post acknowledges the obvious failure mode: if the LLM writes both the code and the tests, and the tests are wrong, the loop converges on broken code that passes broken tests. The proposed discipline: the tests are the one thing the human should be paying attention to — precisely because they are the embodied specification. Reading the production code directly is lower priority than reading the test that asserts behaviour.

This is a non-trivial process commitment. It is not a property of LLMs; it is a property of how the team uses the LLMs. See patterns/ai-writes-own-e2e-tests for the full test-as-primary-harness pattern.

Not the same as

  • Unit-test-only verification. Unit tests are often not black-box; they frequently inspect internal state, mock collaborators, or assert implementation details. Black-box validation typically lives at the integration / e2e tier. See concepts/test-pyramid.
  • Synthetic monitoring. A similar-feeling practice, but continuous-in-production rather than pre-merge. See patterns/e2e-test-as-synthetic-probe for the synthetic-probe extension of the same idea.
  • Code review being replaced entirely. The post still has human code review — just shifted "to the high level: architecture, design intent, risk", not line-by-line reading.

Seen in

  • sources/2026-04-24-atlassian-rovo-dev-driven-development — canonical articulation. "I test outputs, not read code." The shift from manual review to black-box validation is the load-bearing process claim that makes "entirely by LLMs in four weeks" plausible at the builder's level.