PATTERN Cited by 1 source

Executable specification¶

Pattern¶

Write a compact model of the system under test, in the same language as the production implementation, commit it to the same repo, and continuously validate that the production code's behavior matches the model's — typically via property-based testing.

The spec: - Is executable (runs, not just a document). - Drops all real-world complexity (hardware, layout, concurrency specifics that don't affect logical correctness). - Is typically ~1% the size of the production system. - Is maintainable by the same engineers who maintain production — no separate formal-methods skill required.

The ShardStore instantiation (Warfield, 2025)¶

We built a simplified model of ShardStore's logic, (also in Rust), and checked into the same repository alongside the real production ShardStore implementation. This model dropped all the complexity of the actual on-disk storage layers and hard drives, and instead acted as a compact but executable specification. It wound up being about 1% of the size of the real system, but allowed us to perform testing at a level that would have been completely impractical to do against a hard drive with 120 available IOPS.

From here, we've been able to build tools and use existing techniques, like property-based testing, to generate test cases that verify that the behaviour of the implementation matches that of the specification.

See systems/shardstore for the system; concepts/lightweight-formal-verification for the conceptual placement.

Why "in the same language"¶

Review overhead collapses. Production engineers can read, critique, and edit the spec without context-switching to TLA+, Coq, etc.
Continuous integration is trivial. The spec is just another Cargo / cargo test target; it runs on every commit.
Drift is visible. If implementation and spec disagree about a behavior, the property test fails and the diff is in one language.
The spec is a contract. A non-specialist can look at the spec and say "is this what ShardStore is supposed to do?" and answer the question.

Why ~1% of the size¶

The spec strips out everything that is incidental to logical correctness: - On-disk layouts and serialization. - Hardware-specific timings and failure modes. - Concurrency primitives (beyond what logical semantics demand). - Optimizations (caching, batching).

What remains is the pure logical behavior of the component — that's what property-based testing exercises against.

Property-based testing as the validator¶

Rather than hand-write test cases, the team writes invariants ("this operation followed by that one always leaves the store in a consistent state") and lets a generator explore state space automatically. The spec serves as the oracle: for any generated scenario, production and spec should agree.

This is where lightweight formal verification differs from classical formal methods: the coverage is empirical, up to the generator's budget, rather than proven. In return, the approach scales to production team size and pace.

When to reach for this¶

The component's correctness is durability-critical (cost of bugs is very high).
The component is narrow enough that ~1% of the code can capture its logic (e.g., a storage node, a consensus engine, a transaction log).
Real-hardware testing is throughput-bounded (a 120-IOPS drive; a disk with a rare-corruption failure mode).
The team is ready to own the spec as first-class code, not just a paper document.

When not to reach for this¶

The component is mostly glue or integration — there's no compact logical model to write.
The correctness property is already trivially enforced by the type system or a single constraint.
The team cannot commit to maintaining the spec alongside code (spec drift makes this pattern actively harmful).

Seen in¶

sources/2025-02-25-allthingsdistributed-building-and-operating-s3 — ShardStore as the canonical example; SOSP paper as the academic companion.