PATTERN Cited by 2 sources
Test-case generation from spec¶
Pattern¶
Given a formal specification (typically in TLA+ or a model-checkable
spec language), enumerate its behaviours using the model-checker
and emit one unit test per behaviour that drives the production
implementation through that exact transition sequence while asserting
impl state matches spec state at each step. A test that the impl fails
to follow means Bspec ⊄ Bimpl — the impl is incomplete or buggy —
and pair-wise with trace-checking this
gives bisimulation-level conformance.
This is the direction of conformance checking that starts with the spec.
The mechanics¶
From the MongoDB 2020 Mobile SDK experiment (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs):
- Translate the algorithm to TLA+. Max Hirschhorn manually translated the ~1000-line C++ array-merge portion of the Mobile SDK's OT (19 operation types; 6 array operations; 21 merge rules) to TLA+ in 2 weeks. The model-checker caught translation mistakes quickly.
- Verify the spec model-checks. Use TLC
to check that the spec itself is consistent (all invariants hold
under all behaviours). This is the step that caught the real bug
in MongoDB's case — element-swap + element-move conflict triggered
infinite recursion, crashing TLC with a Java
StackOverflowError. Fix: deprecate the operation. - Constrain the state space so it is finite and enumerable. The Mobile SDK experiment: 3 participants × 3-element array × 1 write op each → 30,184 states / 4,913 behaviours.
- Export the state graph. Ask TLC to emit every behaviour (path through the state graph) in a machine-parseable format.
- Emit one test per behaviour. A generator program parses the state graph and emits a unit test in the production language for each behaviour. The Mobile SDK experiment used a Go program to produce C++ tests.
- Run all tests as part of the production test suite. Each test forces the impl through a specific sequence of transitions + asserts state matches spec state. A failing test means the impl doesn't realise that spec behaviour.
Coverage result¶
From the MongoDB 2020 case:
| Test source | Branch coverage of the OT impl |
|---|---|
| Handwritten unit tests | 21% |
| Millions of AFL-fuzzer runs | 92% |
| 4,913 generated tests from TLA+ spec | 100% |
The generated tests completely covered the implementation's branch space at a tiny fraction of the fuzzer cost (no random-driven exploration, every test is precisely targeted). This is the conformance-checking payoff: exhaustive algorithmic coverage derived from the spec.
Preconditions¶
1. The spec's state space must be constrainable to a finite DAG¶
The MongoDB case constrained the algorithm to 3 participants × 3 elements × 1 write each — small enough for TLC to enumerate in reasonable time. Less-constrained configurations explode combinatorially.
Implication: this pattern fits narrow algorithmic components (conflict resolution, single-node consensus transitions, protocol state machines) where a small constraint reveals the full logical structure. It doesn't fit whole-system distributed runtimes where the state space is open-ended.
2. The impl must be drivable to follow a specific behaviour¶
Generating the test isn't enough — you have to make the impl actually follow the transition sequence. Sources of nondeterminism the test has to squash:
- Scheduling / threading — if the impl runs multiple threads, some deterministic-execution mechanism is required. Mocket instruments Java for this; SandTable overrides syscalls; the Mobile SDK's OT was a sequential algorithm so no squash needed.
- Network / I/O ordering — mock out at the test harness.
- Random choices — seed explicitly.
- Time — mock the clock.
When the algorithm is sequential + single-threaded (OT merge rules), driving it is cheap. When the algorithm spans a distributed runtime (MongoDB server consensus), driving it becomes a project of its own — which is one reason MongoDB's server-side attempt pivoted to trace-checking instead.
3. Spec variables map cleanly to impl state¶
The generated test asserts impl state matches spec state at each step. Writing the assertion requires a mapping from TLA+ state variables to impl variables. When the impl has a direct correspondence (OT merge-rule state = C++ struct fields), this is trivial. When the impl uses multi-level caching / indirections / derived state, writing the mapping is a non-trivial exercise.
Real bugs this finds¶
MongoDB's Mobile SDK case found two classes of bug:
- Spec-side bug — TLC model-checker crashed with infinite recursion on one conflict case. Verified the same bug existed in the C++. Neither unit tests nor the AFL fuzzer had found it in years of production. Fixed by deprecating the operation in both spec and impl.
- Impl-side bugs — any test case where the generated input drives the impl to a state differing from the spec's state. None are named specifically in the 2025 retrospective — implying the Mobile SDK's OT was already correct at the ~1000-line level; the value of the test suite is continuous conformance proof as the impl evolves, not a one-time bug hunt.
When test-case generation wins¶
- Narrow, algorithmic component — conflict resolution, rate limiter, deduplicator, protocol state machine. 100-few thousand lines of impl code.
- Multiple implementations of the same algorithm — the MongoDB case had C++ client + Go server; test-case generation verifies both against the single TLA+ source of truth. Catches divergence.
- Finite-state-space algorithm. Short loops, bounded input shapes, small number of operation types.
- Deterministic execution possible — or the impl is sequential.
When it doesn't fit¶
- Whole-system scenarios — distributed runtimes with unbounded state. patterns/trace-checking or concepts/deterministic-simulation are better fits.
- Impl uses production-only side-effects — file I/O, network, hardware timing — that can't cheaply be driven from a generated test.
- Spec is highly abstract — the mapping to impl state is more work than writing the tests by hand.
Relation to neighbours¶
- patterns/trace-checking — the dual-direction technique; runs the impl against the spec instead of vice versa. Strengths and weaknesses are complementary. Use both for bisimulation.
- patterns/property-based-testing — generates inputs from typed generators, not from an external spec. When the property is the spec, property tests and test-case generation converge; when the spec is more structured than a Boolean predicate, spec generation wins coverage.
- concepts/lightweight-formal-verification — the executable spec + property test style (AWS ShardStore). Skips the TLA+ enumeration step entirely; the executable spec is the oracle.
- systems/tlc-model-checker — the enumerator. The GitHub issue #1073 tracks TLC improvements for exporting state graphs specifically for test-case generation.
2025 state of the art¶
Davis's five-year survey (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs) names four production-scale test-case-generation systems since 2020:
- Mocket — Java-specific; generates tests from TLA+ + instruments Java for deterministic execution + checks variable equality at each action.
- Multi-Grained Specifications for Zookeeper — Schultz + Demirbas; new multi-level TLA+ specs of Zookeeper; Java instrumentation for variable mapping; test framework randomly explores spec behaviours while impl follows.
- SandTable — language-agnostic (overrides syscalls); samples spec state space for coverage + diversity + behaviour-length; emphasizes writing new specs matching the implementation.
- VLDB 2025 MongoDB transactions paper — Schultz + Demirbas again; test-case generation against a new TLA+ spec of MongoDB's WiredTiger storage layer.
Common thread: write new specs designed to match the implementation. Old abstract specs don't align well enough for test-case generation to work — the same lesson patterns/trace-checking taught.
Seen in¶
- sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs — canonical industry case: MongoDB Mobile SDK's OT merge rules; 2-week TLA+ translation; 3×3×1 state-space constraint → 4,913 behaviours → 4,913 generated C++ tests → 100% branch coverage; found one real infinite-recursion bug via TLC crash; five-year academic survey covering Mocket, SandTable, Multi-Grained Zookeeper, VLDB 2025 MongoDB transactions.
- sources/2026-02-27-mongodb-towards-model-based-verification-of-a-key-value-storage-engine —
second production MongoDB instance of this pattern, now at the
storage-engine layer of the core server rather than a narrow
algorithmic component. The team's
compositional cross-shard
transactions TLA+ spec extracts its interface to the
WiredTiger storage engine as a standalone
Storage.tlamodule; a modified TLC enumeratesStorage.tla's complete reachable state graph for a finite configuration; the team computes path coverings over that graph and emits one test case per path as a sequence of WiredTiger API calls. Scale: 2 keys × 2 transactions → 87,143 tests generated + executed against WiredTiger in ~40 minutes — an order of magnitude larger generated-suite than the 2020 OT case. Roadmap calls out approximate-coverage strategies (randomized path sampling) as the next scaling lever. Specs + generator at mongodb-labs/vldb25-dist-txns; VLDB paper p5045-schultz.pdf. Second contribution: demonstrates the pattern works on a storage-engine contract, not just a sequential single-threaded algorithm — the key was extracting a narrow, state-driven, directly API-implementable interface boundary as its own spec module.