
PATTERN Cited by 2 sources

Test-case generation from spec

Pattern

Given a formal specification (typically in TLA+ or another model-checkable spec language), enumerate its behaviours with the model checker and emit one unit test per behaviour. Each test drives the production implementation through that exact transition sequence, asserting at every step that impl state matches spec state. A test the impl fails to follow means Bspec ⊄ Bimpl — the impl is incomplete or buggy — and paired with trace-checking this gives bisimulation-level conformance.

This is the direction of conformance checking that starts with the spec.

The mechanics

From the MongoDB 2020 Mobile SDK experiment (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs):

  1. Translate the algorithm to TLA+. Max Hirschhorn manually translated the ~1,000-line C++ array-merge portion of the Mobile SDK's operational transformation (OT) — 19 operation types, 6 array operations, 21 merge rules — to TLA+ in 2 weeks. The model checker caught translation mistakes quickly.
  2. Verify the spec model-checks. Use TLC to check that the spec itself is consistent (all invariants hold under all behaviours). This is the step that caught the real bug in MongoDB's case — element-swap + element-move conflict triggered infinite recursion, crashing TLC with a Java StackOverflowError. Fix: deprecate the operation.
  3. Constrain the state space so it is finite and enumerable. The Mobile SDK experiment: 3 participants × 3-element array × 1 write op each → 30,184 states / 4,913 behaviours.
  4. Export the state graph. Ask TLC to emit every behaviour (path through the state graph) in a machine-parseable format.
  5. Emit one test per behaviour. A generator program parses the state graph and emits a unit test in the production language for each behaviour. The Mobile SDK experiment used a Go program to produce C++ tests.
  6. Run all tests as part of the production test suite. Each test forces the impl through a specific sequence of transitions + asserts state matches spec state. A failing test means the impl doesn't realise that spec behaviour.

Coverage result

From the MongoDB 2020 case:

| Test source | Branch coverage of the OT impl |
| --- | --- |
| Handwritten unit tests | 21% |
| Millions of AFL-fuzzer runs | 92% |
| 4,913 generated tests from TLA+ spec | 100% |

The generated tests completely covered the implementation's branch space at a tiny fraction of the fuzzer's cost: no random exploration — every test is precisely targeted. This is the conformance-checking payoff: exhaustive algorithmic coverage derived from the spec.

Preconditions

1. The spec's state space must be constrainable to a finite DAG

The MongoDB case constrained the algorithm to 3 participants × 3 elements × 1 write each — small enough for TLC to enumerate in reasonable time. Less-constrained configurations explode combinatorially.

Implication: this pattern fits narrow algorithmic components (conflict resolution, single-node consensus transitions, protocol state machines) where a small constraint reveals the full logical structure. It doesn't fit whole-system distributed runtimes where the state space is open-ended.

2. The impl must be drivable to follow a specific behaviour

Generating the test isn't enough — you have to make the impl actually follow the transition sequence. Sources of nondeterminism the test has to squash:

  • Scheduling / threading — if the impl runs multiple threads, some deterministic-execution mechanism is required. Mocket instruments Java for this; SandTable overrides syscalls; the Mobile SDK's OT was a sequential algorithm so no squash needed.
  • Network / I/O ordering — mock out at the test harness.
  • Random choices — seed explicitly.
  • Time — mock the clock.

When the algorithm is sequential + single-threaded (OT merge rules), driving it is cheap. When the algorithm spans a distributed runtime (MongoDB server consensus), driving it becomes a project of its own — which is one reason MongoDB's server-side attempt pivoted to trace-checking instead.
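A minimal sketch of squashing the random-choice and time sources above in a test harness. `Dedup` is a hypothetical impl whose behaviour depends on randomness and a clock; the point is that both are injected, so a generated test can pin them and the impl has only one possible execution.

```python
import random

class FakeClock:
    """Deterministic mock clock: each call advances by a fixed step."""
    def __init__(self, start=0.0, step=1.0):
        self.now, self.step = start, step
    def time(self):
        t = self.now
        self.now += self.step
        return t

class Dedup:
    """Hypothetical impl under test; rng and clock are injected."""
    def __init__(self, rng, clock):
        self.rng, self.clock, self.seen = rng, clock, {}
    def offer(self, item):
        # Timestamp and random choice both come from injected sources,
        # so a replay with the same seed/clock is bit-identical.
        ts = self.clock.time()
        if item not in self.seen:
            self.seen[item] = (ts, self.rng.random())
            return True
        return False

def run():
    # Explicit seed + mocked clock: the two nondeterminism sources squashed.
    d = Dedup(random.Random(42), FakeClock())
    return [d.offer(x) for x in ["a", "b", "a"]], d.seen

# Two runs produce identical results and identical internal state.
r1, r2 = run(), run()
assert r1 == r2
```

Scheduling and I/O ordering need heavier machinery (instrumentation, syscall interception) precisely because they cannot be injected this cheaply.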

3. Spec variables map cleanly to impl state

The generated test asserts impl state matches spec state at each step. Writing the assertion requires a mapping from TLA+ state variables to impl variables. When the impl has a direct correspondence (OT merge-rule state = C++ struct fields), this is trivial. When the impl uses multi-level caching / indirections / derived state, writing the mapping is a non-trivial exercise.
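In the direct-correspondence case, the mapping is just a projection of impl fields onto spec variable names. A sketch under assumed names — `MergeImpl`, `elements`, `pending_ops`, and the spec variables `arr` and `queue` are all illustrative, not from the MongoDB code:

```python
class MergeImpl:
    """Hypothetical impl whose fields correspond 1:1 to spec variables."""
    def __init__(self):
        self.elements = []      # corresponds to spec variable `arr`
        self.pending_ops = []   # corresponds to spec variable `queue`

def as_spec_state(impl: MergeImpl) -> dict:
    """Project impl state onto the spec's variable names.
    With direct correspondence this is a field-by-field copy; with
    caches, indirections, or derived state it would have to recompute
    the abstract view, which is where the mapping gets non-trivial."""
    return {"arr": list(impl.elements), "queue": list(impl.pending_ops)}

impl = MergeImpl()
impl.elements = [1, 2, 3]
# The generated test's per-step assertion then compares this projection
# against the spec state exported by the model checker.
assert as_spec_state(impl) == {"arr": [1, 2, 3], "queue": []}
```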

Real bugs this finds

MongoDB's Mobile SDK case found two classes of bug:

  1. Spec-side bug — TLC model-checker crashed with infinite recursion on one conflict case. Verified the same bug existed in the C++. Neither unit tests nor the AFL fuzzer had found it in years of production. Fixed by deprecating the operation in both spec and impl.
  2. Impl-side bugs — any test case where the generated input drives the impl to a state differing from the spec's state. None are named specifically in the 2025 retrospective — implying the Mobile SDK's OT was already correct at the ~1000-line level; the value of the test suite is continuous conformance proof as the impl evolves, not a one-time bug hunt.

When test-case generation wins

  • Narrow, algorithmic component — conflict resolution, rate limiter, deduplicator, protocol state machine; roughly a hundred to a few thousand lines of impl code.
  • Multiple implementations of the same algorithm — the MongoDB case had C++ client + Go server; test-case generation verifies both against the single TLA+ source of truth. Catches divergence.
  • Finite-state-space algorithm. Short loops, bounded input shapes, small number of operation types.
  • Deterministic execution possible — or the impl is sequential.

When it doesn't fit

  • Whole-system scenarios — distributed runtimes with unbounded state. patterns/trace-checking or concepts/deterministic-simulation are better fits.
  • Impl uses production-only side-effects — file I/O, network, hardware timing — that can't cheaply be driven from a generated test.
  • Spec is highly abstract — the mapping to impl state is more work than writing the tests by hand.

Relation to neighbours

  • patterns/trace-checking — the dual-direction technique; runs the impl against the spec instead of vice versa. Strengths and weaknesses are complementary. Use both for bisimulation.
  • patterns/property-based-testing — generates inputs from typed generators, not from an external spec. When the property is the spec, property tests and test-case generation converge; when the spec is more structured than a Boolean predicate, spec generation wins coverage.
  • concepts/lightweight-formal-verification — the executable spec + property test style (AWS ShardStore). Skips the TLA+ enumeration step entirely; the executable spec is the oracle.
  • systems/tlc-model-checker — the enumerator. GitHub issue #1073 tracks TLC improvements for exporting state graphs specifically for test-case generation.

2025 state of the art

Davis's five-year survey (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs) names four production-scale test-case-generation systems since 2020:

  • Mocket — Java-specific; generates tests from TLA+ + instruments Java for deterministic execution + checks variable equality at each action.
  • Multi-Grained Specifications for Zookeeper — Schultz + Demirbas; new multi-level TLA+ specs of Zookeeper; Java instrumentation for variable mapping; test framework randomly explores spec behaviours while impl follows.
  • SandTable — language-agnostic (overrides syscalls); samples spec state space for coverage + diversity + behaviour-length; emphasizes writing new specs matching the implementation.
  • VLDB 2025 MongoDB transactions paper — Schultz + Demirbas again; test-case generation against a new TLA+ spec of MongoDB's WiredTiger storage layer.

Common thread: write new specs designed to match the implementation. Old abstract specs don't align well enough for test-case generation to work — the same lesson patterns/trace-checking taught.

Seen in

  • sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs — canonical industry case: MongoDB Mobile SDK's OT merge rules; 2-week TLA+ translation; 3×3×1 state-space constraint → 4,913 behaviours → 4,913 generated C++ tests → 100% branch coverage; found one real infinite-recursion bug via TLC crash; five-year academic survey covering Mocket, SandTable, Multi-Grained Zookeeper, VLDB 2025 MongoDB transactions.
  • sources/2026-02-27-mongodb-towards-model-based-verification-of-a-key-value-storage-engine — second production MongoDB instance of this pattern, now at the storage-engine layer of the core server rather than a narrow algorithmic component. The team's compositional cross-shard transactions TLA+ spec extracts its interface to the WiredTiger storage engine as a standalone Storage.tla module; a modified TLC enumerates Storage.tla's complete reachable state graph for a finite configuration; the team computes path coverings over that graph and emits one test case per path as a sequence of WiredTiger API calls. Scale: 2 keys × 2 transactions → 87,143 tests generated and executed against WiredTiger in ~40 minutes — a generated suite an order of magnitude larger than the 2020 OT case's. Roadmap calls out approximate-coverage strategies (randomized path sampling) as the next scaling lever. Specs + generator at mongodb-labs/vldb25-dist-txns; VLDB paper p5045-schultz.pdf. Second contribution: demonstrates the pattern works on a storage-engine contract, not just a sequential single-threaded algorithm — the key was extracting a narrow, state-driven, directly API-implementable interface boundary as its own spec module.