CONCEPT Cited by 2 sources
Conformance checking¶
Definition¶
Conformance checking is the problem of mechanically verifying that a production implementation behaves as a formal specification (TLA+, Alloy, Event-B, or other) says it should — and continuing to verify this as both evolve. It is the keep-them-in-sync half of specification-driven development; the other half is writing the spec.
Formally, if Bspec is the set of behaviours the spec permits and Bimpl
is the set the implementation actually exhibits, then:
- Refinement:
Bimpl ⊆ Bspec— the implementation never does anything the spec forbids. - Bisimulation:
Bimpl = Bspec— the implementation realises every behaviour the spec permits, and vice versa.
Refinement is the minimum for a correct implementation; bisimulation is stronger and sometimes desirable. Conformance checking mechanises either direction (or both).
The core problem¶
A TLA+ spec that the model-checker accepts is not automatically a spec the implementation obeys. Three failure modes dominate (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs):
- The spec was written after the implementation. Abstractions chosen at spec-time may be structurally incompatible with what the code already does. MongoDB's RaftMongo.tla modeled leader handoff as one atomic state transition; the real server steps the old leader down before stepping the new leader up. The gap was known; trying to paper over it in the trace-translator "never worked".
- The implementation drifts. Even if spec + impl agreed at day 1, normal development pushes them apart unless conformance is checked continuously (every commit, not once a year).
- The mechanical checking is expensive. Each direction (trace-check impl-into-spec, generate-tests-from-spec-and-run-against-impl) requires instrumentation, translation glue, and CI integration that does not generalise easily across different specs.
Two complementary techniques¶
Given a TLA+ spec (a state machine), conformance can be checked in either direction:
Trace-checking¶
Starting from the implementation:
- Instrument the implementation to log every state transition.
- Collect execution traces from real / randomized / stress tests.
- Translate each trace into TLA+ (typically: a specialised
Trace.tlawith a sequence of states + an invariant "this behaviour conforms to Spec.tla"). - Run TLC to verify the trace is a valid behaviour of the spec.
If TLC rejects a trace, Bimpl ⊄ Bspec → either the impl is buggy or
the spec is wrong. Following Ron Pressler's technique
is the canonical way to encode the trace-as-TLA+-spec.
Strengths: uses real (or fuzz-driven) executions; no need to enumerate the spec's full state space.
Weaknesses: snapshotting a multithreaded process's state atomically is hard; the mapping from impl state to spec state must be maintained as both evolve; setup is highly spec-specific.
Test-case generation¶
Starting from the spec:
- Constrain the spec to a finite state space.
- Enumerate every behaviour (path) through the state graph using TLC.
- Emit one test per behaviour that forces the implementation to follow that exact transition sequence + asserts impl state matches spec state at each step.
If the impl can't follow some behaviour the spec permits,
Bspec ⊄ Bimpl → the impl is incomplete or buggy.
Strengths: exhaustive within the chosen constraint; one TLA+ change regenerates all tests; emitted tests are just unit tests in the production test suite.
Weaknesses: state-space explosion caps the constraint shape; requires squashing all nondeterminism in the impl to force a specific path (often requires significant test harness investment).
When each technique fits¶
| Technique | Fits | Doesn't fit |
|---|---|---|
| Trace-checking | Highly concurrent systems where enumerating all possible behaviours is infeasible; network-level events you can log atomically at the wire boundary | Internal multithreaded state where atomic snapshots are expensive; specs at a much higher abstraction than impl |
| Test-case generation | Algorithms with bounded state space where the impl can be driven deterministically (conflict-resolution, single-node consensus steps); finding bugs in the spec itself (model-checker runs on the spec before the tests are generated) | Whole-system distributed scenarios with open-ended state; impls where forcing a specific behaviour requires invasive instrumentation |
The MongoDB 2020 experience (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs) gave one data point for each: trace-checking the server's commit-point protocol failed in 10 weeks; test-case generation for the Mobile SDK's OT merge rules succeeded and found a real bug (infinite recursion on swap+move conflict).
The eXtreme Modelling recipe¶
The 2011 paper Concurrent Development of Model and Implementation (Methni et al.) — which inspired MongoDB's work — prescribes:
- Multiple specifications model aspects of the system (not one monolithic spec).
- Specifications are written just prior to the implementation (so abstractions align).
- Specifications evolve with the implementation (not frozen at day 1).
- Tests are generated from the model, and/or trace-checking verifies that test traces are legal in the specification.
Davis's 2025 reading (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs): the recipe is sound; MongoDB violated (2) when it tried to retrofit RaftMongo.tla onto a production server that had been written years earlier, which is why trace-checking failed. Every post-2020 academic win Davis surveys (Mocket, SandTable, Multi-Grained Zookeeper, Confidential Consortium Framework) wrote new specs designed to match the implementation rather than reusing old abstract specs. The lesson generalises: if you're doing conformance checking on an old codebase, be prepared to write the spec that matches your code, not the spec you wish your code matched.
Relation to adjacent techniques¶
- concepts/lightweight-formal-verification — writes an executable spec in the same language as production, checked into the same repo, validated by property tests. Skips the TLA+ round-trip entirely. ShardStore is the canonical realization. Strength: no mismatched-abstraction risk. Weakness: you give up TLA+'s richer spec language for temporal properties.
- concepts/deterministic-simulation — makes the implementation itself the oracle under a seeded PRNG. No adjacent spec; invariants are asserted directly on the impl's observed state. Dropbox Nucleus (CanopyCheck + Trinity) is canonical. Strength: no spec-drift problem. Weakness: no external check that what you assert is what the system should do.
- patterns/property-based-testing — the validator mechanism of both lightweight formal verification (ShardStore) and narrow-scope conformance checking (CanopyCheck). Property tests + an executable spec are one flavor of conformance checking.
The spectrum:
| Technique | Spec artifact | Spec language | Validator |
|---|---|---|---|
| Heavyweight formal methods (Coq, Isabelle) | Proof | Theorem prover | Mechanical proof |
| Conformance checking | TLA+ / Alloy spec | Spec language | TLC + trace-check or test-gen |
| Lightweight formal (ShardStore) | Executable spec | Same as production | Property tests |
| Deterministic simulation (Nucleus) | Implementation itself | Same as production | Invariants + PRNG-driven runs |
| Property-based testing | Invariants only | Test library | Generated inputs |
| Example-based tests | None | Test library | Hand-written cases |
The gradient trades spec-author ramp + reach-of-spec for coverage completeness.
Industrial progress 2020 → 2025¶
Davis's survey (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs):
- Test-case generation:
- Mocket — Java-specific; generates tests from TLA+ and instruments Java to force deterministic execution.
- Multi-Grained Specifications for Zookeeper (Schultz + Demirbas) — new TLA+ specs of Zookeeper at multiple abstraction levels; Java instrumentation for variable mapping.
- SandTable — language-agnostic; overrides syscalls to control nondeterminism; samples spec state space for coverage + diversity.
- Design and Modular Verification of Distributed Transactions in MongoDB (VLDB 2025, Schultz + Demirbas) — test-case generation against a new TLA+ spec of MongoDB's WiredTiger storage layer.
- Trace-checking:
- Protocol Conformance with Choreographic PlusCal — new high-level PlusCal-like language compiling to TLA+; generates Go trace-logging functions.
- Validating Traces of Distributed Programs Against TLA+ Specifications — Java trace-logging + TLC validation of incomplete traces (TLC infers missing variable values). New TLC features announced.
- Smart Casual Verification of the Confidential Consortium Framework — identical setup to MongoDB's 2020 failure (new specs of a big old C++ program); using the 2024 TLC features, succeeded after months of effort.
- Finn Hackett (UBC, MongoDB-sponsored) — TLA+-to-Go compiler
- trace-checker prototype; summer 2025 collaboration with Antithesis.
Seen in¶
- sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs — canonical industry articulation; named in the title; both techniques attempted at MongoDB; failure + success + five-year academic survey; post-mortem on where trace-checking breaks (multithreaded state snapshot, spec-impl abstraction mismatch, spec-specific setup doesn't generalise).
- sources/2026-02-27-mongodb-towards-model-based-verification-of-a-key-value-storage-engine —
follow-up from the same MongoDB team showing how
compositional spec
structure makes conformance checking tractable at the
storage-engine layer of the production server. The cross-shard
transactions TLA+ spec was written with a clean interface boundary
to WiredTiger; lifting that boundary out as
a standalone
Storage.tlamodule enabled patterns/test-case-generation-from-spec against the real WiredTiger implementation — 87,143 tests for a 2-key × 2-transaction finite model in ~40 minutes, an order of magnitude larger than the 2020 OT case. Concretely supports the concept-page claim that compositional structure + co-designed interface boundaries make conformance work where 2020's retrofit-on-old-spec failed; interface co-design is the load-bearing precondition, not the spec language.