
Conformance checking

Definition

Conformance checking is the problem of mechanically verifying that a production implementation behaves as a formal specification (TLA+, Alloy, Event-B, or other) says it should — and continuing to verify this as both evolve. It is the keep-them-in-sync half of specification-driven development; the other half is writing the spec.

Formally, if Bspec is the set of behaviours the spec permits and Bimpl is the set the implementation actually exhibits, then:

  • Refinement: Bimpl ⊆ Bspec — the implementation never does anything the spec forbids.
  • Bisimulation: Bimpl = Bspec — the implementation realises every behaviour the spec permits, and vice versa.

Refinement is the minimum for a correct implementation; bisimulation is stronger and sometimes desirable. Conformance checking mechanises either direction (or both).
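The two relations can be sketched over finite behaviour sets. This is a toy illustration only: a real spec's behaviour set is usually infinite and is checked symbolically by a model checker, not by enumeration, and the states here are invented.

```python
# Behaviours modelled as finite sets of state traces (tuples of states).
# Toy example -- real behaviour sets are infinite and checked symbolically.

def refines(b_impl, b_spec):
    """Refinement: every implementation behaviour is permitted by the spec."""
    return b_impl <= b_spec

def bisimilar(b_impl, b_spec):
    """Trace equality (what the text calls bisimulation): each side
    exhibits exactly the other's behaviours."""
    return b_impl == b_spec

b_spec = {("init", "leader"), ("init", "follower")}
b_impl = {("init", "leader")}

print(refines(b_impl, b_spec))    # True: impl never does anything forbidden
print(bisimilar(b_impl, b_spec))  # False: impl never exhibits ("init", "follower")
```

The example shows why refinement alone can hide incompleteness: the implementation above is a valid refinement yet realises only half of what the spec permits.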

The core problem

A TLA+ spec that the model-checker accepts is not automatically a spec the implementation obeys. Three failure modes dominate (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs):

  1. The spec was written after the implementation. Abstractions chosen at spec-time may be structurally incompatible with what the code already does. MongoDB's RaftMongo.tla modeled leader handoff as one atomic state transition; the real server steps the old leader down before stepping the new leader up. The gap was known; trying to paper over it in the trace-translator "never worked".
  2. The implementation drifts. Even if spec + impl agreed at day 1, normal development pushes them apart unless conformance is checked continuously (every commit, not once a year).
  3. The mechanical checking is expensive. Either direction (trace-checking the implementation against the spec, or generating tests from the spec and running them against the implementation) requires instrumentation, translation glue, and CI integration that do not generalise easily across different specs.

Two complementary techniques

Given a TLA+ spec (a state machine), conformance can be checked in either direction:

Trace-checking

Starting from the implementation:

  1. Instrument the implementation to log every state transition.
  2. Collect execution traces from real / randomized / stress tests.
  3. Translate each trace into TLA+ (typically: a specialised Trace.tla with a sequence of states + an invariant "this behaviour conforms to Spec.tla").
  4. Run TLC to verify the trace is a valid behaviour of the spec.
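Step 3 — translating a logged trace into a TLA+ module — can be sketched as a small generator. The module shape, the `TraceStates` name, and the logged fields below are illustrative assumptions, not MongoDB's actual tooling; a real translator would also emit the invariant connecting the trace to Spec.tla.

```python
def trace_to_tla(trace, module="Trace"):
    """Render a logged execution trace (a list of state dicts) as a TLA+
    module whose TraceStates sequence TLC can check against the real spec.
    Module shape and names are illustrative, not any real tool's output."""
    def to_tla(v):
        if isinstance(v, bool):            # bool before int: bool is an int subclass
            return "TRUE" if v else "FALSE"
        if isinstance(v, int):
            return str(v)
        if isinstance(v, str):
            return f'"{v}"'
        raise TypeError(f"unsupported value: {v!r}")

    states = ",\n  ".join(
        "[" + ", ".join(f"{k} |-> {to_tla(v)}" for k, v in sorted(s.items())) + "]"
        for s in trace
    )
    return (
        f"---- MODULE {module} ----\n"
        "EXTENDS Naturals, Sequences\n"
        f"TraceStates == <<\n  {states}\n>>\n"
        "====\n"
    )

# Two logged states from a hypothetical instrumented replication node.
log = [{"term": 1, "role": "follower"}, {"term": 2, "role": "leader"}]
print(trace_to_tla(log))
```

The hard part in practice is not this rendering but the mapping from implementation state to spec state, which must be maintained as both evolve.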

If TLC rejects a trace, then Bimpl ⊄ Bspec → either the implementation is buggy or the spec is wrong. Ron Pressler's technique is the canonical way to encode a trace as a TLA+ spec.

Strengths: uses real (or fuzz-driven) executions; no need to enumerate the spec's full state space.

Weaknesses: snapshotting a multithreaded process's state atomically is hard; the mapping from impl state to spec state must be maintained as both evolve; setup is highly spec-specific.

Test-case generation

Starting from the spec:

  1. Constrain the spec to a finite state space.
  2. Enumerate every behaviour (path) through the state graph using TLC.
  3. Emit one test per behaviour that forces the implementation to follow that exact transition sequence + asserts impl state matches spec state at each step.
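The three steps can be sketched against a toy state machine. The two-state lock spec, the `impl.state` interface, and the emitted test shape are all hypothetical; a real pipeline would take the state graph from TLC's enumeration of the constrained spec.

```python
# Sketch of test-case generation from a finite state graph.
# SPEC is a hypothetical two-state lock, not a real TLA+ model.

SPEC = {  # state -> {action: next_state}
    "unlocked": {"acquire": "locked"},
    "locked":   {"release": "unlocked"},
}

def behaviours(init, depth):
    """Enumerate every action path through SPEC up to `depth` steps."""
    paths = [[]]
    frontier = [(init, [])]
    for _ in range(depth):
        nxt = []
        for state, path in frontier:
            for action, succ in SPEC[state].items():
                p = path + [(action, succ)]
                paths.append(p)
                nxt.append((succ, p))
        frontier = nxt
    return paths

def emit_test(path):
    """One generated unit test: drive the impl through this exact action
    sequence and assert its state matches the spec state after each step."""
    lines = []
    for action, succ in path:
        lines.append(f"    impl.{action}()")
        lines.append(f"    assert impl.state == {succ!r}")
    if not lines:
        lines = ["    pass"]
    name = "_".join(a for a, _ in path) or "empty"
    return f"def test_{name}(impl):\n" + "\n".join(lines)

for p in behaviours("unlocked", 2):
    print(emit_test(p))
```

Because every test is derived from the graph, one change to the spec regenerates the whole suite — the property the Strengths paragraph below relies on.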

If the impl can't follow some behaviour the spec permits, Bspec ⊄ Bimpl → the impl is incomplete or buggy.

Strengths: exhaustive within the chosen constraint; one TLA+ change regenerates all tests; emitted tests are just unit tests in the production test suite.

Weaknesses: state-space explosion caps the constraint shape; requires squashing all nondeterminism in the impl to force a specific path (often requires significant test harness investment).

When each technique fits

Trace-checking
  • Fits: highly concurrent systems where enumerating all possible behaviours is infeasible; network-level events you can log atomically at the wire boundary.
  • Doesn't fit: internal multithreaded state where atomic snapshots are expensive; specs at a much higher abstraction than the implementation.

Test-case generation
  • Fits: algorithms with a bounded state space where the implementation can be driven deterministically (conflict resolution, single-node consensus steps); finding bugs in the spec itself (the model checker runs on the spec before the tests are generated).
  • Doesn't fit: whole-system distributed scenarios with open-ended state; implementations where forcing a specific behaviour requires invasive instrumentation.

The MongoDB 2020 experience (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs) gave one data point for each: trace-checking the server's commit-point protocol was abandoned after 10 weeks; test-case generation for the Mobile SDK's OT merge rules succeeded and found a real bug (infinite recursion on a swap+move conflict).

The eXtreme Modelling recipe

The 2011 paper Concurrent Development of Model and Implementation (Methni et al.) — which inspired MongoDB's work — prescribes:

  1. Multiple specifications model aspects of the system (not one monolithic spec).
  2. Specifications are written just prior to the implementation (so abstractions align).
  3. Specifications evolve with the implementation (not frozen at day 1).
  4. Tests are generated from the model, and/or trace-checking verifies that test traces are legal in the specification.

Davis's 2025 reading (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs): the recipe is sound; MongoDB violated (2) when it tried to retrofit RaftMongo.tla onto a production server that had been written years earlier, which is why trace-checking failed. Every post-2020 academic win Davis surveys (Mocket, SandTable, Multi-Grained Zookeeper, Confidential Consortium Framework) wrote new specs designed to match the implementation rather than reusing old abstract specs. The lesson generalises: if you're doing conformance checking on an old codebase, be prepared to write the spec that matches your code, not the spec you wish your code matched.

Relation to adjacent techniques

  • concepts/lightweight-formal-verification — writes an executable spec in the same language as production, checked into the same repo, validated by property tests. Skips the TLA+ round-trip entirely. ShardStore is the canonical realization. Strength: no mismatched-abstraction risk. Weakness: you give up TLA+'s richer spec language for temporal properties.
  • concepts/deterministic-simulation — makes the implementation itself the oracle under a seeded PRNG. No adjacent spec; invariants are asserted directly on the impl's observed state. Dropbox Nucleus (CanopyCheck + Trinity) is canonical. Strength: no spec-drift problem. Weakness: no external check that what you assert is what the system should do.
  • patterns/property-based-testing — the validator mechanism of both lightweight formal verification (ShardStore) and narrow-scope conformance checking (CanopyCheck). Property tests + an executable spec are one flavor of conformance checking.
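The property-test flavour of conformance checking can be sketched as an executable spec and an implementation driven through the same random operation sequence, with agreement asserted at every step. Everything here is hypothetical (a toy saturating counter, hand-rolled random generation in place of a property-testing library):

```python
import random

# Property-based conformance: an executable spec (reference model) and the
# "production" implementation run identical random operations; the property
# is stepwise state agreement. All names are illustrative.

class SpecCounter:
    """Executable spec: the obviously-correct model."""
    def __init__(self): self.n = 0
    def incr(self): self.n += 1
    def decr(self): self.n = max(0, self.n - 1)   # spec: never below zero

class ImplCounter:
    """Code under test, written independently of the spec."""
    def __init__(self): self.n = 0
    def incr(self): self.n += 1
    def decr(self):
        if self.n > 0:
            self.n -= 1

def conforms(seed, steps=200):
    """Drive spec and impl through one seeded random run; fail on divergence."""
    rng = random.Random(seed)
    spec, impl = SpecCounter(), ImplCounter()
    for _ in range(steps):
        op = rng.choice(["incr", "decr"])
        getattr(spec, op)()
        getattr(impl, op)()
        if spec.n != impl.n:   # refinement violated at this step
            return False
    return True

print(all(conforms(seed) for seed in range(100)))  # True
```

The seeded PRNG makes failures replayable, which is the same design choice deterministic simulation turns into a whole methodology.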

The spectrum:

  • Heavyweight formal methods (Coq, Isabelle) — spec artifact: proof; spec language: theorem prover; validator: mechanical proof.
  • Conformance checking — spec artifact: TLA+ / Alloy spec; spec language: dedicated spec language; validator: TLC + trace-check or test-gen.
  • Lightweight formal (ShardStore) — spec artifact: executable spec; spec language: same as production; validator: property tests.
  • Deterministic simulation (Nucleus) — spec artifact: the implementation itself; spec language: same as production; validator: invariants + PRNG-driven runs.
  • Property-based testing — spec artifact: invariants only; spec language: test library; validator: generated inputs.
  • Example-based tests — spec artifact: none; spec language: test library; validator: hand-written cases.

The gradient trades spec-authoring effort and the expressive reach of the spec language against completeness of coverage.

Industrial progress 2020 → 2025

Davis's survey (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs):

Seen in

  • sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs — canonical industry articulation; named in the title; both techniques attempted at MongoDB; failure + success + five-year academic survey; post-mortem on where trace-checking breaks (multithreaded state snapshot, spec-impl abstraction mismatch, spec-specific setup doesn't generalise).
  • sources/2026-02-27-mongodb-towards-model-based-verification-of-a-key-value-storage-engine — follow-up from the same MongoDB team showing how compositional spec structure makes conformance checking tractable at the storage-engine layer of the production server. The cross-shard transactions TLA+ spec was written with a clean interface boundary to WiredTiger; lifting that boundary out as a standalone Storage.tla module enabled patterns/test-case-generation-from-spec against the real WiredTiger implementation — 87,143 tests for a 2-key × 2-transaction finite model in ~40 minutes, an order of magnitude larger than the 2020 OT case. Concretely supports the concept-page claim that compositional structure + co-designed interface boundaries make conformance work where 2020's retrofit-on-old-spec failed; interface co-design is the load-bearing precondition, not the spec language.