MONGODB 2025-06-02 Tier 2

MongoDB — Conformance Checking at MongoDB: Testing That Our Code Matches Our TLA+ Specs¶

Summary¶

A. Jesse Jiryu Davis's 2025 retrospective (from 2025's perspective) on the 2020 VLDB paper eXtreme Modelling in Practice — MongoDB's attempt to close the gap between its production C++ / Go implementations and its TLA+ specs. Two parallel experiments: trace-checking of the MongoDB server's Raft-like commit-point protocol (Davis + Judah Schvimer) failed, and test-case generation from a TLA+ spec of the MongoDB Mobile SDK's Operational Transform merge rules (Max Hirschhorn) succeeded. The piece is the canonical industry articulation of why conformance checking is the live gap in spec-driven engineering: a spec that model-checks is not a spec the implementation obeys, and the 2020 mechanics for bridging that gap were prohibitively expensive on a real production system. Adjacent to sources/2025-02-25-allthingsdistributed-building-and-operating-s3 (AWS ShardStore's executable-spec industrialization) and sources/2024-05-31-dropbox-testing-sync-at-dropbox-2020 (Dropbox Nucleus's deterministic simulation testing) — all three target the same problem (prove the real system matches the intended behaviour at production scale) with three different stance choices.

Key takeaways¶

Conformance checking is the live gap. Once you have both a TLA+ spec and a production implementation, keeping them in sync as both evolve is the unsolved part. Davis frames the status quo: "Too often, an engineer tries to write one huge TLA+ spec for the whole system. It's too complex and detailed… the author abandons the spec and concludes that TLA+ is impractical." (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs). The eXtreme Modelling alternative: many small specs, each focused on one aspect, written alongside (not after) the implementation, and continuously conformance-checked.
Two complementary mechanical techniques. A TLA+ spec is a state machine; its behaviours (Bspec) are paths through the state graph. An implementation refines a spec iff Bimpl ⊂ Bspec; bisimulation is Bimpl = Bspec. Two ways to check:
Trace-checking — log the implementation's state transitions to a file; translate the trace into TLA+; ask the TLC model-checker "is this trace allowed by the spec?" If not, Bimpl ⊄ Bspec → bug in impl or spec.
Test-case generation — enumerate behaviours of the spec; emit a unit test per behaviour that forces the impl to follow it; check impl state matches spec state. If the impl can't follow, Bspec ⊄ Bimpl → bug in impl.
Trace-checking the MongoDB server FAILED. Davis + Schvimer picked RaftMongo.tla — the spec of how servers learn the commit point in MongoDB's Raft-like consensus. ~300 hand-written JS integration tests + the "rollback fuzzer" randomized test were instrumented to log state; a Python script merged per-node logs by timestamp and emitted Trace.tla files; TLC was asked to verify each trace was a valid behaviour of RaftMongo.tla. It never worked in 10 weeks. Three named lessons (see takeaways 4–6).
Lesson 1: Snapshotting a multithreaded program's state is hard. "It took us a month to figure out how to instrument MongoDB to get a consistent snapshot of all these values at one moment." MongoDB's fine-grained locking (designed to avoid a global lock) made point-in-time state capture require a custom instrumentation branch — which ate the budget and raised the question of whether the modified binary still exercised the same code paths as production.
Lesson 2: The implementation must actually conform to the spec. "This is obvious to me now." RaftMongo.tla modeled leader handoff as a single atomic action: old leader steps down and new leader steps up at the same instant. The real implementation steps the old leader down before stepping the new leader up. Davis + Schvimer knew about the gap and tried to paper over it in the Python trace-translator — "it never worked". The eXtreme Modelling recipe (write the spec just prior to the implementation, evolve them together) would have prevented it; a spec written years after the implementation, at a high abstraction level, has irreconcilable gaps.
Lesson 3: Trace-checking setup must generalise across specs. 10 weeks of instrumentation work was ~100% specific to RaftMongo.tla. Extending to another spec (replication, transactions, sharding, …) meant starting over. Against the vision of check every trace against every spec on every commit, the marginal cost was too high. Project cancelled.
Test-case generation for the Mobile SDK's OT merge rules SUCCEEDED. The MongoDB Mobile SDK (since sunset as Atlas Device Sync) was a mobile-local database syncing with a central server via Operational Transform. Two implementations of the same OT algorithm (C++ client, Go server) needed to agree on conflict resolution for eventually-consistent replicas. Hirschhorn manually translated the ~1000-line C++ array-merge portion (21 rules for 6 array operations) into TLA+ in two weeks. The model-checker caught TLA+ mistakes quickly, closing the loop.
TLC crashed on a real bug. One write-conflict case — one participant swaps two array elements, another moves an element — triggered a Java StackOverflowError in TLC due to infinite recursion in the algorithm itself. Verified in the C++ too. Fix: deprecate the element-swap operation in both the spec and the Mobile SDK. The TLA+ translation acted as an infinite-recursion finder that neither C++ compilation nor unit tests had found.
Constrained state space → full test enumeration. Constrained the algorithm to 3 participants × 3-element array × 1 write op each. State space: 30,184 states / 4,913 behaviours — finite, enumerable DAG. Go program parsed TLC's state-graph output and emitted one C++ unit test per behaviour. 4,913 tests achieved 100% branch coverage vs 21% from handwritten and 92% from millions of AFL-fuzzer runs. This is the conformance-checking win: complete coverage of a production implementation, derived from the spec, via the test-case-generation pattern.
Where we'd do it differently. Retrospective guidance for trace-checking a big system: (a) ensure spec + impl conform at the start (eXtreme Modelling), fix discrepancies by changing one side or the other immediately rather than papering over; (b) model easily observable events (network messages) instead of internal state of multithreaded processes — network logs are atomic by the wire boundary; (c) develop trace-checking infra that is spec-generic, not spec-specific. Ongoing work: MongoDB sponsors Finn Hackett (PhD, UBC) to continue trace-checking research; summer 2025 collaboration with Antithesis.
The five-year state of the field (2020 → 2025). Davis surveys the academic progress since the VLDB paper. Named systems (all use some variant of "write new specs aligned with impl, generate / validate traces at production scale"):
- Mocket — Java-specific; generates tests from TLA+ and instruments Java to force deterministic execution + check variable equality against spec.
- Multi-Grained Specifications for Zookeeper — Will Schultz + Murat Demirbas; new TLA+ specs at multiple abstraction levels for Zookeeper; human instrumenting Java to map variables.
- SandTable — language-agnostic; overrides system calls to control nondeterminism; samples spec state space for coverage + diversity; emphasizes writing new specs matching the implementation.
- Design and Modular Verification of Distributed Transactions in MongoDB (VLDB 2025) — Will Schultz + Murat Demirbas; test-case generation against a new TLA+ spec of MongoDB's WiredTiger storage layer.
- Protocol Conformance with Choreographic PlusCal — new high-level PlusCal-like language compiling to TLA+; generates Go trace-logging functions.
- Validating Traces of Distributed Programs Against TLA+ Specifications — veteran formal-methods authors demonstrate Java trace-logging + TLC validation of incomplete traces (infer missing variable values).
- Smart Casual Verification of the Confidential Consortium Framework — identical setup to MongoDB 2020 (new specs of a big old C++ program); after months of effort, succeeded using the 2024 TLC features.
- Finn Hackett's TLA+-to-Go compiler + trace-checker prototype. Common thread in all recent wins: write new specs co-designed with the implementation, not old abstract specs; instrument enough to observe the right events, not all state.
Placement vs adjacent sources. MongoDB's trace-checking failure and OT test-case-generation success bracket the conformance-checking design space from a different angle than the wiki's existing sources:
- sources/2025-02-25-allthingsdistributed-building-and-operating-s3 (AWS ShardStore) — write a compact executable spec in the same language as production + property-test; no TLA+. Succeeded by avoiding the TLA+ translation step altogether.
- sources/2024-05-31-dropbox-testing-sync-at-dropbox-2020 (Dropbox Nucleus) — no adjacent spec at all; testability is an architectural concern, the implementation is engineered so deterministic randomized tests can be written directly against it.
- sources/2026-02-17-allthingsdistributed-byron-cook-automated-reasoning-trust-ai (AWS Byron Cook) — productises specification-driven development via systems/kiro + systems/bedrock-guardrails-automated-reasoning-checks; autoformalization collapses the spec-authoring ramp for non-specialists. MongoDB tried the oldest shape (TLA+ spec + real code) at production scale and documented which half of it works and which doesn't. That's the unique contribution.

Operational numbers¶

~300 handwritten JavaScript integration tests from the MongoDB server suite were selected as trace-checking inputs, alongside the rollback fuzzer randomized test (CRUD ops + random partition create/heal).
3-node replica set per test run, all processes on one machine communicating over localhost → clock-synchronization was not a concern (merge logs by timestamp).
10 weeks of effort on trace-checking before project termination; ~1 month spent on the multithreaded-state-snapshot instrumentation alone.
~1000 lines of C++ for the array-merge portion of the MongoDB Mobile SDK's OT; ~2 weeks to translate manually to TLA+.
19 client operation types; 6 array operations; 21 array merge rules.
State graph constrained to 3 participants × 3-element array × 1 write op each → 30,184 states / 4,913 behaviours — a finite enumerable DAG.
4,913 generated C++ unit tests → 100% branch coverage of the OT implementation. Handwritten tests reached 21%; millions of AFL fuzzer runs reached 92%.
1 real bug found — infinite recursion triggered by swap+move conflict. Deprecated in the algorithm + implementation.

Caveats¶

Project timing: 2020. Written-up in 2025 with a five-year retrospective lens, but the core work is five years old. TLC has since gained incomplete-trace inference features (announced with the "Validating Traces" paper) that might change the tractability conclusion for trace-checking; Davis acknowledges this.
Product sunset. The MongoDB Mobile SDK / Atlas Device Sync — where test-case generation worked — has been sunset. The TLA+ spec + test harness aren't actively maintained on a live product. The concrete conformance-checking result (100% branch coverage, one algorithm bug fixed) stands; the long-term maintenance burden was never tested.
Trace-checking didn't fail on the mechanics; it failed on mismatched abstraction levels. The lesson "the implementation must conform to the spec" means the spec RaftMongo.tla was written at a higher abstraction than the implementation in ways the team knew about at the start. A spec written just before the corresponding implementation (as eXtreme Modelling prescribes) would change this calculus — but MongoDB's specs were written long after the already-shipped server.
State-space enumeration only scales to small constraints. 3×3×1 was tractable; broader input shapes blow up. Test-case-generation is bounded by what fits in TLC's model-checker on the constrained state space.
The 2025 field survey is descriptive, not evaluated. Davis lists recent papers but doesn't compare them head-to-head or report production results from any of them (except the 2025 CCF paper, which he notes ran for "months" of spec+impl-alignment work).

Source¶

Original: https://www.mongodb.com/blog/post/engineering/conformance-checking-at-mongodb-testing-our-code-matches-our-tla-specs
Raw markdown: raw/mongodb/2025-06-02-conformance-checking-at-mongodb-testing-that-our-code-matche-a395ebf6.md
VLDB 2020 paper: eXtreme Modelling in Practice
Author's VLDB talk: YouTube (linked from original)
HN discussion: #44163496 (110 points)

concepts/conformance-checking — the core concept named by the article
concepts/temporal-logic-specification — the family TLA+ belongs to
concepts/lightweight-formal-verification — sister tradition (AWS ShardStore)
concepts/specification-driven-development — the broader movement this slots into
concepts/bisimulation — the Bspec ⊂ Bimpl AND Bimpl ⊂ Bspec target
concepts/operational-transform — the Mobile SDK conflict-resolution algorithm
patterns/trace-checking — the technique that failed
patterns/test-case-generation-from-spec — the technique that succeeded
patterns/extreme-modelling — the 2011 methodology that frames the work
patterns/property-based-testing — the peer technique against which spec-derived tests compete
systems/tla-plus — the specification language
systems/tlc-model-checker — the checker that runs the specs
systems/raft-mongo-tla — the specific spec that failed trace-checking
systems/mongodb-server — the implementation side of the failed experiment
systems/mongodb-mobile-sdk — the implementation side of the successful experiment
companies/mongodb