PATTERN Cited by 2 sources

eXtreme Modelling

Pattern

eXtreme Modelling (XM) is a methodology for using formal-specification languages (TLA+, Alloy, Event-B) in production software development, proposed by Methni, Madeira, Chafia, and Koutny in the 2011 paper Concurrent Development of Model and Implementation. It prescribes four practices:

  1. Multiple specifications model aspects of the system (not one monolithic spec).
  2. Specifications are written just prior to the implementation (not years after).
  3. Specifications evolve with the implementation (never frozen).
  4. Tests are generated from the model (see patterns/test-case-generation-from-spec), and/or trace-checking verifies that test traces are legal in the specification.

Practice (4) is the conformance-checking half; practices (1)-(3) are the methodology half.

The problem it targets

Without eXtreme Modelling, two industrial failure modes dominate (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs):

  • Monolithic spec rots into impracticality. An engineer tries to write one huge TLA+ spec for the whole system; it grows too complex to understand, state-space explosion defeats model checking, and the author abandons it, concluding "TLA+ is impractical". (This is A. Jesse Jiryu Davis's framing.)
  • Spec-impl abstraction mismatch. A spec written years after the implementation, at a much higher abstraction level, doesn't align with how the impl actually transitions states. Conformance checking against it becomes a paper-over exercise that never converges. (MongoDB 2020 is the canonical failure: RaftMongo.tla modeled leader handoff as atomic; the real server steps the old leader down before stepping the new leader up; 10 weeks of translator workarounds never reconciled the gap.)

XM addresses both: practice (1), many small specs, prevents monolithic rot; practices (2)-(3) prevent abstraction drift.

The practices in detail

1. Multiple specifications model aspects

One big spec → state-space explosion + unreviewable.

Many small specs → each model-checks in reasonable time, each is reviewable by the team that owns that aspect, each can be trace-checked or test-case-generated independently.

Example (MongoDB): instead of one spec modelling the whole server, the team has separate specs for consensus/commit-point propagation (RaftMongo.tla), for replication, for transactions, for WiredTiger storage — each ~100s of lines of TLA+ covering one aspect.
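The state-space arithmetic behind this practice can be made concrete. A minimal sketch, using two toy "aspects" (a commit-point counter and a lag counter — illustrative stand-ins, not MongoDB's actual specs): checked separately their state counts add, but composed into one spec they multiply, which is the explosion that defeats monolithic model checking.

```python
# Toy demonstration: model checking two aspects separately vs. as one
# composed spec. The aspects here are hypothetical, not real MongoDB specs.

def reachable(init, step):
    """Breadth-first enumeration of the reachable state space."""
    seen, frontier = {init}, [init]
    while frontier:
        nxt = []
        for s in frontier:
            for t in step(s):
                if t not in seen:
                    seen.add(t)
                    nxt.append(t)
        frontier = nxt
    return seen

# Aspect A: a commit point that advances modulo 10.
step_a = lambda s: [(s + 1) % 10]
# Aspect B: a replication-lag counter, also modulo 10.
step_b = lambda s: [(s + 1) % 10]

# Checked separately: 10 + 10 = 20 states in total.
a = reachable(0, step_a)
b = reachable(0, step_b)

# Checked as one composed spec: the product space, 10 * 10 = 100 states.
step_ab = lambda s: ([(t, s[1]) for t in step_a(s[0])] +
                     [(s[0], t) for t in step_b(s[1])])
ab = reachable((0, 0), step_ab)

print(len(a) + len(b), len(ab))  # → 20 100
```

With n independent aspects the composed space grows as the product of the per-aspect sizes, so splitting along aspect boundaries is what keeps each model check tractable.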

2. Write specs just before implementing

Just before, not just after. Writing the spec forces the author to think through the algorithm, state variables, transitions, and invariants — before writing code that commits to concrete data structures and control flow.

This is the ordering lesson MongoDB called out directly:

I can imagine another world where we knew about eXtreme Modelling and TLA+ at the start, when we began coding MongoDB. In that world, we wrote our spec before the implementation, with trace-checking in mind. The spec and implementation would've been structured similarly, and this would all have been much easier.

The spec and impl then structurally match — atomic actions in the spec correspond to atomic transitions in the impl, state variables are literally the same variables. Conformance checking works because the abstractions align by construction.
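What "abstractions align by construction" can look like, as a minimal sketch (toy names, not MongoDB's code): the spec's atomic actions are guarded transitions over named state variables, and the implementation, written just after, uses the same variables and maps each of its steps 1:1 onto a spec action.

```python
# Hypothetical example of spec/impl co-design. Each spec action is a
# (guard, effect) pair over the shared state variables {"role", "term"}.
SPEC_ACTIONS = {
    "BecomeLeader": (lambda s: s["role"] == "follower",
                     lambda s: {"role": "leader", "term": s["term"] + 1}),
    "StepDown":     (lambda s: s["role"] == "leader",
                     lambda s: {"role": "follower", "term": s["term"]}),
}

class Node:
    """Implementation written just after the spec: same variables, same steps."""
    def __init__(self):
        self.state = {"role": "follower", "term": 0}
        self.trace = [dict(self.state)]   # recorded for trace-checking

    def _do(self, action):
        # One impl step corresponds to exactly one atomic spec action.
        guard, effect = SPEC_ACTIONS[action]
        assert guard(self.state), f"{action} fired while its guard was false"
        self.state = effect(self.state)
        self.trace.append(dict(self.state))

    def become_leader(self): self._do("BecomeLeader")
    def step_down(self):     self._do("StepDown")

n = Node()
n.become_leader()
n.step_down()
print(n.trace)
```

Because every implementation transition names the spec action it realises, a recorded trace is trivially checkable against the spec — the opposite of the 2020 situation, where the spec's atomic leader handoff had no counterpart in the stepped-down/stepped-up implementation.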

3. Evolve specs with the implementation

A spec frozen at day 1 drifts from the impl as the impl evolves. Each feature added to the impl either:

  • Has a corresponding spec update (fits XM), or
  • Silently changes impl behaviour without changing the spec — at which point the spec is a lie.

XM requires the spec to be a maintained artifact — part of every feature's definition of done is updating the spec, and the conformance check on every commit catches the drift if the engineer forgets.

4. Continuous conformance checking

The spec is only useful if it's mechanically connected to the impl via patterns/trace-checking or patterns/test-case-generation-from-spec running on every commit. A spec that doesn't stop a merge when it diverges from the impl is at best documentation; often it is a lie.
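The trace-checking half of that mechanical connection reduces to one question per recorded step: is this state pair a legal spec transition? A hedged sketch, over a toy two-variable model (not any real MongoDB spec):

```python
# Toy trace-checker: replay a recorded implementation trace and confirm
# every consecutive state pair matches some spec action (or stutters).

def legal_step(before, after):
    """True iff some spec action takes `before` to `after`, or it stutters."""
    increment = after == {"x": before["x"] + 1, "flag": before["flag"]}
    toggle    = after == {"x": before["x"], "flag": not before["flag"]}
    stutter   = after == before
    return increment or toggle or stutter

def check_trace(trace):
    for i, (a, b) in enumerate(zip(trace, trace[1:])):
        if not legal_step(a, b):
            return f"divergence at step {i}: {a} -> {b}"  # fail the merge
    return "conforms"

good = [{"x": 0, "flag": False}, {"x": 1, "flag": False}, {"x": 1, "flag": True}]
bad  = [{"x": 0, "flag": False}, {"x": 2, "flag": True}]  # no single action does this
print(check_trace(good))  # → conforms
print(check_trace(bad))   # → divergence at step 0: ...
```

In a real setup the `legal_step` check is performed by a model checker (e.g. TLC) constrained to the recorded trace, and the CI job fails the commit on any divergence — that failure signal is what keeps practice (3) honest.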

MongoDB's reported adherence

From A. Jesse Jiryu Davis's 2025 retrospective (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs):

  • Multiple specifications — yes, adopted. MongoDB maintains a handful of specs (RaftMongo.tla + others) each focused on one aspect. "This was the direction MongoDB was already going."
  • Specs written just before implementation — not adopted. The specs were written years after the C++ server shipped. This is the root cause of the 2020 trace-checking failure.
  • Specs evolve with implementation — partial. Specs are maintained but on a different cadence than the code.
  • Conformance checking — the 2020 paper was an experiment in adding this; trace-checking failed, test-case generation succeeded on the Mobile SDK. Ongoing: MongoDB sponsors Finn Hackett (UBC) to research trace-checking infrastructure; 2025 Antithesis collaboration; VLDB 2025 paper with Schultz + Demirbas on transaction conformance.

Adoption data points since 2020

Every post-2020 conformance-checking success Davis surveys shares the same trait: new specs were co-designed with the implementation (XM practice 2):

  • SandTable — "wisely developed new TLA+ specs that closely matched the implementations they were testing, rather than trying to use existing, overly abstract specs".
  • Multi-Grained Zookeeper — wrote new TLA+ specs at higher and lower levels of abstraction; conformance-check against the most-concrete one.
  • Smart Casual Verification of CCF — started with a big old C++ program; wrote new specs; spent months aligning specs and code until conformance held. Davis notes this is the closest parallel to MongoDB's 2020 failure — same setup, different outcome because they ground away at alignment instead of giving up.
  • VLDB 2025 MongoDB transactions paper — new spec of WiredTiger matched to the storage-layer implementation.

The lesson is prescriptive: if you are applying XM to an old codebase, expect to write a new spec that matches your impl, not dust off an old abstract spec.

Relation to neighbours

  • concepts/specification-driven-development — the broader movement XM is one methodology for. XM is about when and at what granularity you write specs; spec-driven development is about treating the spec as a first-class customer-visible artifact. Complementary, not competing.
  • concepts/lightweight-formal-verification — AWS ShardStore's stance — same spirit (small continuously-checked specs) but the specs are written in the same language as the production code (executable specs) rather than in TLA+. Both avoid the monolithic-spec failure. XM keeps TLA+'s richer language; executable specs trade spec-language power for a much shorter author ramp-up.
  • concepts/conformance-checking — XM's practice (4). Without mechanical conformance checking, XM is just "write good docs and update them" — the discipline collapses under normal schedule pressure.

Seen in

  • sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs — canonical industry articulation; names eXtreme Modelling explicitly, cites the 2011 paper, narrates MongoDB 2020's partial adoption (multiple specs yes; just-before ordering no), frames the retrospective lesson as "we wrote our spec before the implementation, with trace-checking in mind. The spec and implementation would've been structured similarly, and this would all have been much easier"; post-2020 academic successes (SandTable, Multi-Grained Zookeeper, CCF) all rediscover XM's practice 2 independently.
  • sources/2026-02-27-mongodb-towards-model-based-verification-of-a-key-value-storage-engine — MongoDB's 2025 realisation of XM practice 1 (multiple specs) + 2 (just-before ordering) in one artefact: the cross-shard transactions TLA+ spec is written compositionally with an explicit interface boundary to WiredTiger, extracted as a standalone Storage.tla module. Practice 4 (test-case generation from model) is realised end-to-end on the storage-interface module — 87,143 generated tests for a 2×2 finite model in ~40 minutes. A working instance of all four XM practices on the same codebase the 2020 experiment failed on; the difference is that the 2025 spec was co-designed with the VLDB 2025 protocol work, not retrofitted years later.