Skip to content

PATTERN Cited by 1 source

Trace-checking

Pattern

Trace-check a production implementation against a formal specification: instrument the impl to log state transitions, collect execution traces from real / stress / randomized tests, translate each trace into the specification language, and ask the model-checker whether the trace is a valid behaviour of the spec. A rejected trace means Bimpl ⊄ Bspec — the implementation did something the spec forbids — so either the impl is buggy or the spec is wrong.

This is the direction from conformance checking that starts with the implementation. The complementary direction is test-case generation from a spec which drives the implementation off enumerated spec behaviours.

The mechanics

Following Ron Pressler's technique used in the MongoDB 2020 experiment (Source: sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs):

  1. Pick a TLA+ spec covering some aspect of the system. (MongoDB: RaftMongo.tla for the commit-point protocol.)
  2. Run tests of the implementation — unit, integration, randomized — with state-transition logging enabled on every node.
  3. Collect execution traces. For multi-node tests, each node emits its own log; merge the logs by timestamp (if nodes share a clock) or by wire-observable events.
  4. Translate each trace into TLA+. Write a script that reads the log and emits a Trace.tla asserting a sequence of states:
    VARIABLES state_0, state_1, state_2, ...
    Init == state_0 = <<initial state>>
    Next == state_0' = <<hardcoded next state from log>>
    
    with one action per logged transition.
  5. Assert conformance. Add a property Conforms == Spec (where Spec is the imported spec's behaviour formula) to Trace.tla.
  6. Run TLC on Trace.tla. If TLC says the trace is a valid behaviour of Spec, conformance holds for this run. If not, Bimpl ⊄ Bspec and you have a bug.

Why it's hard (MongoDB's three lessons)

The MongoDB 2020 experiment abandoned trace-checking after 10 weeks of effort. Three named failure modes:

1. Multithreaded-state snapshotting is expensive

It took us a month to figure out how to instrument MongoDB to get a consistent snapshot of all these values at one moment.

A TLA+ state is an assignment of values to all state variables at one instant. For a multithreaded process with fine-grained locking (MongoDB avoids a global lock by design), capturing all state variables at a single moment requires invasive instrumentation: either a custom lock ordering that lets a reader thread grab them all, or language-level machinery like Rust's typed invariants. A month of work spent on the mechanism — not yet on the actual conformance question — is the warning shape.

Mitigation (from the 2024 Validating Traces paper): log only some state variables; let TLC infer the missing ones. Reduces the atomic-snapshot cost at the price of more permissive checks. Wasn't a standard TLC feature in 2020; is now.

Alternative (Davis's 2025 retrospective): model easily-observable events — network messages between nodes — rather than internal state. Messages are atomic by the wire boundary and easy to log without instrumenting the producer's locking discipline.

2. The spec must actually match the implementation

In our real-life implementation, when an old leader votes for a new one, first the old leader steps down, then the new leader steps up. The spec we chose for trace-checking wasn't focused on the election protocol, though, so for simplicity, the spec assumed these two actions happened at once.

A spec written at a higher abstraction level than the impl can encode assumptions the impl violates — atomic actions that aren't atomic, ordered events without explicit ordering, merged states that are really two distinct states. Davis and Schvimer tried to hide the discrepancy in the Python trace-translator. "It never worked."

Mitigation (from the eXtreme Modelling recipe): write the spec just prior to the implementation, not years after. Abstractions align by construction. Fix discrepancies by changing either side the moment they surface.

Practical implication: if you're conformance-checking an old codebase, expect to write a new spec matching the code rather than reusing an old abstract spec. Every post-2020 academic success Davis surveys (Mocket, SandTable, Multi-Grained Zookeeper, Confidential Consortium Framework) did this.

3. Trace-checking doesn't generalise across specs

Judah and I put in 10 weeks of effort without successfully trace-checking one spec, and most of the work was specific to that spec, RaftMongo.tla.

The instrumentation, log format, translator script, and abstraction-reconciliation work is mostly per-spec. Extending to another spec (transactions, sharding, storage) means starting over. Against the vision of every trace, every spec, every commit, the marginal cost is prohibitive.

Mitigation: invest up-front in spec-agnostic trace-checking infrastructure. The 2024 Validating Traces work explicitly targets this with generic TLC features for incomplete trace inference. Finn Hackett's TLA+-to-Go toolchain aims for the same — one compiler, reusable trace-checker per generated Go artifact.

When trace-checking is the right tool

  • Highly concurrent / distributed systems where enumerating all spec behaviours for test-case generation is infeasible (state-space explosion). Trace-checking rides real (or randomized) executions, so coverage is whatever the test suite produces — unbounded in principle.
  • Systems with clean observable events at process boundaries (RPCs, wire messages, state snapshots written to a log). Logging the boundary is much cheaper than logging internal state.
  • Spec written just before impl, in the eXtreme Modelling style, so abstractions match.

When trace-checking doesn't fit

  • Spec written years after a production impl at a much higher abstraction level. Plan to rewrite the spec or use a different technique.
  • Atomic-snapshot cost prohibitive. Multithreaded, globally-locked, or high-rate internal-state-change systems where instrumenting every transition perturbs behaviour.
  • One-shot conformance effort. Trace-checking amortises across many specs; a single conformance check is more cheaply done with test-case generation on a narrow spec.

Relation to neighbours

  • patterns/test-case-generation-from-spec — the other conformance-checking direction. Succeeds where trace-checking fails (narrow algorithms with bounded state space); fails where trace-checking could succeed (open-ended distributed runtime). MongoDB 2020 attempted both in parallel.
  • concepts/lightweight-formal-verification — skips TLA+ entirely; writes an executable spec in the production language and property-tests impl against it. No trace-translator, no abstraction-mismatch. AWS ShardStore is the canonical realization.
  • patterns/property-based-testing — tests invariants on the impl itself, no adjacent spec. The Dropbox Nucleus choice; complementary to trace-checking, not a substitute.

Seen in

Last updated · 200 distilled / 1,178 read