Dropbox — Testing sync at Dropbox (2020)¶
Summary¶
Isaac Goldberg's (Dropbox) walkthrough of the testing strategy that allowed the team to rewrite Sync Engine Classic into Nucleus — the engine running on hundreds of millions of machines — without regressing a decade of bug fixes. The thesis: testability is an architectural concern, not a QA concern; they designed the data model, protocol, and concurrency model of Nucleus specifically so that deterministic, reproducible randomized testing frameworks could be built on top. Two such frameworks are described: CanopyCheck (QuickCheck-style property-based testing of the planner) and Trinity (end-to-end concurrency testing via a custom futures executor that also mocks filesystem, network, and time). The post is the canonical industry articulation of deterministic simulation testing as a production-engineering discipline for distributed-state consumer software — the same family as AWS's ShardStore executable spec, FoundationDB's simulation framework, and TigerBeetle's VOPR.
Key takeaways¶
- Testability is a property of the architecture, not the test suite. Sync Engine Classic's permissive client-server protocol, path-keyed data model, and "components fork threads freely + coordinate via global locks" concurrency model made deterministic testing structurally impossible. Nucleus re-designed all three to unlock it. systems/dropbox-sync-engine-classic → systems/dropbox-nucleus
- Design away invalid system states is a core Nucleus tenet. Example: in Sync Engine Classic the client could receive metadata for `/baz/cat` before its parent `/baz`; the local SQLite schema had to support orphaned nodes, which then made real orphan bugs indistinguishable from transient-OK states. Nucleus's protocol rejects the parentless-node case at the wire, so the database can enforce "no node exists without a parent" as a testable invariant.
- Nucleus persists observations, not work. The data model is three trees — Remote (server state), Local (on-disk state), Synced (merge base) — from which the correct sync operations are derived. Sync Engine Classic persisted the pending work itself (create-here, upload-there). The observations model lets the planner compute operations as a pure function of the three trees, and enables the clean test goal "all three trees must converge to the same state." The Synced Tree is the innovation: it is a per-node merge base that disambiguates "was this file added remotely?" from "was this file deleted locally?" (concepts/merge-base-three-tree-sync)
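The per-node merge-base logic can be sketched in a few lines — toy types and names (`Node`, `plan`), not Dropbox's actual API; real Nucleus plans over whole trees, IDs, and moves, but the pure-function-of-three-observations shape is the point:

```rust
/// Toy per-node state: either absent or a file with a content hash.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Node {
    Absent,
    File(u64), // u64 stands in for a content hash
}

#[derive(Debug, PartialEq)]
enum Action {
    Nothing,  // already converged
    Upload,   // only the local side diverged from the merge base
    Download, // only the remote side diverged from the merge base
    Conflict, // both sides diverged
}

/// Pure function of the three observed states — no persisted work queue.
fn plan(remote: Node, local: Node, synced: Node) -> Action {
    if remote == local {
        Action::Nothing
    } else if local == synced {
        Action::Download // remote changed: includes "file added remotely"
    } else if remote == synced {
        Action::Upload // local changed: includes "file deleted locally"
    } else {
        Action::Conflict
    }
}
```

The merge base is what disambiguates the two cases the post names: `plan(File(1), Absent, Absent)` is a remote add (download it), while `plan(File(1), Absent, File(1))` is a local delete (propagate the delete, folded into `Upload` here).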
- Unique node IDs turn renames/moves from O(n) into O(1). Classic keyed nodes by path, so a folder rename fanned out to O(descendants) deletes+adds, transiently exposing two inconsistent subtrees to the user. Nucleus represents nodes by a unique ID, so a rename is one attribute update + one atomic filesystem-level `rename(2)`. The invariant "a moved folder is visible in exactly one location" is cheaply enforceable under this representation; it was structurally false in Classic.
- Single-threaded control + dedicated offload pools is the concurrency model that enables serialization-under-test. Nearly all Nucleus code runs on one "control" thread. Blocking/parallelizable work (network I/O, filesystem I/O, hashing) is offloaded to dedicated thread pools. Under test, async requests are serialized onto the main thread, so the entire engine runs single-threaded and deterministic. This is the load-bearing substrate under both CanopyCheck and Trinity.
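The rename-cost difference can be sketched with a hypothetical ID-keyed tree (the `Tree`/`NodeId` types are illustrative, not Nucleus's): a move rewrites exactly one record, and descendants follow automatically because they reference their parent by ID rather than by path:

```rust
use std::collections::HashMap;

type NodeId = u64;

struct Node {
    parent: Option<NodeId>,
    name: String,
}

struct Tree {
    nodes: HashMap<NodeId, Node>,
}

impl Tree {
    /// O(1): one attribute update, regardless of subtree size.
    fn mv(&mut self, id: NodeId, new_parent: NodeId, new_name: &str) {
        let n = self.nodes.get_mut(&id).expect("node exists");
        n.parent = Some(new_parent);
        n.name = new_name.to_string();
    }

    /// Paths are derived on demand, so they can never be inconsistent
    /// with the tree ("a moved folder is visible in exactly one location").
    fn path(&self, id: NodeId) -> String {
        let n = &self.nodes[&id];
        match n.parent {
            None => String::new(), // root
            Some(p) => format!("{}/{}", self.path(p), n.name),
        }
    }
}
```

In a path-keyed store, the equivalent move would rewrite one row per descendant, and the invariant would only hold after the whole fan-out completed.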
- Fully deterministic randomized testing is a non-negotiable developer-experience requirement. The team's self-imposed rule: "All randomized testing frameworks must be fully deterministic and easily reproducible" — because failing-only-intermittently, hard-to-reproduce randomized tests are a well-known time sink (Sync Engine Classic had many). Shape of every Nucleus random test: (1) generate a seed; (2) instantiate a PRNG (`IsaacRng` is cited); (3) run the entire test — initial state, scheduling, failure injection — off that one PRNG; (4) on failure, output the seed. The engineer can then add logging inline and rerun — "it's guaranteed to fail again." This is patterns/seed-recorded-failure-reproducibility. Scale: tens of millions of random runs per night against master, "in general 100% green." Regressions auto-file a task carrying seed + commit hash.
- Determinism must extend below the test framework into the system under test. Rust's default `HashMap` uses randomized hashing for DoS resistance; Nucleus overrides it with a deterministic hasher because an adversarial user can only degrade their own performance via collisions. The commit hash is an input to reproducibility alongside the PRNG seed: "if the code changes, the course of execution may change too."
- CanopyCheck is narrow + property-based:
randomize the three trees, run the planner, assert invariants.
Runtime loop: (1) ask planner for a batch of concurrent ops;
(2) randomly shuffle the batch (verifies order doesn't matter);
(3) pretend each op succeeded by updating the trees; (4) repeat
until the planner returns nothing; (5) check invariants. Enforced invariants: termination (heuristic cutoff at 200 iterations), no panics (exhaustively covers `assert!` defensive checks), sync correctness (all three trees equal at end) + stronger asymmetric invariants (e.g. "a file only on Remote at start must exist in all three trees at end" — catches the degenerate "planner deletes everything" attack on the equality invariant). CanopyCheck caught an early Archives/Drafts/January directory-cycle bug (local move + remote move → cycle → tree-data-structure assert). The input shape is simple enough that QuickCheck-style shrinking works: iteratively remove nodes from the initial trees, re-run, and keep shrinking while the failure persists (concepts/test-case-minimization).
- Generating coverage-useful random trees requires correlation. Naive "three independent random trees" gives disjoint paths, so the planner never exercises its delete/edit/move logic. CanopyCheck first generates one tree, then perturbs it into the other two — better exploration of the actually interesting sync scenarios while still random.
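The CanopyCheck runtime loop, in miniature — a toy download-only "planner" over trees modeled as flat path→hash maps (none of this is Dropbox's API); the harness shape is what matters: plan a batch, shuffle it off the test PRNG, apply, repeat under a 200-iteration cap, then check convergence:

```rust
use std::collections::BTreeMap;

type Tree = BTreeMap<String, u64>; // path -> content hash (toy model)

/// Toy planner: one op per divergent path; `None` means delete locally.
fn plan_batch(remote: &Tree, local: &Tree) -> Vec<(String, Option<u64>)> {
    let mut ops = Vec::new();
    for (p, h) in remote {
        if local.get(p) != Some(h) {
            ops.push((p.clone(), Some(*h)));
        }
    }
    for p in local.keys() {
        if !remote.contains_key(p) {
            ops.push((p.clone(), None));
        }
    }
    ops
}

/// Returns true iff the loop reached quiescence within the 200-iteration cutoff.
fn run(seed: u64, remote: &Tree, local: &mut Tree) -> bool {
    let mut rng = seed;
    for _ in 0..200 {
        let mut batch = plan_batch(remote, local);
        if batch.is_empty() {
            return true;
        }
        // Fisher-Yates shuffle driven by a tiny LCG: ops within one batch
        // are "concurrent", so application order must not matter.
        for i in (1..batch.len()).rev() {
            rng = rng.wrapping_mul(6364136223846793005).wrapping_add(1);
            batch.swap(i, (rng >> 33) as usize % (i + 1));
        }
        for (p, op) in batch {
            match op {
                Some(h) => { local.insert(p, h); }
                None => { local.remove(&p); }
            }
        }
    }
    false // termination invariant violated
}
```

After `run` returns, the harness asserts the termination invariant (the return value), convergence (`local == remote`), and stronger asymmetric checks such as "a path present only on Remote at the start exists at the end."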
- Trinity is end-to-end: a custom futures executor + mocked filesystem/network/time. Nucleus is one giant `impl Future<Output = !>` (never returns), composed of per-worker futures (e.g. the upload worker is a `FuturesUnordered` over concurrent network requests). Trinity is a custom executor for that future. On every main-loop iteration it interleaves: call `poll()` on Nucleus; call `poll()` on the intercepted mock filesystem/network requests; run its own perturbation code — modifying the local or remote filesystem, reordering/failing RPCs, advancing the mockable timer arbitrarily, simulating crashes by snapshotting/restoring the in-memory filesystem. At sync completion it asserts consistency, then re-runs with the same seed to verify that execution really is deterministic.
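A minimal sketch of the executor idea — a no-op waker, one illustrative `MockSleep` future, and a drive loop whose "perturbation" step just advances a mocked clock. Trinity's real loop additionally polls mock filesystem/network requests and injects faults; the types here are mine, not Dropbox's:

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// Standard no-op waker: the executor polls in a loop, so wakeups are moot.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

/// Illustrative future blocked on mocked time (resolves to the wake time).
struct MockSleep {
    clock: Rc<Cell<u64>>,
    deadline: u64,
}

impl Future for MockSleep {
    type Output = u64;
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u64> {
        if self.clock.get() >= self.deadline {
            Poll::Ready(self.clock.get())
        } else {
            Poll::Pending
        }
    }
}

/// Trinity-shaped drive loop: poll the engine; while Pending, perturb.
fn run_to_completion<F: Future>(fut: F, clock: Rc<Cell<u64>>) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(v) => return v,
            // Perturbation step: here, arbitrarily advance the mocked timer.
            Poll::Pending => clock.set(clock.get() + 10),
        }
    }
}
```

Because the clock only moves when the executor decides, "arbitrary" time advances are just another deterministic, PRNG-driven choice rather than a source of flakiness.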
- In-memory mocks give ~10× performance and full failure-injection leverage. Trinity runs against an in-memory filesystem mock (snapshot/restore for crash simulation, arbitrary-order fault injection, ~10× faster than native FS) and a full-Rust backend mock (metadata DB, content storage, notifications — "all server-side services Nucleus depends on" emulated). A separate Trinity Native mode uses the real platform filesystem (10× slower, fewer seeds, but covers OS-specific permissions / xattrs / Smart Sync placeholder hydration etc.).
- Every mocking decision is a coverage trade-off. The post names three explicit gaps: (a) native-filesystem coverage — Trinity Native serializes native syscalls for determinism, but real users interleave them freely; Trinity cannot reboot mid-test, so `fsync` durability on each platform is out of scope. (b) Network protocol drift — the Rust backend mock may drift from the real servers; a separate Heirloom suite uses the same seeded-PRNG discipline but talks to a real Dropbox backend (weaker determinism, ~100× slower than Trinity). (c) Minimization doesn't scale end-to-end — even a small perturbation to the initial trees changes RPC scheduling downstream and invalidates a hard-won seed. Mitigation under consideration: decouple the global PRNG into per-subsystem PRNGs so perturbing one doesn't invalidate the others.
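The per-subsystem-PRNG mitigation can be sketched as deriving independent child streams from the one master seed (splitmix64 is my stand-in derivation function; the post doesn't name one), so a subsystem consuming more or fewer draws no longer shifts every other subsystem's randomness:

```rust
/// splitmix64: a small, well-known mixer, used here only to derive
/// independent child seeds from one master seed.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// One seed per subsystem instead of one shared PRNG: perturbing the
/// filesystem stream leaves the network and timer streams untouched.
struct SubsystemSeeds {
    fs: u64,
    net: u64,
    timer: u64,
}

fn derive(master_seed: u64) -> SubsystemSeeds {
    let mut s = master_seed;
    SubsystemSeeds {
        fs: splitmix64(&mut s),
        net: splitmix64(&mut s),
        timer: splitmix64(&mut s),
    }
}
```

The failure report still carries only the master seed; each subsystem reconstructs its own stream from it.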
Architectural numbers & shapes¶
- Sync Engine Classic age: 12+ years running in production at time of writing.
- Deployment target: "hundreds of millions of user machines," each a "wildly different environment."
- Random-test volume: tens of millions of runs per night.
- Determinism contract: (seed, commit hash) → exact execution trace.
- Trinity Native vs Trinity: ~10× slower on native filesystem than in-memory mock.
- Heirloom vs Trinity: ~100× slower (talks to real backend).
- Termination heuristic: 200 planning iterations cap.
Concepts introduced / reinforced¶
- concepts/deterministic-simulation — new. (seed, commit) → reproducible full-system trace; the canonical production-engineering discipline behind CanopyCheck + Trinity + FoundationDB simulation + TigerBeetle VOPR.
- concepts/design-away-invalid-states — new. Core Nucleus tenet: when invalid states exist in the data model / protocol, real-bug detection is impossible (orphan-node example).
- concepts/merge-base-three-tree-sync — new. Three-tree (Remote / Local / Synced) observations model that makes sync direction computable and the convergence goal cleanly expressible.
- concepts/test-case-minimization — new. QuickCheck-style iterative shrinking of a failing input (trees minus nodes); Trinity's scope limit ("everything changes if you perturb anything") is the contrast case.
- concepts/lightweight-formal-verification — reinforced. Same family as ShardStore's executable spec: "the test is a well-defined property of the system" rather than "the test reimplements the system."
- concepts/memory-safety — reinforced. Rust mentioned as the implementation substrate; follow-up post promised on "how we leverage Rust's type system" for the design-away-invalid-states principle.
Patterns introduced / reinforced¶
- patterns/property-based-testing — new. QuickCheck-lineage: randomize typed inputs; assert invariants rather than specific outputs; shrink on failure.
- patterns/seed-recorded-failure-reproducibility — new. On failure, output (seed, commit-hash). Engineer adds logging inline and re-runs — "guaranteed to fail again." The developer-experience contract that keeps randomized testing usable at tens-of-millions of runs per night.
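A sketch of that contract (the `Lcg` and `run_once` names are mine, not Dropbox's): one seed drives every random decision, and the only artifact a failure needs to emit is (seed, commit):

```rust
/// Toy PRNG standing in for the IsaacRng the post cites.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

/// Runs one randomized test entirely off `seed`. On failure, the report
/// carries everything needed to replay the exact execution.
fn run_once(seed: u64, commit: &str, test: impl Fn(&mut Lcg) -> bool) -> Result<(), String> {
    let mut rng = Lcg(seed);
    if test(&mut rng) {
        Ok(())
    } else {
        Err(format!("FAILED: rerun with seed={seed} at commit={commit}"))
    }
}
```

Replaying with the same (seed, commit) pair reproduces the identical run, which is why an engineer can add logging inline and be "guaranteed to fail again."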
- patterns/single-threaded-with-offload — new. Main control thread + dedicated offload pools; serializable under test so the whole engine runs deterministic-single-threaded. The architectural substrate beneath CanopyCheck and Trinity.
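A minimal sketch of the pattern under an assumed `Mode` switch (not Dropbox's actual mechanism): the same request either runs on an offload thread in production or inline on the control thread under test — identical results, but the inline path is fully serialized and deterministic:

```rust
use std::sync::mpsc;
use std::thread;

enum Mode {
    Offload, // production: dedicated pool thread
    Inline,  // test: serialized onto the control thread
}

/// Stand-in for blocking work such as content hashing.
fn hash_block(data: &[u8]) -> u64 {
    data.iter()
        .fold(0u64, |h, b| h.wrapping_mul(31).wrapping_add(*b as u64))
}

fn run_hash(mode: Mode, data: Vec<u8>) -> u64 {
    match mode {
        Mode::Inline => hash_block(&data),
        Mode::Offload => {
            let (tx, rx) = mpsc::channel();
            thread::spawn(move || {
                tx.send(hash_block(&data)).unwrap();
            });
            rx.recv().unwrap() // control thread collects the offloaded result
        }
    }
}
```

Because offloaded work communicates only through request/response (no shared mutable state), swapping the pool for inline execution cannot change observable behavior — which is exactly what makes the single-threaded test mode faithful.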
Systems named¶
- systems/dropbox-nucleus — the new sync engine (Rust; three-tree data model; single control thread + offload; `impl Future` all the way down).
- systems/dropbox-sync-engine-classic — the legacy system, the "why this rewrite was hard" foil.
- systems/canopycheck — QuickCheck-inspired randomized property testing for the planner only.
- systems/trinity — end-to-end randomized testing with mocked FS/network/time; custom Rust futures executor drives Nucleus.
- systems/heirloom — separate test suite talking to a real Dropbox backend (~100× slower than Trinity); deferred to a future post.
Cross-references¶
- sources/2025-06-02-mongodb-conformance-checking-at-mongodb-testing-that-our-code-matches-our-tla-specs — closest sibling on the testing-distributed-systems axis, different stance. Dropbox Nucleus: implementation is its own spec; invariants asserted on observed state; deterministic simulation + property tests as the tools. MongoDB 2020: TLA+ spec adjacent to the implementation; patterns/trace-checking + patterns/test-case-generation-from-spec as the tools. Dropbox's choice avoids the spec-impl abstraction-mismatch failure mode that killed MongoDB's trace-checking of RaftMongo.tla — at the price of not having an independent artifact that says what the system should do. The Davis 2025 retrospective makes clear that writing the spec just before (not after) the implementation (patterns/extreme-modelling) is what keeps the TLA+-spec route viable — an ordering Dropbox's Nucleus rewrite didn't need to think about because there's no spec.
Caveats¶
- Timing: post is dated 2024-05-31 but describes work done for the Nucleus rewrite which shipped around 2020 (the URL slug hedges "2020"). Architectural choices described should be read as 2020-era Dropbox, lightly updated.
- No hard bug-find counts beyond the Archives/Drafts/January cycle anecdote; "enormous number of bugs" is qualitative.
- No defect-escape-rate comparison between Classic and Nucleus; the claim is that the rewrite didn't regress Classic's stability, not that it measurably improved a metric.
- Heirloom internals are deferred to a future post.
- Rust type-system exploitation for invalid-state-design-away is also deferred to a future post. (The prior post on the rewrite decision is linked: Rewriting the heart of our sync engine.)
- Not a vendor pitch — author is named (Isaac Goldberg); special thanks to Ben Blum, Sujay Jayakar, Geoff Song, John Lai, Liam Whelan, Gautam Gupta, Iulia Tamas, Sync team. HN discussion at 68 points.