DROPBOX 2024-05-31 Tier 2

Dropbox — Testing sync at Dropbox (2020)

Summary

Isaac Goldberg's (Dropbox) walkthrough of the testing strategy that allowed the team to rewrite Sync Engine Classic into Nucleus — the engine running on hundreds of millions of machines — without regressing a decade of bug fixes. The thesis: testability is an architectural concern, not a QA concern; they designed the data model, protocol, and concurrency model of Nucleus specifically so that deterministic, reproducible randomized testing frameworks could be built on top. Two such frameworks are described: CanopyCheck (QuickCheck-style property-based testing of the planner) and Trinity (end-to-end concurrency testing via a custom futures executor that also mocks filesystem, network, and time). The post is the canonical industry articulation of deterministic simulation testing as a production-engineering discipline for distributed-state consumer software — the same family as AWS's ShardStore executable spec, FoundationDB's simulation framework, and TigerBeetle's VOPR.

Key takeaways

  1. Testability is a property of the architecture, not the test suite. Sync Engine Classic's permissive client-server protocol, path-keyed data model, and "components fork threads freely + coordinate via global locks" concurrency model made deterministic testing structurally impossible. Nucleus re-designed all three to unlock it. (systems/dropbox-sync-engine-classic, systems/dropbox-nucleus)
  2. "Design away invalid system states" is a core Nucleus tenet. Example: in Sync Engine Classic the client could receive metadata for /baz/cat before its parent /baz; the local SQLite schema had to support orphaned nodes, which then made real orphan-bugs indistinguishable from transient-ok states. Nucleus's protocol rejects the parentless-node case at the wire, so the database can enforce "no node exists without a parent" as a testable invariant.
  3. Nucleus persists observations, not work. The data model is three trees — Remote (server state), Local (on-disk state), Synced (merge base) — from which the correct sync operations are derived. Sync Engine Classic persisted the pending work itself (create-here, upload-there). The observations model lets the planner compute operations as a pure function of the three trees, and enables the clean test goal "all three trees must converge to the same state." The Synced Tree is the innovation: it is a per-node merge base that disambiguates "was this file added remotely?" from "was this file deleted locally?" (concepts/merge-base-three-tree-sync)
  4. Unique node IDs turn renames/moves from O(n) into O(1). Classic keyed nodes by path, so a folder rename fanned out to O(descendants) deletes+adds, transiently exposing two inconsistent subtrees to the user. Nucleus represents nodes by a unique ID, so a rename is one attribute update + one atomic filesystem-level rename(2). The invariant "a moved folder is visible in exactly one location" is cheaply enforceable under this representation; it was structurally false in Classic.
  5. Single-threaded control + dedicated offload pools is the concurrency model that enables serialization-under-test. Nearly all Nucleus code runs on one "control" thread. Blocking/parallelizable work (network I/O, filesystem I/O, hashing) is offloaded to dedicated thread pools. Under test, async requests are serialized onto the main thread, so the entire engine runs single-threaded-deterministic. This is the load-bearing substrate under both CanopyCheck and Trinity.
  6. Fully deterministic randomized testing is a non-negotiable developer-experience requirement. The team's self-imposed rule: "All randomized testing frameworks must be fully deterministic and easily reproducible" — because failing-only-intermittently, hard-to-reproduce randomized tests are a well-known source of time sinks (Sync Engine Classic had many). Shape of every Nucleus random test: (1) generate seed; (2) instantiate PRNG (IsaacRng cited); (3) run entire test — init state, scheduling, failure injection — off that one PRNG; (4) on failure, output seed. The engineer can then add logging inline and rerun — "it's guaranteed to fail again." This is patterns/seed-recorded-failure-reproducibility. Scale: tens of millions of random runs per night against master, "in general 100% green." Regressions auto-file a task carrying seed + commit hash.
  7. Determinism must extend below the test framework into the system under test. Rust's default HashMap uses randomized hashing for DoS resistance; Nucleus overrides with a deterministic hasher because an adversarial user can only degrade their own performance via collisions. The commit hash is an input to reproducibility alongside the PRNG seed: "if the code changes, the course of execution may change too."
  8. CanopyCheck is narrow + property-based: randomize the three trees, run the planner, assert invariants. Runtime loop: (1) ask planner for a batch of concurrent ops; (2) randomly shuffle the batch (verifies order doesn't matter); (3) pretend each op succeeded by updating the trees; (4) repeat until planner returns nothing; (5) check invariants. Enforced invariants: termination (heuristic cutoff at 200 iterations), no panics (exhaustively covers assert! defensive checks), sync correctness (all three trees equal at end) + stronger asymmetric invariants (e.g. "a file only on Remote at start must exist in all three trees at end" — catches the degenerate "planner deletes everything" attack on the equality invariant). CanopyCheck caught an early Archives/Drafts/January directory-cycle bug (local move + remote move → cycle → tree-data-structure assert). Input shape so simple → QuickCheck-style shrinking works: iteratively remove nodes from the initial trees, re-run, keep shrinking if the failure persists (concepts/test-case-minimization).
  9. Generating coverage-useful random trees requires correlation. Naive "three independent random trees" gives disjoint paths → the planner never exercises its delete/edit/move logic. CanopyCheck first generates one tree, then perturbs it into the other two — better exploration of the actually-interesting sync scenarios while still random.
  10. Trinity is end-to-end: custom futures executor + mocked filesystem/network/time. Nucleus is one giant impl Future<Output = !> (never-returns) composed of per-worker futures (e.g. the upload worker is a FuturesUnordered over concurrent network requests). Trinity is a custom executor for that future. On every main-loop iteration it interleaves: call poll() on Nucleus; call poll() on the intercepted mock filesystem/network requests; run its own perturbation code — modifying the local or remote filesystem, reordering/failing RPCs, advancing the mockable timer arbitrarily, simulating crashes by snapshotting/restoring the in-memory filesystem. At sync completion, it asserts consistency and re-runs with the same seed to verify the determinism claim itself.
  11. In-memory mocks give ~10× performance and full failure-injection leverage. Trinity runs against an in-memory filesystem mock (snapshot/restore for crash simulation, arbitrary-order fault injection, ~10× faster than native FS) and a full-Rust backend mock (metadata DB, content storage, notifications — "all server-side services Nucleus depends on" emulated). A separate Trinity Native mode uses the real platform filesystem (10× slower, fewer seeds, but covers OS-specific permissions / xattrs / Smart Sync placeholder hydration etc.).
  12. Every mocking decision is a coverage trade-off. Post names three explicit gaps: (a) native-filesystem coverage — Trinity Native serializes native syscalls for determinism, but real users interleave them freely; Trinity cannot reboot mid-test, so fsync durability on each platform is out of scope. (b) Network protocol drift — the Rust backend mock may drift from the real servers; a separate Heirloom suite uses the same seeded-PRNG discipline but talks to a real Dropbox backend (weaker determinism, ~100× slower than Trinity). (c) Minimization doesn't scale end-to-end — even a small perturbation to the initial trees changes RPC scheduling downstream and invalidates a hard-won seed. Mitigation under consideration: decouple the global PRNG into per-subsystem PRNGs so perturbing one doesn't invalidate the others.
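
The protocol-level rejection in takeaway 2 fits in a few lines. A minimal sketch, assuming a toy `MetaDb` keyed by numeric ids — the type, method, and schema here are hypothetical illustrations, not Dropbox's actual design:

```rust
use std::collections::HashMap;

// Toy metadata store: node id -> parent id; id 0 is the implicit root.
struct MetaDb {
    parents: HashMap<u64, u64>,
}

impl MetaDb {
    fn new() -> Self {
        MetaDb { parents: HashMap::new() }
    }

    // Like Nucleus's protocol, refuse metadata for /baz/cat before /baz has
    // arrived. Because this is rejected at the boundary, "no node exists
    // without a parent" holds as a checkable database invariant.
    fn apply(&mut self, id: u64, parent: u64) -> Result<(), &'static str> {
        if parent != 0 && !self.parents.contains_key(&parent) {
            return Err("parent must arrive before child");
        }
        self.parents.insert(id, parent);
        Ok(())
    }
}
```

With this shape, an orphaned node is unrepresentable, so any orphan found in the store is a real bug rather than a transient-ok state.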
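
The three-tree planner of takeaway 3 can be sketched as a pure function; the `Tree` alias and `Op` variants below are illustrative stand-ins for Nucleus's real data model, not its actual types:

```rust
use std::collections::BTreeMap;

// Node id -> name; a stand-in for real per-node metadata.
type Tree = BTreeMap<u64, &'static str>;

#[derive(Debug, PartialEq)]
enum Op {
    Download(u64),     // added remotely
    Upload(u64),       // added locally
    DeleteLocal(u64),  // deleted remotely since last sync
    DeleteRemote(u64), // deleted locally since last sync
}

// The planner as a pure function of (Remote, Local, Synced): no persisted
// work queue. The Synced tree is the merge base that disambiguates
// "added remotely" from "deleted locally".
fn plan(remote: &Tree, local: &Tree, synced: &Tree) -> Vec<Op> {
    let mut ops = Vec::new();
    for &id in remote.keys() {
        if !local.contains_key(&id) {
            if synced.contains_key(&id) {
                ops.push(Op::DeleteRemote(id)); // synced once, then deleted locally
            } else {
                ops.push(Op::Download(id)); // never seen before: new remote file
            }
        }
    }
    for &id in local.keys() {
        if !remote.contains_key(&id) {
            if synced.contains_key(&id) {
                ops.push(Op::DeleteLocal(id)); // deleted on the server
            } else {
                ops.push(Op::Upload(id)); // new local file
            }
        }
    }
    ops
}
```

When all three trees are equal, `plan` returns nothing — exactly the "all three trees must converge" test goal.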
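
Takeaway 4's complexity argument can be made concrete with toy maps. Both functions are hypothetical sketches (the path-keyed one even ignores prefix edge cases like "/ab" vs "/a"), but the asymptotic contrast is the point:

```rust
use std::collections::BTreeMap;

// Path-keyed (Classic-style): renaming a folder rewrites every descendant row.
fn rename_by_path(tree: &mut BTreeMap<String, u64>, from: &str, to: &str) -> usize {
    let moved: Vec<(String, u64)> = tree
        .iter()
        .filter(|(path, _)| path.starts_with(from)) // sketch: ignores "/ab" vs "/a"
        .map(|(path, &id)| (path.clone(), id))
        .collect();
    for (path, id) in &moved {
        tree.remove(path);
        tree.insert(format!("{}{}", to, &path[from.len()..]), *id);
    }
    moved.len() // O(descendants) deletes+adds, just like Classic
}

// Id-keyed (Nucleus-style): a rename is one attribute update, O(1),
// because children reference the parent's id, not its path.
fn rename_by_id(names: &mut BTreeMap<u64, String>, id: u64, new_name: &str) {
    names.insert(id, new_name.to_string());
}
```

The O(descendants) fan-out is also why Classic transiently exposed two inconsistent subtrees mid-rename, while the single-row update makes "visible in exactly one location" cheap to enforce.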
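
The seed-driven test shape in takeaway 6 can be sketched with a tiny xorshift64 generator standing in for the IsaacRng the post cites (the `Prng` and `run_randomized_test` names are illustrative, not Dropbox's code):

```rust
// Xorshift64: a deterministic stand-in for IsaacRng.
struct Prng(u64);

impl Prng {
    fn new(seed: u64) -> Self {
        Prng(seed | 1) // xorshift state must be nonzero
    }
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

// The shape of every Nucleus random test: one seed drives everything
// (initial state, scheduling, failure injection), so the "course of
// execution" -- here a toy event trace -- is a pure function of the seed.
// On failure, printing the seed is enough to reproduce the run exactly.
fn run_randomized_test(seed: u64) -> Vec<u64> {
    let mut rng = Prng::new(seed);
    (0..8).map(|_| rng.next() % 100).collect()
}
```

The key discipline is that no randomness enters the test from anywhere but this one seeded generator; any second source (time, thread scheduling, hash order) silently breaks the (seed, commit) → trace contract.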
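
Takeaway 7's hasher override can be sketched over std's `HashMap` using FNV-1a (the post does not say which deterministic hasher Nucleus actually uses; FNV is an assumption here for illustration):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// FNV-1a: deterministic, unlike std's randomly seeded default hasher.
struct FnvHasher(u64);

impl Default for FnvHasher {
    fn default() -> Self {
        FnvHasher(0xcbf2_9ce4_8422_2325) // FNV offset basis
    }
}

impl Hasher for FnvHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x100_0000_01b3); // FNV prime
        }
    }
}

// Every instance hashes identically, so bucket layout and iteration order
// are reproducible from (seed, commit hash) alone.
type DeterministicMap<K, V> = HashMap<K, V, BuildHasherDefault<FnvHasher>>;
```

The trade-off matches the post's reasoning: an adversary who engineers collisions can only degrade their own client's performance, so DoS-resistant random hashing buys nothing here.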
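
The shrinking loop in takeaway 8 generalizes to a few lines: remove one element at a time and keep any smaller input that still fails. A generic sketch (not CanopyCheck's implementation, which shrinks tree nodes rather than a flat vector):

```rust
// QuickCheck-style shrinking: iteratively drop elements from a failing
// input, keeping any smaller input that still reproduces the failure,
// until no single removal does.
fn shrink<T: Clone>(input: Vec<T>, fails: impl Fn(&[T]) -> bool) -> Vec<T> {
    let mut current = input;
    loop {
        let mut shrunk = false;
        for i in 0..current.len() {
            let mut candidate = current.clone();
            candidate.remove(i);
            if fails(&candidate) {
                current = candidate; // still fails: keep the smaller input
                shrunk = true;
                break;
            }
        }
        if !shrunk {
            return current; // locally minimal failing input
        }
    }
}
```

This only works because CanopyCheck's failure predicate depends solely on the initial trees; as takeaway 12 notes, the same move fails end-to-end, where removing a node reshuffles all downstream scheduling.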
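
Takeaway 10's custom executor can be miniaturized: a no-op waker plus a loop that interleaves `poll()` with perturbations on one thread. The `Engine` future and `drive` loop below are toy stand-ins for Nucleus and Trinity, assuming only std's futures machinery:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A no-op waker: the test executor decides when to poll, so wakeups
// are unnecessary.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Skeleton of a Trinity-style main loop: interleave polling the engine
// future with environment perturbations, all single-threaded.
fn drive<F: Future>(mut engine: Pin<&mut F>, mut perturb: impl FnMut(u64)) -> F::Output {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut step = 0;
    loop {
        if let Poll::Ready(out) = engine.as_mut().poll(&mut cx) {
            return out;
        }
        perturb(step); // e.g. fail an RPC, advance mock time, snapshot the mock FS
        step += 1;
    }
}

// A toy "engine" that needs several polls before it finishes.
struct Engine {
    polls_left: u32,
}

impl Future for Engine {
    type Output = &'static str;
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.polls_left == 0 {
            Poll::Ready("synced")
        } else {
            self.polls_left -= 1;
            Poll::Pending
        }
    }
}
```

In the real Trinity, `perturb` is where the seeded PRNG injects faults, and the mock filesystem/network requests intercepted from the engine are also polled in this same loop.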

Architectural numbers & shapes

  • Sync Engine Classic age: 12+ years running in production at time of writing.
  • Deployment target: "hundreds of millions of user machines," each a "wildly different environment."
  • Random-test volume: tens of millions of runs per night.
  • Determinism contract: (seed, commit hash) → exact execution trace.
  • Trinity Native vs Trinity: ~10× slower on native filesystem than in-memory mock.
  • Heirloom vs Trinity: ~100× slower (talks to real backend).
  • Termination heuristic: 200 planning iterations cap.

Concepts introduced / reinforced

  • concepts/deterministic-simulation (new). (seed, commit) → reproducible full-system trace; the canonical production-engineering discipline behind CanopyCheck + Trinity + FoundationDB simulation + TigerBeetle VOPR.
  • concepts/design-away-invalid-states (new). Core Nucleus tenet: when invalid states are representable in the data model / protocol, real bugs become indistinguishable from transient-ok states (orphan-node example).
  • concepts/merge-base-three-tree-sync (new). Three-tree (Remote / Local / Synced) observations model that makes sync direction computable and the convergence goal cleanly expressible.
  • concepts/test-case-minimization (new). QuickCheck-style iterative shrinking of a failing input (trees minus nodes); Trinity's scope limit ("everything changes if you perturb anything") is the contrast case.
  • concepts/lightweight-formal-verification (reinforced). Same family as ShardStore's executable spec: "the test is a well-defined property of the system" rather than "the test reimplements the system."
  • concepts/memory-safety (reinforced). Rust mentioned as the implementation substrate; follow-up post promised on "how we leverage Rust's type system" for the design-away-invalid-states principle.

Patterns introduced / reinforced

  • patterns/property-based-testing (new). QuickCheck-lineage: randomize typed inputs; assert invariants rather than specific outputs; shrink on failure.
  • patterns/seed-recorded-failure-reproducibility (new). On failure, output (seed, commit-hash). Engineer adds logging inline and re-runs — "guaranteed to fail again." The developer-experience contract that keeps randomized testing usable at tens of millions of runs per night.
  • patterns/single-threaded-with-offload (new). Main control thread + dedicated offload pools; serializable under test so the whole engine runs deterministic-single-threaded. The architectural substrate beneath CanopyCheck and Trinity.

Systems named

  • systems/dropbox-nucleus — the new sync engine (Rust; three-tree data model; single control thread + offload; impl Future all the way down).
  • systems/dropbox-sync-engine-classic — the legacy system, the "why this rewrite was hard" foil.
  • systems/canopycheck — QuickCheck-inspired randomized property testing for the planner only.
  • systems/trinity — end-to-end randomized testing with mocked FS/network/time; custom Rust futures executor drives Nucleus.
  • systems/heirloom — separate test suite talking to a real Dropbox backend (~100× slower than Trinity); deferred to a future post.

Cross-references

Caveats

  • Timing: post is dated 2024-05-31 but describes work done for the Nucleus rewrite which shipped around 2020 (the URL slug hedges "2020"). Architectural choices described should be read as 2020-era Dropbox, lightly updated.
  • No hard bug-find counts beyond the Archives/Drafts/January cycle anecdote; "enormous number of bugs" is qualitative.
  • No defect-escape-rate comparison between Classic and Nucleus; the claim is that the rewrite didn't regress Classic's stability, not that it measurably improved a metric.
  • Heirloom internals are deferred to a future post.
  • Rust type-system exploitation for invalid-state-design-away is also deferred to a future post. (The prior post on the rewrite decision is linked: Rewriting the heart of our sync engine.)
  • Not a vendor pitch — author is named (Isaac Goldberg); special thanks to Ben Blum, Sujay Jayakar, Geoff Song, John Lai, Liam Whelan, Gautam Gupta, Iulia Tamas, Sync team. HN discussion at 68 points.

Source
