Just make it scale: An Aurora DSQL story (Werner Vogels, guest-authored by Niko Matsakis & Marc Bowes)

Summary

Werner Vogels hosts a guest post by Sr. Principal Engineers Niko Matsakis (a core Rust language designer) and Marc Bowes on the engineering journey of systems/aurora-dsql — a serverless distributed SQL database launched at re:Invent 2024. The piece is a retrospective on two architectural bets that defined the project: (1) how to scale writes horizontally without two-phase commit, landing on a single-journal-per-transaction design with a novel Crossbar subscription router between journals and storage; and (2) why the team moved from 100% JVM (Kotlin) to 100% Rust across both data and control planes — driven by tail-latency math at scale (the probability that a transaction hits at least one GC pause approaches 1) and memory-safety economics (new code contributes the overwhelming majority of new vulnerabilities). DSQL extends PostgreSQL via its public extension API rather than forking it.

Architecture

  • Goal: serverless, horizontally scaling relational DB with no infra management; multi-region; zero ops overhead. Positioned as the next step after Aurora (cloud-optimized storage) and Aurora Serverless (vertical scaling).
  • Decomposition: "bite-sized chunks with clear interfaces and explicit contracts." Unix-mantra modules combining to provide full ACID.
  • Read path (solved 2021): not detailed in this post.
  • Write path: the 2024 breakthrough.
  • Rejected design: 2PC-per-journal with row-range ownership. Fine on the happy path, but timeouts, liveness issues, rollbacks, and coordinator failures compound operational complexity.
  • Chosen design: the entire commit goes into a single journal, no matter how many rows it touches. One append is one commit, so atomicity and durability come for free.
  • Consequence: read path now must check all journals — any journal may hold the latest value for a row. Storage needs connections to every journal → network bandwidth wall as journals scale.
  • Crossbar (solves the read-path blow-up):
      • Sits between journals and storage.
      • Subscription API: storage nodes subscribe to key ranges.
      • Follows each journal (each ordered by transaction time) and assembles a total order, routing updates to subscribed storage nodes.
      • Separates write-path scaling from read-path scaling.
  • Adjudicator: sits at the front of the journal; ensures only one transaction wins on a conflict. (Used as the pilot for the Rust migration; see below.)
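
  The subscription-routing idea can be sketched as a k-way merge over per-journal ordered streams, fanned out by key range. This is a minimal illustration under stated assumptions; all type and function names here are invented, not DSQL's actual API:

  ```rust
  use std::cmp::Reverse;
  use std::collections::BinaryHeap;
  use std::ops::Range;

  // One committed update as it appears in a journal (hypothetical shape).
  #[derive(Clone, Debug)]
  struct Update {
      txn_time: u64, // each journal is internally ordered by this
      key: u64,      // row key, simplified to an integer
      value: String,
  }

  // A storage node subscribes to a half-open key range.
  struct Subscriber {
      range: Range<u64>,
      received: Vec<Update>,
  }

  // Merge N journals (each already sorted by txn_time) into one total order,
  // routing each update to every subscriber whose range covers its key.
  fn crossbar(journals: Vec<Vec<Update>>, subs: &mut [Subscriber]) {
      // Min-heap of (txn_time, journal index, position): pop smallest time next.
      let mut heap = BinaryHeap::new();
      for (j, journal) in journals.iter().enumerate() {
          if !journal.is_empty() {
              heap.push(Reverse((journal[0].txn_time, j, 0usize)));
          }
      }
      while let Some(Reverse((_, j, i))) = heap.pop() {
          let u = &journals[j][i];
          for s in subs.iter_mut() {
              if s.range.contains(&u.key) {
                  s.received.push(u.clone());
              }
          }
          if i + 1 < journals[j].len() {
              heap.push(Reverse((journals[j][i + 1].txn_time, j, i + 1)));
          }
      }
  }
  ```

  The point of the sketch: storage nodes never talk to journals directly, so adding journals (write scaling) does not multiply storage-node connections (read scaling).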

Numbers to remember

  • Kotlin (JVM) Adjudicator: 2,000 → 3,000 TPS after years of incremental tuning.
  • Rust Adjudicator (rewrite by Java devs, no tuning): 30,000 TPS — 10× improvement.
  • Crossbar simulation, 40 hosts, modeling 1-second GC stalls: expected ~1M TPS, got ~6K TPS; tail latency blew up from 1s → 10s. Fundamental architecture-level problem: every transaction reads from every host, so at scale, P(at least one GC pause per transaction) → 1.
  • Post-Rust rewrite: p99 tracks close to p50 — tail latency consistent with median.
  • Postgres: started 1986, >1M lines of C, thousands of contributors.

Key takeaways

  1. Scaling writes: avoid 2PC, pay the read-path price. Instead of sharding rows across journals and paying 2PC cost on multi-journal transactions, DSQL writes the entire commit into a single journal. This trades cheap writes for a heavier read path, but the read path can be solved by a dedicated component (Crossbar) that fans out subscribed updates to storage. (Source: body, "Scaling the Journal layer".)
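
  The "one append is one commit" idea reduces to a data-shape decision; a minimal sketch, with hypothetical types that are not DSQL's actual ones:

  ```rust
  // All row writes from one transaction travel together in a single record,
  // so the journal append itself is the atomic + durable commit point.
  #[derive(Debug)]
  struct RowWrite {
      key: String,
      value: String,
  }

  #[derive(Debug)]
  struct CommitRecord {
      txn_time: u64,
      writes: Vec<RowWrite>, // every row the transaction touched
  }

  #[derive(Default)]
  struct Journal {
      records: Vec<CommitRecord>,
  }

  impl Journal {
      // One append = one commit, regardless of row count. No row is ever
      // split across journals, so no cross-journal coordination (2PC) exists.
      fn append(&mut self, record: CommitRecord) {
          self.records.push(record);
      }
  }
  ```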

  2. The Crossbar is a pub-sub total-order router. Storage nodes subscribe to key ranges; Crossbar follows each journal (already ordered by transaction time) and composes the global total order, routing updates to subscribers. See systems/aurora-dsql for the detailed component breakdown.

  3. At scale, tail latency becomes the system's latency. If every transaction has to touch N hosts and each host has some probability p of a GC pause, the probability that at least one host stalls during the transaction approaches 1 as N grows. DSQL's 40-host simulation hit 0.6% of target TPS and 10× the target tail latency. This is the classic "tail at scale" effect (Dean & Barroso), and it forced the language choice. See concepts/tail-latency-at-scale. (Source: body, "Scaling the Journal layer".)
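
  The math behind takeaway 3 is easy to reproduce. If each of N hosts independently stalls with probability p during a transaction's window, the chance the transaction sees at least one stall is 1 − (1 − p)^N. A back-of-envelope sketch (my own illustration, not code from the post):

  ```rust
  // Probability that a transaction fanning out to `n` hosts hits at least
  // one stall, if each host independently stalls with probability `p`.
  fn p_any_stall(p: f64, n: u32) -> f64 {
      1.0 - (1.0 - p).powi(n as i32)
  }
  ```

  With p = 1% per host, a 40-host fan-out already stalls roughly a third of transactions (1 − 0.99^40 ≈ 0.33), and at 500 hosts a stall is near-certain (≈ 0.99). That is why p99, not p50, becomes the effective latency.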

  4. Rust chosen over JVM tuning or C/C++ because it delivered (a) predictable performance without GC, (b) memory safety without giving up control, (c) zero-cost abstractions. The JVM optimization path was viable but bounded; the C/C++ path was viable but unsafe. The 10× Adjudicator result settled the question: the team stopped asking "should we use Rust?" and started asking "where else can Rust help?"

  5. Pilot-project migration: start small, measure, expand. DSQL did not rewrite the whole data plane first. The team picked the Adjudicator as the pilot: the simplest component, with a Rust journal client already available. See patterns/pilot-component-language-migration. The language choice is a "one-way door" decision; it was de-risked by building proof at the cheapest blast radius first.

  6. Postgres via extensions, not a fork. Postgres is designed extensible; the team used its public extension API rather than hard-forking. Forks start with good intentions and "slowly drift into maintenance nightmares." DSQL replaced replication, concurrency control, durability, storage, session management — all via extensions that live in their own packages/files. See patterns/postgres-extension-over-fork. (Source: body, "It's easier to fix one hard problem…".)

  7. Rust for the extensions, not C. Initial instinct: C, for the lowest impedance mismatch with Postgres. Reality: Postgres's existing C is battle-tested; new code is where new bugs live. Google's Android team published data showing that most new memory-safety bugs come from new code, matching the team's observation. Rust lets the team build safe abstractions (e.g. a single String type encapsulating the char* + len invariants that Postgres header comments can only warn about) and encode them in the type system. See concepts/memory-safety. (Source: body, "It's easier to fix one hard problem…".)
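
  That kind of invariant-encoding can be sketched as a wrapper that enforces, by construction, the contract a C API documents only in comments. This is an illustrative sketch, not DSQL's actual abstraction (Rust's own `std::ffi::CString` plays a similar role):

  ```rust
  // A string wrapper that guarantees: no interior NUL bytes, plus a trailing
  // NUL, so the buffer can always be handed to C code as a `char *` with a
  // matching `len`. The invariant lives in the constructor, not in comments.
  #[derive(Debug)]
  struct CSafeString {
      bytes: Vec<u8>, // always ends with exactly one NUL; no interior NULs
  }

  impl CSafeString {
      fn new(s: &str) -> Result<Self, &'static str> {
          if s.bytes().any(|b| b == 0) {
              return Err("interior NUL byte");
          }
          let mut bytes = s.as_bytes().to_vec();
          bytes.push(0);
          Ok(CSafeString { bytes })
      }

      // Length in bytes, excluding the trailing NUL (the `len` C expects).
      fn len(&self) -> usize {
          self.bytes.len() - 1
      }

      // Pointer a C function could consume; the invariant guarantees it is
      // NUL-terminated for the lifetime of `self`.
      fn as_ptr(&self) -> *const u8 {
          self.bytes.as_ptr()
      }
  }
  ```

  The type system now rejects, at compile time or construction time, the misuse a header comment can only warn about.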

  8. The control-plane/data-plane split got re-litigated. Initial split: Kotlin for the control plane (CRUD-ish workload, internal libraries available, devs productive), Rust for the data plane (latency-critical). It seemed like the best of both worlds. It was wrong:
      • The control plane actually shares logic with the data plane (hot-cluster detection, topology changes).
      • Different languages mean no shared library, so the Kotlin and Rust versions drift, and every drift produces a debug-fix-deploy cycle.
      • Simulation tooling cannot be shared across Kotlin and Rust.
      • Resolution: rewrite the control plane in Rust. By this time (a year later), Rust's 2021 edition had addressed the early paper-cuts; internal library support had caught up (the AWS Authentication Runtime Rust client actually outperformed its Java version); API Gateway + Lambda had absorbed integration concerns. The team asked "when can we start?", not "do we have to?". See concepts/control-plane-data-plane-separation (⚠️ contradiction section added there).

  9. Ramp investment matters as much as the language. "The DSQL Book" (an internal guide by Marc Brooker) walked developers from philosophy through deferred choices; weekly learning sessions covered distributed computing, paper reviews, and architectural deep-dives; Niko Matsakis (Rust language designer) worked through thorny problems before code was written. This is why a JVM-native team shipped 10× Adjudicator performance on its first Rust attempt.

  10. Rust is not universally right. JDK 21 is good enough for most services. The post explicitly caveats: if tail latency is critical and your team can invest in the ramp, choose Rust; if you would be the only Rust team in a Java-standardized org, weigh the isolation cost. The choice is architectural, not ideological.

Caveats / limits

  • The read-path design of 2021 is referenced but not detailed. The post is a write-path + language-choice retrospective, not a full architecture paper. For detail, the post recommends Marc Brooker's "DSQL Vignettes" series.
  • Exact Crossbar internals (buffering strategy, how subscribers that fall behind are handled, backpressure) are gestured at but not specified.
  • The 10× Adjudicator number compares "years-tuned Kotlin" against "first-cut Rust, no tuning." It is not a general claim about JVM vs Rust — it's specific to this workload (low-overhead, latency-sensitive, hot loop).
  • Control plane split-vs-unified is an explicit ⚠️ contradiction: the post retracts its earlier "Kotlin control plane, Rust data plane" framing in favor of unified Rust.
