Just make it scale: An Aurora DSQL story (Werner Vogels, guest-authored by Niko Matsakis & Marc Bowes)¶
Summary¶
Werner Vogels hosts a guest post by Sr. Principal Engineers Niko Matsakis (a core Rust language designer) and Marc Bowes on the engineering journey of systems/aurora-dsql — a serverless distributed SQL database launched at re:Invent 2024. The piece is a retrospective on two architectural bets that defined the project: (1) how to scale writes horizontally without two-phase commit, landing on a single-journal-per-transaction design with a novel Crossbar subscription router between journals and storage; and (2) why the team moved from 100% JVM (Kotlin) to 100% Rust across both data and control planes — driven by tail-latency math at scale (the probability that a transaction hits at least one GC pause approaches 1) and memory-safety economics (new code contributes the overwhelming majority of new vulnerabilities). DSQL extends PostgreSQL via its public extension API rather than forking it.
Architecture¶
- Goal: serverless, horizontally scaling relational DB with no infra management; multi-region; zero ops overhead. Positioned as the next step after Aurora (cloud-optimized storage) and Aurora Serverless (vertical scaling).
- Decomposition: "bite-sized chunks with clear interfaces and explicit contracts." Unix-mantra modules combining to provide full ACID.
- Read path (solved 2021): not detailed in this post.
- Write path: the 2024 breakthrough.
- Rejected design: 2PC-per-journal with row-range ownership. Fine on the happy path, but timeouts, liveness concerns, rollbacks, and coordinator failures compound operational complexity.
- Chosen design: the entire commit goes into a single journal, no matter how many rows it touches. This makes atomicity and durability trivial.
- Consequence: read path now must check all journals — any journal may hold the latest value for a row. Storage needs connections to every journal → network bandwidth wall as journals scale.
- Crossbar (solves the read-path blow-up):
- Sits between journals and storage.
- Subscription API: storage nodes subscribe to key ranges.
- Crossbar follows each journal (each ordered by transaction time) and assembles a total order, routing updates to subscribed storage nodes.
- Separates write-path scaling from read-path scaling.
- Adjudicator: front of the journal, ensures only one transaction wins on conflicts. (Used as the pilot for the Rust migration — see below.)
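A minimal, hypothetical sketch of the Crossbar idea in Rust (the names and types are invented for illustration; the real component streams continuously and handles backpressure, which this batch model omits):

```rust
// Hypothetical model of the Crossbar: merge per-journal streams (each
// already ordered by commit time) into one total order, then route each
// update to the storage nodes subscribed to the matching key range.

#[derive(Clone, Debug)]
struct Update {
    ts: u64,  // commit timestamp: the total-order key
    key: u32, // row key
}

struct Subscriber {
    lo: u32,               // subscribed key range [lo, hi)
    hi: u32,
    received: Vec<Update>, // updates routed to this storage node
}

fn crossbar(journals: Vec<Vec<Update>>, subs: &mut [Subscriber]) {
    // Batch merge for clarity; a real implementation would k-way
    // stream-merge the journals instead of collecting everything.
    let mut all: Vec<Update> = journals.into_iter().flatten().collect();
    all.sort_by_key(|u| u.ts);
    for u in all {
        for s in subs.iter_mut() {
            if u.key >= s.lo && u.key < s.hi {
                s.received.push(u.clone());
            }
        }
    }
}

fn main() {
    // Two journals, each internally ordered by ts.
    let journals = vec![
        vec![Update { ts: 1, key: 10 }, Update { ts: 4, key: 90 }],
        vec![Update { ts: 2, key: 50 }, Update { ts: 3, key: 15 }],
    ];
    let mut subs = vec![
        Subscriber { lo: 0, hi: 50, received: vec![] },   // storage node A
        Subscriber { lo: 50, hi: 100, received: vec![] }, // storage node B
    ];
    crossbar(journals, &mut subs);
    // Node A sees keys 10 and 15 in total order (ts 1, then 3);
    // node B sees keys 50 and 90 (ts 2, then 4).
    assert_eq!(subs[0].received.iter().map(|u| u.ts).collect::<Vec<_>>(), vec![1, 3]);
    assert_eq!(subs[1].received.iter().map(|u| u.ts).collect::<Vec<_>>(), vec![2, 4]);
}
```

The key property: each storage node only needs one connection (to the Crossbar) instead of one per journal, which is how write-path scaling is decoupled from read-path scaling.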
Numbers to remember¶
- JVM Kotlin Adjudicator: 2,000 → 3,000 TPS after years of incremental tuning.
- Rust Adjudicator (rewrite by Java devs, no tuning): 30,000 TPS — 10× improvement.
- Crossbar simulation, 40 hosts, modeling 1-second GC stalls: expected ~1M TPS, got ~6K TPS; tail latency blew up from 1s → 10s. Fundamental architecture-level problem: every transaction reads from every host, so at scale, P(at least one GC pause per transaction) → 1.
- Post-Rust rewrite: p99 tracks close to p50 — tail latency consistent with median.
- Postgres: started 1986, >1M lines of C, thousands of contributors.
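The tail-at-scale arithmetic behind the simulation collapse can be sketched directly; the 1% per-host pause probability below is illustrative, not a figure from the post:

```rust
// If each of n hosts independently stalls (e.g. for GC) with probability p
// during a transaction's window, the chance the transaction hits at least
// one stall is 1 - (1 - p)^n, which approaches 1 as n grows.
fn p_any_stall(p: f64, n: i32) -> f64 {
    1.0 - (1.0 - p).powi(n)
}

fn main() {
    // Illustrative p = 1% per host (an assumption for this sketch).
    for n in [1, 10, 40, 100, 500] {
        println!("n = {n:>3}: P(at least one stall) = {:.3}", p_any_stall(0.01, n));
    }
    // Negligible on one host, near-certain across hundreds.
    assert!(p_any_stall(0.01, 500) > 0.99);
}
```

This is why tuning individual hosts cannot fix the problem: as long as p is nonzero, adding hosts drives the per-transaction stall probability toward 1.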
Key takeaways¶
- Scaling writes: avoid 2PC, pay the read-path price. Instead of sharding rows across journals and paying the 2PC cost on multi-journal transactions, DSQL writes the entire commit into a single journal. This trades cheap writes for a heavier read path, but the read path can be solved by a dedicated component (Crossbar) that fans out subscribed updates to storage. (Source: body, "Scaling the Journal layer".)
- The Crossbar is a pub-sub total-order router. Storage nodes subscribe to key ranges; the Crossbar follows each journal (already ordered by transaction time) and composes the global total order, routing updates to subscribers. See systems/aurora-dsql for the detailed component breakdown.
- At scale, tail latency becomes the system's latency. If every transaction has to touch N hosts and each host has some probability p of a GC pause, the probability that at least one host stalls during the transaction approaches 1 as N grows. DSQL's 40-host simulation hit 0.6% of target TPS and 10× the target tail latency. This is the "tail at scale" result Marc Brooker writes about, and it forced a language choice. See concepts/tail-latency-at-scale. (Source: body, "Scaling the Journal layer".)
- Rust chosen over JVM tuning or C/C++ because it delivered (a) predictable performance without GC, (b) memory safety without giving up control, and (c) zero-cost abstractions. The JVM optimization path was viable but bounded; the C/C++ path was viable but unsafe. The 10× Adjudicator result settled the question: the team stopped asking "should we use Rust?" and started asking "where else can Rust help?"
- Pilot-project migration: start small, measure, expand. DSQL did not rewrite the whole data plane first. The team picked the Adjudicator as the pilot: the simplest component, it already existed on the JVM, and a Rust journal client was already available. See patterns/pilot-component-language-migration. This "one-way door" decision was de-risked by building proof at the cheapest blast radius first.
- Postgres via extensions, not a fork. Postgres is designed to be extensible; the team used its public extension API rather than hard-forking. Forks start with good intentions and "slowly drift into maintenance nightmares." DSQL replaced replication, concurrency control, durability, storage, and session management, all via extensions that live in their own packages/files. See patterns/postgres-extension-over-fork. (Source: body, "It's easier to fix one hard problem…".)
- Rust for the extensions, not C. Initial instinct: C, for the lowest impedance mismatch with Postgres. Reality: Postgres's existing C is battle-tested; new code is where new bugs live. Google's Android team published data showing that most new memory-safety bugs come from new code, matching the team's observation. Rust lets the team build safe abstractions (e.g. a single `String` type encapsulating the `char*` + `len` invariants that Postgres header comments warn about) and encode them in the type system. See concepts/memory-safety. (Source: body, "It's easier to fix one hard problem…".)
- The control-plane/data-plane split got re-litigated. Initial split: Kotlin for the control plane (CRUD-ish, internal libs available, devs productive), Rust for the data plane (latency-critical). It seemed best-of-both-worlds. It was wrong:
- Control plane actually shares logic with data plane (hot-cluster detection, topology changes).
- Different languages → can't share a library → Kotlin and Rust versions drift → every drift produces a debug-fix-deploy cycle.
- Can't share simulation tooling across Kotlin + Rust.
- Resolution: rewrite the control plane in Rust. By this time (a year later), Rust's 2021 edition had addressed early paper cuts; internal library support had caught up (the AWS Authentication Runtime Rust client actually outperformed its Java version); API Gateway + Lambda had absorbed integration concerns. The team asked "when can we start?", not "do we have to?". See concepts/control-plane-data-plane-separation (⚠️ contradiction section added there).
- Ramp investment matters as much as the language. "The DSQL Book" (an internal guide by Marc Brooker) walked devs from philosophy through deferred choices; weekly learning sessions covered distributed computing, paper reviews, and architectural deep-dives; Niko Matsakis (Rust language designer) worked through thorny problems before code was written. This is why the JVM-native team shipped 10× Adjudicator performance on its first Rust attempt.
- Rust is not universally right. JDK 21 is good enough for most services. The post explicitly caveats: if tail latency is critical and your team can invest, choose Rust; if you're the only Rust team in a Java-standardized org, weigh the isolation cost. The choice is architectural, not ideological.
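The "safe abstractions" point about wrapping Postgres's `char*` + `len` conventions can be illustrated with a toy Rust wrapper. `PgText` and its methods are invented here; the post only describes the general pattern of establishing the invariant once and letting the type system enforce it:

```rust
// Toy illustration of encoding a C-style (char*, len) invariant in a type.
// In C, the pointer and the length travel separately and can drift apart;
// here the invariant ("len always matches the buffer, bytes are valid
// UTF-8") is established in one constructor and callers cannot break it.
// `PgText` is an invented name, not DSQL's actual type.
pub struct PgText {
    bytes: Vec<u8>, // invariant: valid UTF-8; length is bytes.len()
}

impl PgText {
    /// The single place where the invariant is established.
    pub fn new(s: &str) -> Self {
        PgText { bytes: s.as_bytes().to_vec() }
    }

    /// Length in bytes; cannot disagree with the buffer it describes.
    pub fn len(&self) -> usize {
        self.bytes.len()
    }

    /// Safe by construction: the invariant guarantees valid UTF-8.
    pub fn as_str(&self) -> &str {
        std::str::from_utf8(&self.bytes).expect("invariant: valid UTF-8")
    }
}

fn main() {
    let t = PgText::new("journal");
    assert_eq!(t.len(), 7);
    assert_eq!(t.as_str(), "journal");
}
```

The design point is where the checking lives: the header-comment rules that C callers must remember become a constructor postcondition, so every downstream use is safe by type, not by discipline.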
Caveats / limits¶
- The read-path design of 2021 is referenced but not detailed. The post is a write-path + language-choice retrospective, not a full architecture paper. For detail, the post recommends Marc Brooker's "DSQL Vignettes" series.
- Exact Crossbar internals (buffering strategy, how subscribers that fall behind are handled, backpressure) are gestured at but not specified.
- The 10× Adjudicator number compares "years-tuned Kotlin" against "first-cut Rust, no tuning." It is not a general claim about JVM vs Rust — it's specific to this workload (low-overhead, latency-sensitive, hot loop).
- Control plane split-vs-unified is an explicit ⚠️ contradiction: the post retracts its earlier "Kotlin control plane, Rust data plane" framing in favor of unified Rust.