Just make it scale: An Aurora DSQL story (Werner Vogels, guest-authored by Niko Matsakis & Marc Bowes)

Summary

Werner Vogels hosts a guest post by Sr. Principal Engineers Niko Matsakis (a core Rust language designer) and Marc Bowes on the engineering journey of systems/aurora-dsql — a serverless distributed SQL database launched at re:Invent 2024. The piece is a retrospective on two architectural bets that defined the project: (1) how to scale writes horizontally without two-phase commit, landing on a single-journal-per-transaction design with a novel Crossbar subscription router between journals and storage; and (2) why the team moved from 100% JVM (Kotlin) to 100% Rust across both data and control planes — driven by tail-latency math at scale (the probability that a transaction hits at least one GC pause approaches 1) and memory-safety economics (new code contributes the overwhelming majority of new vulnerabilities). DSQL extends PostgreSQL via its public extension API rather than forking it.

Architecture

  • Goal: serverless, horizontally scaling relational DB with no infra management; multi-region; zero ops overhead. Positioned as the next step after Aurora (cloud-optimized storage) and Aurora Serverless (vertical scaling).
  • Decomposition: "bite-sized chunks with clear interfaces and explicit contracts." Unix-mantra modules combining to provide full ACID.
  • Read path (solved 2021): not detailed in this post.
  • Write path: the 2024 breakthrough.
  • Rejected design: 2PC-per-journal with row-range ownership. Fine on the happy path, but timeouts, liveness issues, rollbacks, and coordinator failures compound operational complexity.
  • Chosen design: the entire commit goes into a single journal, no matter how many rows it touches. One append is one commit, so atomicity and durability come for free.
  • Consequence: read path now must check all journals — any journal may hold the latest value for a row. Storage needs connections to every journal → network bandwidth wall as journals scale.
  • Crossbar (solves the read-path blow-up):
      • Sits between journals and storage.
      • Subscription API: storage nodes subscribe to key ranges.
      • Follows each journal (each ordered by transaction time) and assembles a total order, routing updates to subscribed storage nodes.
      • Separates write-path scaling from read-path scaling.
  • Adjudicator: sits at the front of the journal; ensures only one transaction wins on a conflict. (Used as the pilot for the Rust migration; see below.)
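
  The subscription-routing idea can be sketched as a k-way merge over per-journal ordered streams, fanned out by key range. This is a minimal illustration under stated assumptions; all type and function names here are invented, not DSQL's actual API:

  ```rust
  use std::cmp::Reverse;
  use std::collections::BinaryHeap;
  use std::ops::Range;

  // One committed update as it appears in a journal (hypothetical shape).
  #[derive(Clone, Debug)]
  struct Update {
      txn_time: u64, // each journal is internally ordered by this
      key: u64,      // row key, simplified to an integer
      value: String,
  }

  // A storage node subscribes to a half-open key range.
  struct Subscriber {
      range: Range<u64>,
      received: Vec<Update>,
  }

  // Merge N journals (each already sorted by txn_time) into one total order,
  // routing each update to every subscriber whose range covers its key.
  fn crossbar(journals: Vec<Vec<Update>>, subs: &mut [Subscriber]) {
      // Min-heap of (txn_time, journal index, position): pop smallest time next.
      let mut heap = BinaryHeap::new();
      for (j, journal) in journals.iter().enumerate() {
          if !journal.is_empty() {
              heap.push(Reverse((journal[0].txn_time, j, 0usize)));
          }
      }
      while let Some(Reverse((_, j, i))) = heap.pop() {
          let u = &journals[j][i];
          for s in subs.iter_mut() {
              if s.range.contains(&u.key) {
                  s.received.push(u.clone());
              }
          }
          if i + 1 < journals[j].len() {
              heap.push(Reverse((journals[j][i + 1].txn_time, j, i + 1)));
          }
      }
  }
  ```

  The point of the sketch: storage nodes never talk to journals directly, so adding journals (write scaling) does not multiply storage-node connections (read scaling).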

Numbers to remember

  • Kotlin (JVM) Adjudicator: 2,000 → 3,000 TPS after years of incremental tuning.
  • Rust Adjudicator (rewrite by Java devs, no tuning): 30,000 TPS — 10× improvement.
  • Crossbar simulation, 40 hosts, modeling 1-second GC stalls: expected ~1M TPS, got ~6K TPS; tail latency blew up from 1s → 10s. Fundamental architecture-level problem: every transaction reads from every host, so at scale, P(at least one GC pause per transaction) → 1.
  • Post-Rust rewrite: p99 tracks close to p50 — tail latency consistent with median.
  • Postgres: started 1986, >1M lines of C, thousands of contributors.

Key takeaways

  1. Scaling writes: avoid 2PC, pay the read-path price. Instead of sharding rows across journals and paying 2PC cost on multi-journal transactions, DSQL writes the entire commit into a single journal. This trades cheap writes for a heavier read path, but the read path can be solved by a dedicated component (Crossbar) that fans out subscribed updates to storage. (Source: body, "Scaling the Journal layer".)
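
  The "one append is one commit" idea reduces to a data-shape decision; a minimal sketch, with hypothetical types that are not DSQL's actual ones:

  ```rust
  // All row writes from one transaction travel together in a single record,
  // so the journal append itself is the atomic + durable commit point.
  #[derive(Debug)]
  struct RowWrite {
      key: String,
      value: String,
  }

  #[derive(Debug)]
  struct CommitRecord {
      txn_time: u64,
      writes: Vec<RowWrite>, // every row the transaction touched
  }

  #[derive(Default)]
  struct Journal {
      records: Vec<CommitRecord>,
  }

  impl Journal {
      // One append = one commit, regardless of row count. No row is ever
      // split across journals, so no cross-journal coordination (2PC) exists.
      fn append(&mut self, record: CommitRecord) {
          self.records.push(record);
      }
  }
  ```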

  2. The Crossbar is a pub-sub total-order router. Storage nodes subscribe to key ranges; Crossbar follows each journal (already ordered by transaction time) and composes the global total order, routing updates to subscribers. See systems/aurora-dsql for the detailed component breakdown.

  3. At scale, tail latency becomes the system's latency. If every transaction has to touch N hosts and each host has some probability p of a GC pause, the probability that at least one host stalls during the transaction approaches 1 as N grows. DSQL's 40-host simulation hit 0.6% of target TPS and 10× the target tail latency. This is the classic "tail at scale" effect (Dean & Barroso), and it forced the language choice. See concepts/tail-latency-at-scale. (Source: body, "Scaling the Journal layer".)
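
  The math behind takeaway 3 is easy to reproduce. If each of N hosts independently stalls with probability p during a transaction's window, the chance the transaction sees at least one stall is 1 − (1 − p)^N. A back-of-envelope sketch (my own illustration, not code from the post):

  ```rust
  // Probability that a transaction fanning out to `n` hosts hits at least
  // one stall, if each host independently stalls with probability `p`.
  fn p_any_stall(p: f64, n: u32) -> f64 {
      1.0 - (1.0 - p).powi(n as i32)
  }
  ```

  With p = 1% per host, a 40-host fan-out already stalls roughly a third of transactions (1 − 0.99^40 ≈ 0.33), and at 500 hosts a stall is near-certain (≈ 0.99). That is why p99, not p50, becomes the effective latency.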

  4. Rust chosen over JVM tuning or C/C++ because it delivered (a) predictable performance without GC, (b) memory safety without giving up control, (c) zero-cost abstractions. The JVM optimization path was viable but bounded; the C/C++ path was viable but unsafe. The 10× Adjudicator result settled the question: the team stopped asking "should we use Rust?" and started asking "where else can Rust help?"

  5. Pilot-project migration: start small, measure, expand. DSQL did not rewrite the whole data plane first. The team picked the Adjudicator as the pilot: the simplest component, with a Rust journal client already available. See patterns/pilot-component-language-migration. The language choice is a "one-way door" decision; it was de-risked by building proof at the cheapest blast radius first.

  6. Postgres via extensions, not a fork. Postgres is designed extensible; the team used its public extension API rather than hard-forking. Forks start with good intentions and "slowly drift into maintenance nightmares." DSQL replaced replication, concurrency control, durability, storage, session management — all via extensions that live in their own packages/files. See patterns/postgres-extension-over-fork. (Source: body, "It's easier to fix one hard problem…".)

  7. Rust for the extensions, not C. Initial instinct: C, for the lowest impedance mismatch with Postgres. Reality: Postgres's existing C is battle-tested; new code is where new bugs live. Google's Android team published data showing that most new memory-safety bugs come from new code, matching the team's observation. Rust lets the team build safe abstractions (e.g. a single String type encapsulating the char* + len invariants that Postgres header comments can only warn about) and encode them in the type system. See concepts/memory-safety. (Source: body, "It's easier to fix one hard problem…".)
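
  That kind of invariant-encoding can be sketched as a wrapper that enforces, by construction, the contract a C API documents only in comments. This is an illustrative sketch, not DSQL's actual abstraction (Rust's own `std::ffi::CString` plays a similar role):

  ```rust
  // A string wrapper that guarantees: no interior NUL bytes, plus a trailing
  // NUL, so the buffer can always be handed to C code as a `char *` with a
  // matching `len`. The invariant lives in the constructor, not in comments.
  #[derive(Debug)]
  struct CSafeString {
      bytes: Vec<u8>, // always ends with exactly one NUL; no interior NULs
  }

  impl CSafeString {
      fn new(s: &str) -> Result<Self, &'static str> {
          if s.bytes().any(|b| b == 0) {
              return Err("interior NUL byte");
          }
          let mut bytes = s.as_bytes().to_vec();
          bytes.push(0);
          Ok(CSafeString { bytes })
      }

      // Length in bytes, excluding the trailing NUL (the `len` C expects).
      fn len(&self) -> usize {
          self.bytes.len() - 1
      }

      // Pointer a C function could consume; the invariant guarantees it is
      // NUL-terminated for the lifetime of `self`.
      fn as_ptr(&self) -> *const u8 {
          self.bytes.as_ptr()
      }
  }
  ```

  The type system now rejects, at compile time or construction time, the misuse a header comment can only warn about.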

  8. The control-plane/data-plane split got re-litigated. Initial split: Kotlin for the control plane (CRUD-ish workload, internal libraries available, devs productive), Rust for the data plane (latency-critical). It seemed like the best of both worlds. It was wrong:
      • The control plane actually shares logic with the data plane (hot-cluster detection, topology changes).
      • Different languages mean no shared library, so the Kotlin and Rust versions drift, and every drift produces a debug-fix-deploy cycle.
      • Simulation tooling cannot be shared across Kotlin and Rust.
      • Resolution: rewrite the control plane in Rust. By this time (a year later), Rust's 2021 edition had addressed the early paper-cuts; internal library support had caught up (the AWS Authentication Runtime Rust client actually outperformed its Java version); API Gateway + Lambda had absorbed integration concerns. The team asked "when can we start?", not "do we have to?". See concepts/control-plane-data-plane-separation (⚠️ contradiction section added there).

  9. Ramp investment matters as much as the language. "The DSQL Book" (an internal guide by Marc Brooker) walked developers from philosophy through deferred choices; weekly learning sessions covered distributed computing, paper reviews, and architectural deep-dives; Niko Matsakis (Rust language designer) worked through thorny problems before code was written. This is why a JVM-native team shipped 10× Adjudicator performance on its first Rust attempt.

  10. Rust is not universally right. JDK 21 is good enough for most services. The post explicitly caveats: if tail latency is critical and your team can invest in the ramp, choose Rust; if you would be the only Rust team in a Java-standardized org, weigh the isolation cost. The choice is architectural, not ideological.

Caveats / limits

  • The read-path design of 2021 is referenced but not detailed. The post is a write-path + language-choice retrospective, not a full architecture paper. For detail, the post recommends Marc Brooker's "DSQL Vignettes" series.
  • Exact Crossbar internals (buffering strategy, how subscribers that fall behind are handled, backpressure) are gestured at but not specified.
  • The 10× Adjudicator number compares "years-tuned Kotlin" against "first-cut Rust, no tuning." It is not a general claim about JVM vs Rust — it's specific to this workload (low-overhead, latency-sensitive, hot loop).
  • Control plane split-vs-unified is an explicit ⚠️ contradiction: the post retracts its earlier "Kotlin control plane, Rust data plane" framing in favor of unified Rust.
