Skip to content

SYSTEM Cited by 2 sources

ShardStore

Definition

ShardStore is the bottom-most layer of Amazon S3's storage stack — the per-disk software that manages data on each individual hard drive. A few years before Warfield's FAST '23 keynote, S3 rewrote this layer from scratch. Two choices define it architecturally:

  1. Implementation language: Rust. Chosen for type safety and to get structured language support for catching bugs early. The team went further and wrote libraries that extend Rust's type safety to apply to on-disk structures.
  2. Executable specification in the same repo. Alongside the production ShardStore, the team checks in a Rust model of ShardStore's logic, stripped of hardware and on-disk-layout complexity. It is "about 1% of the size of the real system," functions as a compact executable specification, and is exercised against the real code via property-based testing on every commit.

"It wound up being about 1% of the size of the real system, but allowed us to perform testing at a level that would have been completely impractical to do against a hard drive with 120 available IOPS."

— Andy Warfield, 2025

Why it exists

A single hard drive inside S3 delivers about 120 IOPS (see concepts/hard-drive-physics). A bug at the per-disk layer can cause silent corruption — the most expensive failure mode in a storage service that sells 11 9s of durability. Testing against real drives at 120 IOPS cannot cover enough state space to be confident. An executable specification, by contrast, runs at CPU speed and can be exhaustively exercised with property-based tests.

Guardrail philosophy

ShardStore is described by the post as an instance of the S3 team's preference for coarse-grained guardrails over per-bug mitigations. Rather than track each possible on-disk bug and write a test for it, the team invested in a structural property — "production behavior matches specification behavior" — and built tooling so normal engineers without PhDs in formal methods can maintain that property.

See patterns/durability-review and concepts/lightweight-formal-verification for the broader context.

Published companion

The team published at SOSP: "Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3" (paper).

Internal structure (Kozlovski 2024 detail)

Kozlovski's 2024 High Scalability explainer (Source: sources/2024-03-06-highscalability-behind-aws-s3s-massive-scale) adds a few specifics on the on-disk design not called out in the Warfield keynote:

  • It's a log-structured merge tree (see concepts/lsm-compaction) — the mainstream choice for write-heavy key-value stores, but an unusual one for a per-disk storage layer that might have otherwise been a bespoke layout.
  • Shard data is stored outside the tree. The tree tracks shard metadata; the shard bytes live in separate regions of the disk. This is a write-amplification optimization — the tree's own compaction moves small metadata entries around, not the large shard payloads. Compaction cost is bounded by metadata size, not data size.
  • Crash consistency via soft updates. Rather than journaling or copy-on-write, ShardStore uses a soft-updates-style protocol (disk writes carefully ordered so that any crash state is recoverable) — the same technique popularized by FreeBSD UFS2.
  • Designed ground-up for concurrency and HDD IO. Operation scheduling, coalescing, and request reordering to turn random HDD I/O into more-sequential access wherever the workload allows.
  • Originally ~40,000 lines of Rust. A compact system for what it does — fits the "small enough to formally verify" precondition that makes the executable-spec approach work.

These are the kinds of design choices that pair with the executable-spec approach: each one (LSM, external shard data, soft updates, concurrency-first IO) is a well-documented primitive whose correctness is tractable to model.

What ShardStore is not

  • Not the whole of S3. It is one layer among "hundreds of microservices." See systems/aws-s3.
  • Not a data format or replication engine — those live above it. ShardStore just owns the disk.
  • Not tied to any particular redundancy scheme (see concepts/erasure-coding).

Seen in

Last updated · 542 distilled / 1,571 read