PATTERN Cited by 3 sources
Metadata + chunk storage stack¶
Problem¶
A storage system needs:
- ACID correctness on "what exists where" (so readers and writers agree).
- Massive-scale, cheap durability on the actual bytes.
- Independent scaling of the two (metadata ops/sec ≠ byte throughput).
- Pluggable replacement of either tier without disrupting the other.
A single-system design (either ACID-over-bytes or eventually-consistent-over-everything) fails at least one of these.
Pattern¶
Split storage into two tiers operated independently:
- Metadata tier — a small transactional database storing the map {logical address} → {chunk id(s), chunk store location, chunk offsets, version, attrs}. ACID or near-ACID. Substrate choices: Redis, SQLite, Postgres, MySQL, FoundationDB, TiKV, etc.
- Chunk tier — content-addressed or opaque-ID byte storage. Chunks are immutable once written. Substrate choices: S3-compatible object storage, HDFS, internal blob services.
All reads and writes at the filesystem/volume/database API compose: (a) one or more metadata operations, then (b) zero or more chunk operations. Consistency lives in the metadata tier; bytes are eventually-consistent-at-scale across the chunk tier without correctness impact.
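The composition above can be sketched in a few lines. This is a minimal, hypothetical illustration: SQLite stands in for the metadata tier, a dict for the chunk tier, and all names (`chunk_map`, `write_file`, `read_file`, the 4 MiB chunk size) are illustrative, not taken from JuiceFS or Sprites. A write uploads immutable chunks first, then commits the map in one transaction, so the metadata tier never references bytes that were not durably written.

```python
import hashlib
import sqlite3

CHUNK_SIZE = 4 * 1024 * 1024  # illustrative 4 MiB chunk size

# Metadata tier: small transactional DB holding the map.
meta = sqlite3.connect(":memory:")
meta.execute("""CREATE TABLE chunk_map (
    path    TEXT,     -- logical address
    seq     INTEGER,  -- chunk order within the file
    chunk   TEXT,     -- content hash = chunk id
    version INTEGER,
    PRIMARY KEY (path, seq))""")

# Chunk tier: stand-in for an object store. Immutable, content-addressed.
chunk_store = {}


def write_file(path: str, data: bytes, version: int = 1) -> None:
    # Chunk ops first: upload immutable, content-addressed chunks.
    ids = []
    for i in range(0, len(data), CHUNK_SIZE):
        piece = data[i:i + CHUNK_SIZE]
        cid = hashlib.sha256(piece).hexdigest()
        chunk_store[cid] = piece  # idempotent: same bytes, same id
        ids.append(cid)
    # Then one transactional metadata op makes the new version visible.
    with meta:
        meta.execute("DELETE FROM chunk_map WHERE path = ?", (path,))
        meta.executemany(
            "INSERT INTO chunk_map VALUES (?, ?, ?, ?)",
            [(path, seq, cid, version) for seq, cid in enumerate(ids)])


def read_file(path: str) -> bytes:
    # Metadata op resolves the map; chunk ops fetch the bytes.
    rows = meta.execute(
        "SELECT chunk FROM chunk_map WHERE path = ? ORDER BY seq",
        (path,)).fetchall()
    return b"".join(chunk_store[cid] for (cid,) in rows)


write_file("/hello.txt", b"hello, sprites")
assert read_file("/hello.txt") == b"hello, sprites"
```

Note the ordering: because readers resolve the map before touching bytes, consistency lives entirely in the metadata transaction, and identical content dedupes for free under content addressing.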
See concepts/metadata-data-split-storage for the concept-level discussion.
Canonical wiki instances¶
Fly.io Sprites (2026-01-14)¶
"The Sprite storage stack is organized around the JuiceFS model (in fact, we currently use a very hacked-up JuiceFS, with a rewritten SQLite metadata backend). It works by splitting storage into data ('chunks') and metadata (a map of where the 'chunks' are). Data chunks live on object stores; metadata lives in fast local storage. In our case, that metadata store is kept durable with Litestream. Nothing depends on local storage."
(Source: [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]])
Metadata tier: SQLite + Litestream. Chunk tier: S3-compatible object store. Shape: per-Sprite POSIX filesystem.
JuiceFS (upstream)¶
JuiceFS is the named inspiration — a POSIX FS with pluggable metadata backends (Redis, MySQL, Postgres, TiKV, SQLite) over any S3-shape object store. Fly.io forked it; the architectural pattern is the same.
Tigris (2024-02-15)¶
Metadata tier: FoundationDB clusters per region. Chunk tier: NVMe byte cache per region + optional S3-compat backend. Shape: S3-compat object store.
LiteFS (adjacent variant)¶
Metadata tier: Primary-node-managed lease + LTX log. Chunk tier: LTX frames shipped across replicas. Shape: FUSE-based SQLite replication. (Not a pure instance — LTX frames are both metadata and data — but architecturally adjacent: consistency is concentrated in the lease + log; bytes flow async across replicas.)
HDFS / Colossus / MooseFS / Lustre¶
Metadata-server + block-server shape at datacenter scale. Named servers, named blocks; block replication policy lives in the metadata server. Same pattern, different scale.
Why the split keeps winning¶
- Consistency is small, bytes are big. A single-digit-GB metadata DB can handle petabytes of bytes' worth of transactional ordering.
- Bytes want async replication. Chunk tiers scale horizontally by adding replicas/regions, with metadata as the source of truth for placement.
- Independent operational knobs. Metadata DB backups/restore/failover is one playbook; object-store ops is another. No single playbook has to cover both.
- Cheap snapshots and forks. [[concepts/fast-checkpoint-via-metadata-shuffle|Snapshot = clone metadata; restore = re-point to the clone]]. Chunks are shared across snapshots for free.
- Control / data-plane separation at the layer boundary. Metadata is the control plane; bytes are the data plane.
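The "snapshot = clone metadata" point can be sketched concretely. A hypothetical schema (the `snap` tag and table layout are illustrative, not any real system's): snapshotting copies metadata rows under a new tag, an O(metadata) operation, while every chunk id keeps pointing at the same immutable bytes.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE chunk_map (
    snap  TEXT,     -- snapshot/volume tag
    path  TEXT,
    seq   INTEGER,
    chunk TEXT,     -- chunk id in the chunk tier (shared, immutable)
    PRIMARY KEY (snap, path, seq))""")

db.executemany("INSERT INTO chunk_map VALUES (?, ?, ?, ?)", [
    ("live", "/a", 0, "c1"),
    ("live", "/a", 1, "c2"),
    ("live", "/b", 0, "c3"),
])


def snapshot(src: str, dst: str) -> None:
    # Clone only the map; no chunk bytes move or copy.
    with db:
        db.execute(
            "INSERT INTO chunk_map SELECT ?, path, seq, chunk "
            "FROM chunk_map WHERE snap = ?", (dst, src))


snapshot("live", "snap-1")
# Two full "volumes" now exist, yet only three chunks are referenced.
distinct = db.execute(
    "SELECT COUNT(DISTINCT chunk) FROM chunk_map").fetchone()[0]
assert distinct == 3
```

Restore is the inverse pointer-move: re-point the live tag at the cloned rows. The chunk tier never participates.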
Recursive application: object-store-rooted metadata tier (2026-02-04)¶
The default assumption of this pattern is that the metadata tier is a traditional DB on local-ish durable storage (Redis on a replica set, FoundationDB on a regional cluster, SQLite on a durable volume). sources/2026-02-04-flyio-litestream-writable-vfs introduces a variant where the metadata tier itself applies this pattern recursively: Fly.io's Sprite "block map" is a JuiceFS metadata backend running on SQLite + Litestream VFS in writable + hydration mode — i.e., the metadata tier is itself object-store-rooted, served via HTTP Range GETs against LTX files with a background-hydrated local file for steady state.
Concretely:
```
Sprite user data (files)
├── metadata: which chunks are in which file
│   ├── metadata-of-metadata: SQLite page index
│   │   └── stored in LTX files on object storage
│   │       (Range GET + LRU cache + background hydrate)
│   └── stored in SQLite (Litestream VFS)
└── chunks: on object storage (JuiceFS)
```
Both the user bytes and the per-Sprite filesystem metadata root at object storage. The local host has no durability responsibility — Sprite migration is a pointer-move, no block replication required.
The recursive shape works because:
- Block maps are small ("low tens of megabytes worst case"), so Litestream VFS's LTX-page-lookup + Range-GET model is cost-effective.
- Single-writer semantics hold trivially — each Sprite has one VM writing to its block map.
- The hydration mode bounds steady-state read latency to local-file speed once the VFS has hydrated.
- Cold boot cost (Range-GETs from object storage while serving an incoming HTTP request) is tolerable because block maps are small.
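The mechanism behind the recursive shape can be sketched as a page-level read path: fetch one SQLite page of the block map by byte range from object storage, and cache it locally. Everything here is a stand-in, not Litestream VFS's real on-disk format or API: `ltx_blob` simulates an LTX file, `page_index` a page lookup table, and `range_get` an HTTP Range GET.

```python
from functools import lru_cache

PAGE_SIZE = 4096  # illustrative SQLite page size

# Stand-in for an LTX file sitting in an object store,
# plus a page-number -> byte-offset index into it.
ltx_blob = bytearray(PAGE_SIZE * 8)
page_index = {pgno: pgno * PAGE_SIZE for pgno in range(8)}


def range_get(offset: int, length: int) -> bytes:
    # In the real system this would be an HTTP Range GET against S3.
    return bytes(ltx_blob[offset:offset + length])


@lru_cache(maxsize=1024)  # pages served locally after the first fetch
def read_page(pgno: int) -> bytes:
    return range_get(page_index[pgno], PAGE_SIZE)


# Cold read goes to "object storage"; the repeat is a cache hit.
read_page(3)
assert read_page.cache_info().hits == 0
read_page(3)
assert read_page.cache_info().hits == 1
```

Background hydration is the same loop run proactively: walk every page once so steady-state reads never leave the local file. Because block maps are tens of megabytes, that walk is cheap.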
This is a distinct deployment shape of the pattern worth calling out: the canonical choice axes table below gains a third option in the metadata-substrate row, Shape 3 = "Strongly-consistent DB hosted in object storage via a page-level-read VFS" (Litestream VFS).
Canonical choice axes¶
| Axis | Shape 1 | Shape 2 |
|---|---|---|
| Metadata substrate | Strongly-consistent DB (FDB, SQLite) | Eventually-consistent KV (S3 ETags, DynamoDB) |
| Chunk substrate | S3 / GCS / Azure Blob | Local NVMe + async replicate |
| Chunk addressing | Content-hash (content-addressable) | Opaque IDs with version field |
| Filesystem shape | POSIX (JuiceFS, Sprites) | Object store (Tigris, S3) |
| Replica strategy | Demand-driven (Tigris) | Pre-placed (HDFS replication) |
Trade-offs¶
- Two-system ops burden. Strictly worse than a one-system design on the ops axis.
- Cross-tier skew windows. Metadata says "bytes exist" a few ms before bytes actually arrive on some replica. Readers must handle miss-then-retry.
- Metadata is the scaling bottleneck. Write TPS through the metadata DB is the overall ceiling.
- Chunk GC is non-trivial. Unreferenced chunks pile up if metadata-driven GC doesn't run regularly.
- Debugging across two systems. "Is the problem metadata or bytes?" is a new question ops teams have to answer.
- Pluggable metadata backend != free swap. The Sprites team "rewrote" the SQLite metadata backend for JuiceFS; swapping between Redis / SQLite / FoundationDB is real work.
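The chunk-GC trade-off reduces to mark-and-sweep, with the metadata tier as the root set. A minimal sketch under hypothetical names (`chunk_map` rows, a dict for the chunk tier); a real implementation also needs a grace period so chunks uploaded ahead of a not-yet-committed metadata transaction aren't swept as garbage.

```python
# Chunk tier and metadata tier, reduced to in-memory stand-ins.
chunk_store = {"c1": b"aa", "c2": b"bb", "c3": b"cc"}
chunk_map = [("/a", 0, "c1"), ("/a", 1, "c2")]  # nothing references c3


def gc(store: dict, rows: list) -> list:
    # Mark: walk the metadata map to collect every referenced chunk id.
    live = {chunk for (_path, _seq, chunk) in rows}
    # Sweep: delete chunks no metadata row points at.
    dead = [cid for cid in store if cid not in live]
    for cid in dead:
        del store[cid]
    return dead


assert gc(chunk_store, chunk_map) == ["c3"]
assert set(chunk_store) == {"c1", "c2"}
```

The mark phase is why GC must be metadata-driven: only the metadata tier knows which chunks exist "officially", and the cross-tier skew window is exactly the reason the sweep needs to be conservative about recently written chunks.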
Relation to patterns/metadata-db-plus-object-cache-tier¶
The two patterns are close relatives:
- Metadata + chunk storage stack (this pattern) — general architectural shape: two tiers, independent substrates. Covers filesystems, object stores, volumes, databases.
- Metadata-db + object-cache tier — specialised to a global object store with per-region cache tier (Tigris shape). One specific deployment shape of this pattern.
Seen in¶
- [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]] — Sprites canonical.
- [[sources/2024-02-15-flyio-globally-distributed-object-storage-with-tigris]] — Tigris canonical.
- sources/2026-02-04-flyio-litestream-writable-vfs — the recursive-split variant. Sprite block map = SQLite + Litestream VFS in writable + hydration mode, making the metadata tier itself object-store-rooted. First wiki instance where the pattern is applied to its own metadata tier.
Related¶
- systems/fly-sprites
- systems/juicefs
- systems/tigris
- systems/litestream
- systems/litestream-vfs — the object-store-rooted metadata-tier implementation that enables the recursive-split variant.
- systems/litefs
- systems/sqlite
- systems/foundationdb
- systems/aws-s3
- concepts/metadata-data-split-storage
- concepts/object-storage-as-disk-root
- concepts/immutable-object-storage
- patterns/read-through-object-store-volume
- patterns/metadata-db-plus-object-cache-tier
- patterns/vfs-range-get-from-object-store — the mechanism that makes the recursive-split variant's metadata-tier viable.
- concepts/async-clone-hydration — the hydration shape that covers steady-state reads on the recursive-metadata-tier.
- companies/flyio