
PATTERN Cited by 3 sources

Metadata + chunk storage stack

Problem

A storage system needs:

  • ACID correctness on "what exists where" (so readers and writers agree).
  • Massive-scale, cheap durability on the actual bytes.
  • Independent scaling of the two (metadata ops/sec ≠ byte throughput).
  • Pluggable replacement of either tier without disrupting the other.

A single-system design (either ACID-over-bytes or eventually-consistent-over-everything) fails at least one of these.

Pattern

Split storage into two tiers operated independently:

  1. Metadata tier — a small transactional database storing the map {logical address} → {chunk id(s), chunk store location, chunk offsets, version, attrs}. ACID or near-ACID. Substrate choices: Redis, SQLite, Postgres, MySQL, FoundationDB, TiKV, etc.
  2. Chunk tier — content-addressed or opaque-ID byte storage. Chunks are immutable once written. Substrate choices: S3-compatible object storage, HDFS, internal blob services.

All reads and writes at the filesystem/volume/database API compose: (a) one or more metadata operations, then (b) zero or more chunk operations. Consistency lives in the metadata tier; bytes are eventually-consistent-at-scale across the chunk tier without correctness impact.
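The (a)-then-(b) composition can be sketched in a few lines. This is a toy model: SQLite stands in for the metadata tier, a dict stands in for the object store, and the `extents` schema and function names are illustrative, not JuiceFS's actual layout.

```python
import hashlib
import sqlite3

# Metadata tier: small, transactional. One row per (file, chunk position).
meta = sqlite3.connect(":memory:")
meta.execute("""
    CREATE TABLE extents (
        path  TEXT,
        seq   INTEGER,   -- chunk order within the file
        chunk TEXT,      -- content hash = chunk id
        size  INTEGER,
        PRIMARY KEY (path, seq)
    )""")

# Chunk tier: content-addressed, immutable once written.
chunks = {}

def write_file(path, data, chunk_size=4):
    pieces = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    ids = []
    for p in pieces:
        cid = hashlib.sha256(p).hexdigest()
        chunks.setdefault(cid, p)        # (b) chunk ops: upload bytes first
        ids.append((cid, len(p)))
    with meta:                           # (a) metadata: one ACID transaction
        meta.execute("DELETE FROM extents WHERE path = ?", (path,))
        meta.executemany(
            "INSERT INTO extents VALUES (?, ?, ?, ?)",
            [(path, i, cid, n) for i, (cid, n) in enumerate(ids)])

def read_file(path):
    rows = meta.execute(
        "SELECT chunk FROM extents WHERE path = ? ORDER BY seq", (path,))
    return b"".join(chunks[cid] for (cid,) in rows)

write_file("/hello.txt", b"hello, chunk world")
assert read_file("/hello.txt") == b"hello, chunk world"
```

Note the ordering: chunks are uploaded before the metadata transaction commits, so a committed metadata row never points at bytes that were never written. The reverse skew (bytes not yet on some replica) is still possible and shows up under trade-offs below.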

See concepts/metadata-data-split-storage for the concept-level discussion.

Canonical wiki instances

Fly.io Sprites (2026-01-14)

"The Sprite storage stack is organized around the JuiceFS model (in fact, we currently use a very hacked-up JuiceFS, with a rewritten SQLite metadata backend). It works by splitting storage into data ('chunks') and metadata (a map of where the 'chunks' are). Data chunks live on object stores; metadata lives in fast local storage. In our case, that metadata store is kept durable with Litestream. Nothing depends on local storage."

(Source: [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]])

Metadata tier: SQLite + Litestream. Chunk tier: S3-compatible object store. Shape: per-Sprite POSIX filesystem.

JuiceFS (upstream)

JuiceFS is the named inspiration — a POSIX FS with pluggable metadata backends (Redis, MySQL, Postgres, TiKV, SQLite) over any S3-shape object store. Fly.io forked it; the architectural pattern is the same.

Tigris (2024-02-15)

Metadata tier: FoundationDB clusters per region. Chunk tier: NVMe byte cache per region + optional S3-compat backend. Shape: S3-compat object store.

LiteFS (adjacent variant)

Metadata tier: Primary-node-managed lease + LTX log. Chunk tier: LTX frames shipped across replicas. Shape: FUSE-based SQLite replication. (Not a pure instance — LTX frames are both metadata and data — but architecturally adjacent: consistency is concentrated in the lease + log; bytes flow async across replicas.)

HDFS / Colossus / MooseFS / Lustre

Metadata-server + block-server shape at datacenter scale. Named servers, named blocks; block replication policy lives in the metadata server. Same pattern, different scale.

Why the split keeps winning

  • Consistency is small, bytes are big. A single-digit-GB metadata DB can handle petabytes of bytes' worth of transactional ordering.
  • Bytes want async replication. Chunk tiers scale horizontally by adding replicas/regions, with metadata as the source of truth for placement.
  • Independent operational knobs. Metadata DB backups/restore/failover is one playbook; object-store ops is another. No single playbook has to cover both.
  • Cheap snapshots and forks. [[concepts/fast-checkpoint-via-metadata-shuffle|Snapshot = clone metadata; restore = re-point to the clone]]. Chunks are shared across snapshots for free.
  • Control / data-plane separation at the layer boundary. Metadata is the control plane; bytes are the data plane.
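The snapshot-by-metadata-clone point can be made concrete. A hypothetical sketch, with SQLite again standing in for the metadata tier: cloning the extent rows under a new snapshot id is O(metadata), and the underlying chunks stay shared.

```python
import sqlite3

# Toy metadata tier with a snapshot dimension; schema is illustrative.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE extents (
    snap TEXT, path TEXT, seq INTEGER, chunk TEXT,
    PRIMARY KEY (snap, path, seq))""")
db.executemany("INSERT INTO extents VALUES (?, ?, ?, ?)",
               [("live", "/a", 0, "c1"), ("live", "/a", 1, "c2")])

def snapshot(db, name):
    # Copy metadata rows only; zero chunk bytes move.
    with db:
        db.execute(
            "INSERT INTO extents SELECT ?, path, seq, chunk "
            "FROM extents WHERE snap = 'live'", (name,))

snapshot(db, "snap-1")
# The live tree keeps mutating; the snapshot still points at c1, c2.
db.execute("UPDATE extents SET chunk = 'c3' WHERE snap = 'live' AND seq = 1")
assert [r[0] for r in db.execute(
    "SELECT chunk FROM extents WHERE snap = 'snap-1' ORDER BY seq")] == ["c1", "c2"]
```

Restore is the same move in reverse: re-point the live name at the snapshot's rows.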

Recursive application: object-store-rooted metadata tier (2026-02-04)

The default assumption of this pattern is that the metadata tier is a traditional DB on local-ish durable storage (Redis on a replica set, FoundationDB on a regional cluster, SQLite on a durable volume). sources/2026-02-04-flyio-litestream-writable-vfs introduces a variant where the metadata tier itself applies this pattern recursively: Fly.io's Sprite "block map" is a JuiceFS metadata backend running on SQLite + Litestream VFS in writable + hydration mode — i.e., the metadata tier is itself object-store-rooted, served via HTTP Range GETs against LTX files with a background-hydrated local file for steady state.

Concretely:

Sprite user data (files)
    ├── metadata: which chunks are in which file
    │       ├── metadata-of-metadata: SQLite page index
    │       │       └── stored in LTX files on object storage
    │       │           (Range GET + LRU cache + background hydrate)
    │       └── stored in SQLite (Litestream VFS)
    └── chunks: on object storage (JuiceFS)

Both the user bytes and the per-Sprite filesystem metadata root at object storage. The local host has no durability responsibility — Sprite migration is a pointer-move, no block replication required.

The recursive shape works because:

  • Block maps are small ("low tens of megabytes worst case"), so Litestream VFS's LTX-page-lookup + Range-GET model is cost-effective.
  • Single-writer semantics hold trivially — each Sprite has one VM writing to its block map.
  • The hydration mode bounds steady-state read latency to local-file speed once the VFS has hydrated.
  • Cold boot cost (Range-GETs from object storage while serving an incoming HTTP request) is tolerable because block maps are small.

This is a distinct deployment shape of the pattern worth calling out — the canonical choice axes table below gains a third column for the metadata-substrate row: Shape 3 = "Strongly-consistent DB hosted in object storage via a page-level-read VFS" (Litestream VFS).

Canonical choice axes

| Axis | Shape 1 | Shape 2 |
|---|---|---|
| Metadata substrate | Strongly-consistent DB (FDB, SQLite) | Eventually-consistent KV (S3 ETags, DynamoDB) |
| Chunk substrate | S3 / GCS / Azure Blob | Local NVMe + async replicate |
| Chunk addressing | Content-hash (content-addressable) | Opaque IDs with version field |
| Filesystem shape | POSIX (JuiceFS, Sprites) | Object store (Tigris, S3) |
| Replica strategy | Demand-driven (Tigris) | Pre-placed (HDFS replication) |

Trade-offs

  • Two-system ops burden. Strictly worse than a one-system design on the ops axis.
  • Cross-tier skew windows. Metadata says "bytes exist" a few ms before bytes actually arrive on some replica. Readers must handle miss-then-retry.
  • Metadata is the scaling bottleneck. Write TPS through the metadata DB is the overall ceiling.
  • Chunk GC is non-trivial. Unreferenced chunks pile up if metadata-driven GC doesn't run regularly.
  • Debugging across two systems. "Is the problem metadata or bytes?" is a new question ops teams have to answer.
  • Pluggable metadata backend != free swap. The Sprites team "rewrote" the SQLite metadata backend for JuiceFS; swapping between Redis / SQLite / FoundationDB is real work.
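The chunk-GC trade-off in particular has a standard shape: mark = the set of chunk ids the metadata tier still references, sweep = delete everything else, with a grace set protecting chunks uploaded before their metadata commit (otherwise in-flight writes look unreferenced). A minimal sketch, with hypothetical names:

```python
def gc(referenced_chunk_ids, chunk_store, grace=frozenset()):
    """Metadata-driven mark-and-sweep over the chunk tier.

    referenced_chunk_ids: every chunk id any metadata row still points at.
    grace: chunk ids uploaded recently, possibly awaiting metadata commit.
    """
    live = set(referenced_chunk_ids) | set(grace)
    for cid in list(chunk_store):       # sweep: delete unreferenced chunks
        if cid not in live:
            del chunk_store[cid]

store = {"c1": b"...", "c2": b"...", "orphan": b"..."}
gc({"c1", "c2"}, store)
assert set(store) == {"c1", "c2"}
```

In a real system the mark phase is a scan of the metadata DB and the sweep is a batched object-store delete; the grace window is usually time-based (skip chunks younger than the longest plausible write).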

Relation to patterns/metadata-db-plus-object-cache-tier

The two patterns are close relatives:

  • Metadata + chunk storage stack (this pattern) — general architectural shape: two tiers, independent substrates. Covers filesystems, object stores, volumes, databases.
  • Metadata-db + object-cache tier — specialised to a global object store with per-region cache tier (Tigris shape). One specific deployment shape of this pattern.

Seen in

  • [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]] — Sprites canonical.
  • [[sources/2024-02-15-flyio-globally-distributed-object-storage-with-tigris]] — Tigris canonical.
  • [[sources/2026-02-04-flyio-litestream-writable-vfs]] — the recursive-split variant. Sprite block map = SQLite + Litestream VFS in writable + hydration mode, making the metadata tier itself object-store-rooted. First wiki instance where the pattern is applied to its own metadata tier.