
PATTERN Cited by 3 sources

Metadata + chunk storage stack

Problem

A storage system needs:

  • ACID correctness on "what exists where" (so readers and writers agree).
  • Massive-scale, cheap durability on the actual bytes.
  • Independent scaling of the two (metadata ops/sec ≠ byte throughput).
  • Pluggable replacement of either tier without disrupting the other.

A single-system design (either ACID-over-bytes or eventually-consistent-over-everything) fails at least one of these.

Pattern

Split storage into two tiers operated independently:

  1. Metadata tier — a small transactional database storing the map {logical address} → {chunk id(s), chunk store location, chunk offsets, version, attrs}. ACID or near-ACID. Substrate choices: Redis, SQLite, Postgres, MySQL, FoundationDB, TiKV, etc.
  2. Chunk tier — content-addressed or opaque-ID byte storage. Chunks are immutable once written. Substrate choices: S3-compatible object storage, HDFS, internal blob services.

All reads and writes at the filesystem/volume/database API compose: (a) one or more metadata operations, then (b) zero or more chunk operations. Consistency lives in the metadata tier; bytes are eventually-consistent-at-scale across the chunk tier without correctness impact.
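The (a)-then-(b) composition can be sketched in a few lines. This is a toy model: SQLite stands in for the metadata tier, a dict stands in for the object store, and the `extents` schema and function names are illustrative, not JuiceFS's actual layout.

```python
import hashlib
import sqlite3

# Metadata tier: small, transactional. One row per (file, chunk position).
meta = sqlite3.connect(":memory:")
meta.execute("""
    CREATE TABLE extents (
        path  TEXT,
        seq   INTEGER,   -- chunk order within the file
        chunk TEXT,      -- content hash = chunk id
        size  INTEGER,
        PRIMARY KEY (path, seq)
    )""")

# Chunk tier: content-addressed, immutable once written.
chunks = {}

def write_file(path, data, chunk_size=4):
    pieces = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    ids = []
    for p in pieces:
        cid = hashlib.sha256(p).hexdigest()
        chunks.setdefault(cid, p)        # (b) chunk ops: upload bytes first
        ids.append((cid, len(p)))
    with meta:                           # (a) metadata: one ACID transaction
        meta.execute("DELETE FROM extents WHERE path = ?", (path,))
        meta.executemany(
            "INSERT INTO extents VALUES (?, ?, ?, ?)",
            [(path, i, cid, n) for i, (cid, n) in enumerate(ids)])

def read_file(path):
    rows = meta.execute(
        "SELECT chunk FROM extents WHERE path = ? ORDER BY seq", (path,))
    return b"".join(chunks[cid] for (cid,) in rows)

write_file("/hello.txt", b"hello, chunk world")
assert read_file("/hello.txt") == b"hello, chunk world"
```

Note the ordering: chunks are uploaded before the metadata transaction commits, so a committed metadata row never points at bytes that were never written. The reverse skew (bytes not yet on some replica) is still possible and shows up under trade-offs below.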

See concepts/metadata-data-split-storage for the concept-level discussion.

Canonical wiki instances

Fly.io Sprites (2026-01-14)

"The Sprite storage stack is organized around the JuiceFS model (in fact, we currently use a very hacked-up JuiceFS, with a rewritten SQLite metadata backend). It works by splitting storage into data ('chunks') and metadata (a map of where the 'chunks' are). Data chunks live on object stores; metadata lives in fast local storage. In our case, that metadata store is kept durable with Litestream. Nothing depends on local storage."

(Source: [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]])

Metadata tier: SQLite + Litestream. Chunk tier: S3-compatible object store. Shape: per-Sprite POSIX filesystem.

JuiceFS (upstream)

JuiceFS is the named inspiration — a POSIX FS with pluggable metadata backends (Redis, MySQL, Postgres, TiKV, SQLite) over any S3-shape object store. Fly.io forked it; the architectural pattern is the same.

Tigris (2024-02-15)

Metadata tier: FoundationDB clusters per region. Chunk tier: NVMe byte cache per region + optional S3-compat backend. Shape: S3-compat object store.

LiteFS (adjacent variant)

Metadata tier: Primary-node-managed lease + LTX log. Chunk tier: LTX frames shipped across replicas. Shape: FUSE-based SQLite replication. (Not a pure instance — LTX frames are both metadata and data — but architecturally adjacent: consistency is concentrated in the lease + log; bytes flow async across replicas.)

HDFS / Colossus / MooseFS / Lustre

Metadata-server + block-server shape at datacenter scale. Named servers, named blocks; block replication policy lives in the metadata server. Same pattern, different scale.

Why the split keeps winning

  • Consistency is small, bytes are big. A single-digit-GB metadata DB can handle petabytes of bytes' worth of transactional ordering.
  • Bytes want async replication. Chunk tiers scale horizontally by adding replicas/regions, with metadata as the source of truth for placement.
  • Independent operational knobs. Metadata DB backups/restore/failover is one playbook; object-store ops is another. No single playbook has to cover both.
  • Cheap snapshots and forks. [[concepts/fast-checkpoint-via-metadata-shuffle|Snapshot = clone metadata; restore = re-point to the clone]]. Chunks are shared across snapshots for free.
  • Control / data-plane separation at the layer boundary. Metadata is the control plane; bytes are the data plane.
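The snapshot-by-metadata-clone point can be made concrete. A hypothetical sketch, with SQLite again standing in for the metadata tier: cloning the extent rows under a new snapshot id is O(metadata), and the underlying chunks stay shared.

```python
import sqlite3

# Toy metadata tier with a snapshot dimension; schema is illustrative.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE extents (
    snap TEXT, path TEXT, seq INTEGER, chunk TEXT,
    PRIMARY KEY (snap, path, seq))""")
db.executemany("INSERT INTO extents VALUES (?, ?, ?, ?)",
               [("live", "/a", 0, "c1"), ("live", "/a", 1, "c2")])

def snapshot(db, name):
    # Copy metadata rows only; zero chunk bytes move.
    with db:
        db.execute(
            "INSERT INTO extents SELECT ?, path, seq, chunk "
            "FROM extents WHERE snap = 'live'", (name,))

snapshot(db, "snap-1")
# The live tree keeps mutating; the snapshot still points at c1, c2.
db.execute("UPDATE extents SET chunk = 'c3' WHERE snap = 'live' AND seq = 1")
assert [r[0] for r in db.execute(
    "SELECT chunk FROM extents WHERE snap = 'snap-1' ORDER BY seq")] == ["c1", "c2"]
```

Restore is the same move in reverse: re-point the live name at the snapshot's rows.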

Recursive application: object-store-rooted metadata tier (2026-02-04)

The default assumption of this pattern is that the metadata tier is a traditional DB on local-ish durable storage (Redis on a replica set, FoundationDB on a regional cluster, SQLite on a durable volume). sources/2026-02-04-flyio-litestream-writable-vfs introduces a variant where the metadata tier itself applies this pattern recursively: Fly.io's Sprite "block map" is a JuiceFS metadata backend running on SQLite + Litestream VFS in writable + hydration mode — i.e., the metadata tier is itself object-store-rooted, served via HTTP Range GETs against LTX files with a background-hydrated local file for steady state.

Concretely:

Sprite user data (files)
    ├── metadata: which chunks are in which file
    │       ├── metadata-of-metadata: SQLite page index
    │       │       └── stored in LTX files on object storage
    │       │           (Range GET + LRU cache + background hydrate)
    │       └── stored in SQLite (Litestream VFS)
    └── chunks: on object storage (JuiceFS)

Both the user bytes and the per-Sprite filesystem metadata root at object storage. The local host has no durability responsibility — Sprite migration is a pointer-move, no block replication required.

The recursive shape works because:

  • Block maps are small ("low tens of megabytes worst case"), so Litestream VFS's LTX-page-lookup + Range-GET model is cost-effective.
  • Single-writer semantics hold trivially — each Sprite has one VM writing to its block map.
  • The hydration mode bounds steady-state read latency to local-file speed once the VFS has hydrated.
  • Cold boot cost (Range-GETs from object storage while serving an incoming HTTP request) is tolerable because block maps are small.

This is a distinct deployment shape of the pattern worth calling out — the canonical choice axes table below gains a third column for the metadata-substrate row: Shape 3 = "Strongly-consistent DB hosted in object storage via a page-level-read VFS" (Litestream VFS).

Canonical choice axes

| Axis | Shape 1 | Shape 2 |
|---|---|---|
| Metadata substrate | Strongly-consistent DB (FDB, SQLite) | Eventually-consistent KV (S3 ETags, DynamoDB) |
| Chunk substrate | S3 / GCS / Azure Blob | Local NVMe + async replicate |
| Chunk addressing | Content-hash (content-addressable) | Opaque IDs with version field |
| Filesystem shape | POSIX (JuiceFS, Sprites) | Object store (Tigris, S3) |
| Replica strategy | Demand-driven (Tigris) | Pre-placed (HDFS replication) |

Trade-offs

  • Two-system ops burden. Strictly worse than a one-system design on the ops axis.
  • Cross-tier skew windows. Metadata says "bytes exist" a few ms before bytes actually arrive on some replica. Readers must handle miss-then-retry.
  • Metadata is the scaling bottleneck. Write TPS through the metadata DB is the overall ceiling.
  • Chunk GC is non-trivial. Unreferenced chunks pile up if metadata-driven GC doesn't run regularly.
  • Debugging across two systems. "Is the problem metadata or bytes?" is a new question ops teams have to answer.
  • Pluggable metadata backend != free swap. The Sprites team "rewrote" the SQLite metadata backend for JuiceFS; swapping between Redis / SQLite / FoundationDB is real work.
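The chunk-GC trade-off in particular has a standard shape: mark = the set of chunk ids the metadata tier still references, sweep = delete everything else, with a grace set protecting chunks uploaded before their metadata commit (otherwise in-flight writes look unreferenced). A minimal sketch, with hypothetical names:

```python
def gc(referenced_chunk_ids, chunk_store, grace=frozenset()):
    """Metadata-driven mark-and-sweep over the chunk tier.

    referenced_chunk_ids: every chunk id any metadata row still points at.
    grace: chunk ids uploaded recently, possibly awaiting metadata commit.
    """
    live = set(referenced_chunk_ids) | set(grace)
    for cid in list(chunk_store):       # sweep: delete unreferenced chunks
        if cid not in live:
            del chunk_store[cid]

store = {"c1": b"...", "c2": b"...", "orphan": b"..."}
gc({"c1", "c2"}, store)
assert set(store) == {"c1", "c2"}
```

In a real system the mark phase is a scan of the metadata DB and the sweep is a batched object-store delete; the grace window is usually time-based (skip chunks younger than the longest plausible write).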

Relation to patterns/metadata-db-plus-object-cache-tier

The two patterns are close relatives:

  • Metadata + chunk storage stack (this pattern) — general architectural shape: two tiers, independent substrates. Covers filesystems, object stores, volumes, databases.
  • Metadata-db + object-cache tier — specialised to a global object store with per-region cache tier (Tigris shape). One specific deployment shape of this pattern.

Seen in

  • [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]] — Sprites canonical.
  • [[sources/2024-02-15-flyio-globally-distributed-object-storage-with-tigris]] — Tigris canonical.
  • [[sources/2026-02-04-flyio-litestream-writable-vfs]] — the recursive-split variant. Sprite block map = SQLite + Litestream VFS in writable + hydration mode, making the metadata tier itself object-store-rooted. First wiki instance where the pattern is applied to its own metadata tier.