CONCEPT Cited by 2 sources
Metadata/data-split storage¶
Definition¶
An architectural shape for storage systems: persistent state is split into two layers that are operated and scaled independently:
- A small transactional metadata tier — ACID or near-ACID, mutated often, holds the map of "what bytes exist where". Usually a database (SQLite, FoundationDB, Redis, MySQL, Postgres) or a strongly-consistent KV store.
- A large immutable content tier — append-only chunks / objects keyed by content hash or opaque ID, stored on cheap durable substrate (object storage). Mostly written once, read many.
The two tiers speak different consistency languages. Reads and writes at the filesystem / volume / database API are composed of one metadata operation plus zero or more content operations.
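The composition can be sketched in a few lines. This is a hypothetical minimal model, not any product's actual code: SQLite stands in for the transactional metadata tier, and a plain dict stands in for the immutable, content-addressed byte tier. All names (`put`, `get`, `content`, the `objects` table) are invented for illustration.

```python
import hashlib
import sqlite3

# Metadata tier: small, transactional, mutated often.
meta = sqlite3.connect(":memory:")
meta.execute("CREATE TABLE objects (name TEXT PRIMARY KEY, chunks TEXT)")

# Content tier: immutable chunks keyed by content hash
# (a dict standing in for S3-style object storage).
content = {}

def put(name: str, data: bytes, chunk_size: int = 4) -> None:
    hashes = []
    # Step 1: write immutable chunks to the content tier.
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        h = hashlib.sha256(chunk).hexdigest()
        content[h] = chunk  # idempotent: content-addressed, write-once
        hashes.append(h)
    # Step 2: commit the map of "what bytes exist where" transactionally.
    with meta:
        meta.execute("INSERT OR REPLACE INTO objects VALUES (?, ?)",
                     (name, ",".join(hashes)))

def get(name: str) -> bytes:
    # Reads consult metadata first, then fetch content.
    row = meta.execute("SELECT chunks FROM objects WHERE name = ?",
                       (name,)).fetchone()
    return b"".join(content[h] for h in row[0].split(","))

put("greeting", b"hello world")
assert get("greeting") == b"hello world"
```

Note that a single `put` is exactly the shape described above: several content-tier writes plus one metadata transaction.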
Canonical wiki statements¶
Fly.io Sprites (2026-01-14)¶
"The Sprite storage stack is organized around the JuiceFS model (in fact, we currently use a very hacked-up JuiceFS, with a rewritten SQLite metadata backend). It works by splitting storage into data ('chunks') and metadata (a map of where the 'chunks' are). Data chunks live on object stores; metadata lives in fast local storage. In our case, that metadata store is kept durable with Litestream. Nothing depends on local storage."
(Source: [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]])
Tigris (2024-02-15, earlier wiki instance)¶
"Tigris runs redundant FoundationDB clusters in our regions to track objects. They use Fly.io's NVMe volumes as a first level of cached raw byte store, and a queuing system modelled on Apple's QuiCK paper to distribute object data to multiple replicas, to regions where the data is in demand, and to 3rd party object stores… like S3."
(Source: [[sources/2024-02-15-flyio-globally-distributed-object-storage-with-tigris]])
Why this shape keeps appearing¶
The argument for the split (across JuiceFS, Tigris, Sprites, most modern object stores, many large-scale filesystems):
- Consistency is small, bytes are big. The transactional-ordering requirement lives almost entirely in metadata: "version 3 of this object now exists, replicas in {us-west-2, ap-south-1}, tombstoned at t=…" This is a small-keyspace workload well-suited to an ACID DB.
- Bytes are immutable once written. No per-byte coordination is needed; async replication is sufficient.
- Independent scaling. Metadata QPS and byte QPS move on different axes. Letting each tier scale with its own knob (shards, replicas, regions) beats one-big-system scaling.
- Failure-domain separation. Metadata-tier outage ≠ byte-tier outage. Cache-miss (byte) is not a corruption event (metadata).
- Cheap snapshots / forks. Snapshot = copy metadata. Content is already content-addressable and shared across snapshots.
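The "cheap snapshots" point is worth making concrete: because content is hash-keyed and immutable, a snapshot duplicates only the small metadata map, never the bytes. A hypothetical sketch (all names invented; dicts stand in for both tiers):

```python
import hashlib

content = {}   # content tier: hash -> immutable bytes, shared across snapshots
metadata = {}  # metadata tier: name -> ordered list of chunk hashes

def write(name, chunks):
    hashes = []
    for c in chunks:
        h = hashlib.sha256(c).hexdigest()
        content[h] = c  # write-once; identical chunks dedupe automatically
        hashes.append(h)
    metadata[name] = hashes

def snapshot(name, snap_name):
    # O(metadata) copy; zero bytes of content are duplicated.
    metadata[snap_name] = list(metadata[name])

write("vol", [b"aaaa", b"bbbb"])
snapshot("vol", "vol@t1")
assert metadata["vol@t1"] == metadata["vol"]
assert len(content) == 2  # chunks are shared, not copied
```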
Relation to control/data-plane separation¶
This is concepts/control-plane-data-plane-separation specialised to storage: metadata is the control plane, content is the data plane. The layer boundary is the place to enforce the split — readers always consult metadata first, writers always commit to metadata after content, etc.
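The write-ordering rule ("commit to metadata after content") is what makes the split crash-safe: a failure between the two steps leaves at most an orphaned, garbage-collectable chunk, never metadata pointing at bytes that don't exist. A hypothetical sketch of the invariant (names invented, dicts standing in for both tiers):

```python
content = {}   # content tier
metadata = {}  # metadata tier

def write(name, chunk_hash, chunk, crash_before_commit=False):
    content[chunk_hash] = chunk  # step 1: bytes land first
    if crash_before_commit:
        raise RuntimeError("simulated crash between tiers")
    metadata[name] = chunk_hash  # step 2: metadata commit makes it visible

try:
    write("obj", "h1", b"data", crash_before_commit=True)
except RuntimeError:
    pass

# Invariant: every committed metadata entry resolves to existing content.
assert all(h in content for h in metadata.values())
# The crash left an orphan chunk, not a dangling metadata entry.
assert "h1" in content and "obj" not in metadata
```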
Instances on the wiki¶
- Sprites — SQLite metadata (local, Litestream-backed) + object-store chunks. JuiceFS lineage.
- JuiceFS — canonical open-source implementation. Multiple pluggable metadata backends.
- Tigris — FoundationDB metadata + NVMe byte cache + S3-compat origin.
- AWS S3 internals (not this post, but widely known) — internal metadata service + storage nodes.
- LiteFS (architecturally adjacent) — SQLite-specific variant, shipping LTX-format frames (data) + primary-node-managed state (metadata).
- HDFS / Colossus / MooseFS / Lustre — metadata-server + block-server shape is the same architectural pattern at larger scale.
- Most modern object stores — metadata service + placement-driven byte storage.
Trade-offs¶
- Two operational systems. The metadata DB has its own backup / failover / scaling story; the byte store has its own. The two-system ops burden is strictly greater than that of a single-system design.
- Metadata throughput ceilings. Metadata TPS is the overall write ceiling. Workloads with very high write turnover per object (lots of version churn) hit metadata limits before byte limits.
- Cross-tier skew. Metadata commits can temporarily refer to bytes that haven't yet landed at all requested replicas. Read paths must tolerate miss-then-fetch.
- Complexity for small deployments. For a single-tenant, single-host workload the split is over-engineered; it shines when there is meaningful scale or a multi-location requirement.
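The cross-tier skew above implies a concrete read-path obligation: metadata may already reference bytes that haven't reached the local replica, so a miss must trigger a fetch from the authoritative tier rather than an error. A hypothetical sketch (names invented; dicts stand in for the tiers):

```python
origin = {"h1": b"bytes"}   # authoritative content tier (e.g. an S3 origin)
local_cache = {}            # regional byte cache, populated asynchronously
metadata = {"obj": "h1"}    # metadata commit already landed, ahead of replication

def read(name):
    h = metadata[name]
    if h in local_cache:     # fast path: local replica hit
        return local_cache[h]
    data = origin[h]         # miss-then-fetch from the origin
    local_cache[h] = data    # backfill so later reads hit locally
    return data

assert read("obj") == b"bytes"  # first read falls back to origin
assert "h1" in local_cache      # subsequent reads are local
```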
Metadata-tier backend choices on the wiki¶
| Product | Metadata tier | Byte tier |
|---|---|---|
| Sprites | SQLite + Litestream | S3-compatible object storage |
| Tigris | FoundationDB | NVMe byte cache + S3 origin |
| JuiceFS stock | Redis / MySQL / Postgres / TiKV / etc. | S3-compat (any) |
| LiteFS | Primary-node lease + LTX log | LTX frames distributed via replica set |
Seen in¶
- [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]] — JuiceFS-lineage at Sprites.
- [[sources/2024-02-15-flyio-globally-distributed-object-storage-with-tigris]] — FoundationDB-lineage at Tigris.
Related¶
- systems/fly-sprites
- systems/juicefs
- systems/tigris
- systems/litestream
- systems/sqlite
- systems/foundationdb
- systems/aws-s3
- concepts/object-storage-as-disk-root
- concepts/immutable-object-storage
- concepts/control-plane-data-plane-separation
- patterns/metadata-plus-chunk-storage-stack
- patterns/metadata-db-plus-object-cache-tier
- companies/flyio