
PATTERN

Metadata-DB + object-cache tier

Architect a globally distributed object store as three independent layers rather than one monolithic service:

  1. Metadata plane — a strongly-consistent, transactional database holding per-object records (name, version, size, replica-set, lifecycle state). Usually deployed in multiple regions, but replicated within each region rather than across them.
  2. Byte-cache plane — regional, fast-local storage (NVMe, local SSD) that caches raw object bytes close to compute, usually with demand-driven population and bounded retention.
  3. Origin / archival plane — an authoritative (and/or archival) byte store behind the cache tier. Often another object store (S3, GCS) or the same system configured as a cold backend.

Writes commit to the metadata plane first; bytes land in the local byte cache and are then asynchronously propagated (to replicas, to demand regions, to the origin) by a queuing layer. Reads always consult metadata first; the byte cache is filled on demand (with a threshold-based eager-push for small objects) and the origin is consulted on cache miss.
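The write and read flows above can be sketched as a deliberately over-simplified in-memory model. Everything here is hypothetical scaffolding, not Tigris internals: `InMemoryTier` collapses all four planes into one object, `EAGER_PUSH_BYTES` stands in for the unstated small-object threshold, and only a single regional cache is modelled.

```python
from dataclasses import dataclass

EAGER_PUSH_BYTES = 128 * 1024  # hypothetical threshold: small objects pushed everywhere

@dataclass
class MetaRecord:
    version: int
    size: int
    replicas: list

class InMemoryTier:
    """Toy stand-in for the metadata DB, byte cache, queue, and origin planes."""
    def __init__(self):
        self.meta = {}    # key -> MetaRecord            (metadata plane)
        self.cache = {}   # (key, version) -> bytes      (regional byte cache)
        self.origin = {}  # (key, version) -> bytes      (origin/archival plane)
        self.queue = []   # pending propagation tasks    (distribution layer)

    def put(self, key, data, regions):
        # 1. Commit metadata first: the transactional record is the source of truth.
        rec = self.meta.get(key)
        version = (rec.version + 1) if rec else 1
        self.meta[key] = MetaRecord(version, len(data), replicas=["local"])
        # 2. Land bytes in the local byte cache.
        self.cache[(key, version)] = data
        # 3. Enqueue async propagation: eager push for small objects, origin always.
        dests = (regions + ["origin"]) if len(data) <= EAGER_PUSH_BYTES else ["origin"]
        for dest in dests:
            self.queue.append((key, version, dest))
        return version

    def drain_queue(self):
        # The distribution layer moves bytes out-of-band (regional pushes
        # are elided here, since only one regional cache is modelled).
        for key, version, dest in self.queue:
            if dest == "origin":
                self.origin[(key, version)] = self.cache[(key, version)]
        self.queue.clear()

    def get(self, key):
        # Reads always consult metadata first, then fill the cache on miss.
        rec = self.meta[key]
        data = self.cache.get((key, rec.version))
        if data is None:
            data = self.origin[(key, rec.version)]  # pull-on-miss from origin
            self.cache[(key, rec.version)] = data
        return data
```

A `put` commits metadata, writes the local cache, and leaves propagation to `drain_queue`; evicting the local copy and reading again exercises the pull-on-miss path.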

Canonical shape

  • Metadata layer: FoundationDB or similar. Multi-key ACID transactions are the point — atomicity across a write's metadata updates (e.g. "new version committed", "replica set extended", "delete marker added") is where consistency lives.
  • Byte-cache layer: local NVMe / SSD per region. The latency floor, and where the vast majority of reads are served.
  • Distribution layer: durable queue. A QuiCK-style FDB-native queue, or Kafka, or an equivalent. Moves bytes between regions and out to the origin tier.
  • Origin/archival layer: external or same-system cold tier. Often S3 — either because customers already have S3 data to bridge, or because the archival-tier economics are already solved there.

Canonical instance — Tigris

Tigris (on Fly.io) is the concrete reference case this wiki carries. Fly.io's description:

"Tigris runs redundant FoundationDB clusters in our regions to track objects. They use Fly.io's NVMe volumes as a first level of cached raw byte store, and a queuing system modelled on Apple's QuiCK paper to distribute object data to multiple replicas, to regions where the data is in demand, and to 3rd party object stores… like S3."

(Source: sources/2024-02-15-flyio-globally-distributed-object-storage-with-tigris)

This maps exactly onto the three layers:

  • Metadata → FoundationDB clusters per Fly.io region.
  • Byte cache → Fly.io NVMe volumes per region.
  • Distribution → QuiCK-style queue.
  • Origin/archival → S3 (optional, pluggable).

Why this shape

  • Consistency concentrated in metadata. Bytes are immutable; the only thing that actually needs transactional ordering is the metadata state (which version is current, which regions hold replicas, which objects are tombstoned). That's a small keyspace compared to the bytes — well-suited to a KV DB that trades throughput for ACID.
  • Byte plane can be eventually-consistent. Once metadata says "object X version 3 exists at replicas {us-west-2, ap-south-1}", the byte-plane work to materialise version 3 in other regions is "just" async replication. There is no correctness issue with lagging bytes as long as the metadata tells readers where to look.
  • Regional locality for reads. The NVMe cache is the read latency floor. Small objects get eagerly pushed everywhere; large objects materialise on first local read.
  • Pluggable cold tier. Splitting the origin/archival plane from the byte cache lets operators choose the economics — S3 for customers who want to keep existing buckets, local HDD for custom archival, multi-vendor for redundancy.
  • Control / data-plane separation at the layer boundary. Metadata is the control plane for the byte plane; byte-plane hiccups don't affect metadata consistency.
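The "consistency concentrated in metadata" point is concrete: one logical write must atomically flip several small keys. A minimal all-or-nothing sketch — the key layout and the `Txn` helper are hypothetical, not FoundationDB's API, and a real system would add conflict detection and durable logging:

```python
class Txn:
    """All-or-nothing update over a plain dict, mimicking a multi-key ACID commit."""
    def __init__(self, store):
        self.store, self.writes = store, {}

    def set(self, key, value):
        self.writes[key] = value        # staged, invisible until commit

    def commit(self):
        self.store.update(self.writes)  # applied as one unit

def commit_new_version(store, bucket, obj, version, replicas):
    # One transaction covers every metadata key the write touches:
    t = Txn(store)
    t.set(f"{bucket}/{obj}/current", version)              # "new version committed"
    t.set(f"{bucket}/{obj}/{version}/replicas", replicas)  # "replica set extended"
    t.set(f"{bucket}/{obj}/{version}/state", "live")       # lifecycle state
    t.commit()  # readers see all three updates or none

meta = {}
commit_new_version(meta, "photos", "cat.jpg", 3, ["us-west-2", "ap-south-1"])
```

The keyspace touched per write is three tiny records, not the object bytes — which is why a KV database that trades raw throughput for ACID is a good fit here.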

Trade-offs

  • Two replication pipelines to operate. Metadata replication (FDB-internal) + byte replication (queue-driven) are separate systems with separate failure modes, monitoring, and tuning knobs. The two-system blast radius is strictly larger than that of a single-system design.
  • Cross-plane skew windows. After a metadata commit acknowledging a new version, the bytes may not yet exist in every region that metadata says can serve them. Read paths have to handle "metadata says yes, byte cache says no" with some form of pull-on-miss from the origin or from a replica that has the bytes. This is usually transparent but not free.
  • Metadata is the scaling bottleneck. FDB's strictly-serializable write throughput is lower than the aggregate byte-plane throughput. Workloads with very high write TPS per object (metadata-heavy) hit the metadata ceiling before the byte ceiling.
  • Pluggable-origin complexity. Letting S3 be a backend means supporting its latency, bandwidth, rate-limit, and consistency characteristics — all different from the internal NVMe cache tier. Edge cases (origin returning stale bytes, origin throttling mid-sync) cross the layer boundary.
  • Limited benefit for single-region apps. Apps assuming "one bucket in one region" work fine through the S3-compat front, but benefit only when the regional-first byte plane is doing meaningful work — which requires users / compute to actually span regions.
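The "metadata says yes, byte cache says no" skew window above implies a fallback chain on the read path. A hedged sketch — `read_with_fallback` and its plain-dict tiers are hypothetical helpers, not any real client library:

```python
def read_with_fallback(record, local_cache, peers, origin):
    """Resolve a read when metadata names replicas that may not have bytes yet.

    record: metadata row naming the version and the regions that *should* hold it.
    local_cache / origin: dicts keyed by (key, version).
    peers: region name -> that region's cache dict.
    """
    key = (record["key"], record["version"])
    data = local_cache.get(key)
    if data is not None:
        return data                   # fast path: regional NVMe hit
    for region in record["replicas"]:
        data = peers.get(region, {}).get(key)
        if data is not None:          # a replica that already has the bytes
            local_cache[key] = data   # materialise locally for next time
            return data
    data = origin[key]                # last resort: authoritative cold tier
    local_cache[key] = data
    return data
```

Each fallback hop is a latency cliff (local NVMe, then cross-region, then origin), which is why the skew window is usually transparent but not free.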

Relationship to adjacent patterns

  • patterns/caching-proxy-tier — the byte-cache layer is shape-adjacent to a caching proxy, but here the local region is a first-class replica (coherent, discoverable via metadata), not a miss-then-pull cache. The article calls this out: "Tigris isn't a CDN, but rather a toolset that you can use to build arbitrary CDNs, with consistency guarantees, instant purge and relay regions."
  • patterns/presentation-layer-over-storage — the S3-compat API is a presentation layer over a different-shaped backend; the same-shape / different-guts design Warfield describes for S3-at-19 works the other way around here (same S3 shape on top, different architecture below).
  • concepts/control-plane-data-plane-separation — the shape of the layer boundary between metadata and byte cache.
  • concepts/demand-driven-replication — the byte-plane replication policy this pattern naturally enables.
