Cloud Topics: the Metastore¶
Summary¶
This post from the Redpanda engineering blog (part of the Cloud Topics series) describes the metastore — a custom-built, Raft-replicated key-value store that serves as the metadata tier for Cloud Topics. The metastore maps Kafka offsets to byte-range positions within L1 objects in object storage, enabling consumers to locate data efficiently. Rather than using an external metadata service or the existing tightly-coupled in-memory spill format from Tiered Storage, Redpanda built a general-purpose LSM-tree key-value store with pluggable persistence layers: SSTables go to object storage (with write-through local cache), the manifest replicates via Raft, and the WAL is the Raft log itself.
Key Takeaways¶
-
Metastore role: Maps Kafka offsets → (object-storage object, byte range) plus leader term boundaries, compaction state, and protocol-serving metadata. Scales independently of memory and local disk.
-
Why build custom: Tiered Storage's existing in-memory ordered map with spillover was too tightly coupled to its specific metadata format — adding new metadata kinds required reworking both the serialization format and the spill-to-cloud logic. An external KV store would add operational burden and cost for customers.
-
LSM-tree architecture: Inspired by LevelDB/RocksDB. Writes land in a WAL → absorbed into an in-memory memtable → periodically flushed to sorted SSTables. A manifest tracks what SSTables exist and their key ranges.
-
Partitioning scheme: Each metastore partition gets its own LSM instance backed by its own Raft group within an internal Redpanda topic. User topic-partitions are hash-assigned to a configurable number of metastore partitions.
-
Pluggable persistence — SSTables: Written to object storage with a write-through local cache. Flushes and compactions write to both object storage and local cache, so reads avoid cloud latency on every lookup. This is what enables metadata to scale beyond single-node memory/disk.
-
Pluggable persistence — Manifest: Each flush writes the manifest to object storage AND replicates it to all replicas through Raft. Followers immediately know which SSTables exist and can take over quickly on leader failure. The same Raft entry marks a truncation point — operations already captured in the manifest can be trimmed from the log.
-
Pluggable persistence — WAL = Raft log: Instead of a separate WAL file, writes replicate through a Raft group and are persistent once acknowledged by a majority. Reuses Redpanda's existing Raft infrastructure.
-
Fast failover: Because both manifest and WAL are Raft-replicated, the new leader already has everything locally — no remote recovery needed.
-
Disaster recovery (Whole Cluster Restore): Because SSTables and manifests are in object storage, the internal metastore topic can be rebuilt directly from object storage, then each partition's last-written offset is recovered to know where to continue writing.
-
Read replicas from object storage only: A read-replica topic bootstraps by downloading the metastore manifest, pulls SSTables from object storage on demand, and serves consumers using the same metadata queries as the source cluster — with no direct connection to the source.
-
10-minute flush interval / RPO tradeoff: Default memtable flush is every 10 minutes. This defines the RPO (max metadata that could be lost in disaster) AND the upper bound on read-replica lag. Chosen as a balance between durability and write amplification.
Operational Numbers¶
| Parameter | Value |
|---|---|
| Default memtable flush interval | 10 minutes |
| RPO (recovery point objective) for metadata | ≤ 10 minutes |
| Read-replica metadata lag upper bound | ≤ 10 minutes |
Systems / Concepts / Patterns Extracted¶
Systems: Redpanda Cloud Topics Metastore (new), Redpanda Cloud Topics (update)
Concepts: Raft-replicated WAL as LSM WAL, pluggable persistence layer, SSTable on object storage, manifest replication via Raft, flush interval / RPO tradeoff, metadata scale independence
Patterns: Raft log as LSM WAL, SSTable to object store with write-through cache, manifest via Raft for fast failover, metastore bootstrap from object storage (for DR and read replicas)
Caveats¶
- No production numbers for metadata query latency or throughput.
- Compaction strategy details not disclosed (leveled? size-tiered? hybrid?).
- The "configurable number of metastore partitions" — no guidance on sizing.
- No discussion of write amplification from the Raft + object-store double-write.
- The 10-minute flush is "default" — tunability not discussed.
Source¶
- Original: https://www.redpanda.com/blog/cloud-topics-metastore
- Raw markdown:
raw/redpanda/2026-06-09-cloud-topics-the-metastore-e3bf63de.md
Related¶
- systems/redpanda-cloud-topics — parent system
- systems/redpanda-cloud-topics-metastore — the system described here
- concepts/lsm-compaction — underlying data structure
- concepts/wal-write-ahead-logging — WAL concept (here replaced by Raft)
- concepts/compute-storage-separation — metastore separates metadata compute from storage
- concepts/rpo-rto — flush interval defines RPO
- patterns/tiered-storage-to-object-store — SSTables stored remotely