Skip to content

SYSTEM Cited by 1 source

Redpanda Cloud Topics Metastore

The Cloud Topics Metastore is a custom-built, Raft-replicated key-value store embedded within Redpanda that serves as the metadata tier for Cloud Topics. It maps Kafka offsets to byte-range positions in L1 objects stored in object storage, enabling the read path to locate data without scanning.

Architecture

At its core, the metastore is an LSM tree with pluggable persistence layers, inspired by LevelDB/RocksDB but diverging in what backs each persistence layer:

Layer Traditional LSM Cloud Topics Metastore
Write-ahead log Local file Raft log (replicated, majority-ack durability)
SSTables Local disk Object storage + write-through local cache
Manifest Local file Object storage + Raft-replicated to all replicas

Each metastore partition gets its own LSM instance backed by its own Raft group within an internal Redpanda topic. User-created topic-partitions across the cluster are hash-assigned to a configurable number of metastore partitions.

Persistence details

  • SSTables are written to object storage with a write-through local cache. Both flushes and compactions go to object storage AND local cache, so reads avoid cloud latency on every lookup. This lets metadata scale beyond single-node memory or disk capacity.

  • Manifest: Each flush writes the manifest to object storage, then replicates it to all replicas through Raft. Followers immediately know which SSTables exist. The same Raft entry marks a log truncation point — older operations already captured in the manifest can be trimmed (Source: sources/2026-06-09-redpanda-cloud-topics-the-metastore).

  • WAL = Raft log: Instead of maintaining a separate write-ahead log file, writes to the metastore replicate through a Raft group. A write is considered persistent once acknowledged by a majority of the Raft group. Reuses Redpanda's existing Raft infrastructure.

Failover

Because both the manifest and the WAL (Raft log) are replicated, failover is fast — the new leader already has everything it needs locally with no remote fetch required.

Disaster recovery and read replicas

The pluggable-persistence design enables two key capabilities beyond scale:

  1. Whole Cluster Restore: The metastore can be rebuilt from object storage alone (SSTables + manifest), reconstructing the internal metastore topic and recovering each partition's last-written offset.

  2. Read replicas: A read-replica topic bootstraps by downloading the metastore manifest from object storage, pulls SSTables on demand, and serves consumers using the same metadata queries as the source — with no direct connection to the source cluster.

Both rely on the same tradeoff: metadata in object storage is only as fresh as the last flush (default 10 minutes), which defines the RPO for disaster recovery and the upper bound on read-replica lag.

Seen in

Last updated · 542 distilled / 1,571 read