Redpanda — Under the hood: Redpanda Cloud Topics architecture
Summary
Redpanda's 2026-03-30 post is the first architectural deep dive on Cloud Topics after its general-availability release in Redpanda Streaming 26.1 (prior wiki coverage via the 25.3 beta launch post covered positioning and cost framing but not the internal mechanism). The post splits the design into five primitives: a Cloud Topics Subsystem that batches writes across partitions in memory; an L0 (Level 0) File that carries that batch to object storage; a placeholder batch written to the per-partition Raft log to anchor the object-storage location under Kafka's transactional/idempotency semantics; a background Reconciler that rewrites L0 files into per-partition L1 (Level 1) Files optimised for historical reads; and a per-partition Last Reconciled Offset watermark that routes reads between L0 and L1. The separation of where metadata lives (the in-broker Raft log) from where data lives (object storage) is what lets Cloud Topics sidestep the cross-AZ replication bandwidth cost while preserving Kafka's standard produce-path guarantees.
Key takeaways
- The architectural split is: metadata in Raft, data in object storage. "The Cloud Topics architecture separates where metadata is stored (each partition's Raft log) and where data is stored (object storage). Traditionally, the data and metadata for the records that are produced are written and replicated using the Raft consensus protocol. Since Cloud Topics writes data directly to object storage, we can bypass the Cross-AZ networking tax incurred when replicating via Raft." This is the load-bearing architectural claim; the rest of the post explains the mechanism that makes it safe. (log-as-truth variant: the log pointer is the truth, data lives in object storage.) (Source: this post)
- Writes are batched across partitions before the object-storage PUT, to kill the small-file problem. "We batch incoming data in memory for a short window defined by time (e.g., 0.25 seconds) or size (e.g., 4MB). We collect this data across all partitions and topics simultaneously. We do this specifically to minimize the cost of object storage; by aggregating smaller writes into larger batches, we significantly reduce the number of PUT requests sent to S3." The 0.25-second / 4 MB example is the first first-party published window for Cloud Topics. This is a concrete instance of the batching-latency trade-off at the broker layer, with multi-partition coalescing (not per-partition) as the explicit cost-optimisation lever. (Source: this post)
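As an illustration of the flush trigger the quote describes, here is a minimal sketch of a cross-partition batcher that flushes on whichever threshold fires first. The class name, method shapes, and defaults are invented for illustration (the 0.25 s / 4 MB values are the post's "e.g." numbers, not confirmed hard limits):

```python
import time

class L0Batcher:
    """Accumulates produce batches across ALL partitions and topics,
    flushing one L0 object when either the size or the time threshold
    is hit. Hypothetical sketch, not Redpanda's internal interface."""

    def __init__(self, max_bytes=4 * 1024 * 1024, max_wait_s=0.25):
        self.max_bytes = max_bytes
        self.max_wait_s = max_wait_s
        self.pending = []            # (topic, partition, payload) tuples
        self.pending_bytes = 0
        self.window_start = time.monotonic()

    def add(self, topic, partition, payload):
        """Add one produce batch; returns the flushed L0 contents, or None."""
        self.pending.append((topic, partition, payload))
        self.pending_bytes += len(payload)
        if self._should_flush():
            return self.flush()
        return None

    def _should_flush(self):
        return (self.pending_bytes >= self.max_bytes
                or time.monotonic() - self.window_start >= self.max_wait_s)

    def flush(self):
        # One PUT carries batches from many partitions and topics: this
        # multi-partition coalescing is what keeps the S3 request count low.
        l0_file = self.pending
        self.pending, self.pending_bytes = [], 0
        self.window_start = time.monotonic()
        return l0_file
```

Note the size-OR-time composition: under high throughput the 4 MB trigger dominates; under low throughput a lone batch still flushes within the 0.25 s window, which is where the latency floor mentioned below comes from.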
- The placeholder batch is what preserves Kafka transactions/idempotency. After the L0 file lands durably, "we replicate a placeholder batch containing the location of the data to the corresponding Raft log for each batch involved in the upload. Then we send an acknowledgement to the producer that the batch is safely persisted." Because the placeholder still flows through the "battle-hardened produce path", Cloud Topics inherits the same transaction and idempotency logic as standard topics — "the data payload lives in the cloud, but the guarantees live in Redpanda". This is the key design choice that makes Cloud Topics a drop-in Kafka topic class rather than a separate API. (Source: this post) See concepts/placeholder-batch-metadata-in-raft.
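The ordering in the quote — durable PUT first, then placeholder replication, then producer ack — can be sketched as follows. All interfaces here (`object_store.put`, per-partition logs supporting `append`) are assumptions for illustration, not Redpanda internals:

```python
import uuid

def produce_l0(object_store, raft_logs, l0_batches):
    """Sketch of the Cloud Topics produce path: one cross-partition L0
    upload, then a per-partition placeholder through Raft, then the ack.
    `l0_batches` is a list of (topic, partition, payload) tuples."""
    # 1. One PUT persists the whole cross-partition L0 file durably.
    l0_key = f"l0/{uuid.uuid4()}"
    object_store.put(l0_key, l0_batches)

    # 2. For each batch in the upload, replicate a *placeholder* through
    #    that partition's Raft log: it records where the data lives, so the
    #    existing produce path (transactions, idempotency) stays unchanged.
    acks = []
    for i, (topic, partition, payload) in enumerate(l0_batches):
        placeholder = {"l0_object": l0_key, "index": i, "bytes": len(payload)}
        raft_logs[(topic, partition)].append(placeholder)
        # 3. Only after the placeholder is replicated is the producer acked.
        acks.append((topic, partition, "acked"))
    return acks
```

The point of the sketch is the ordering: the ack depends on the Raft append of the placeholder, not on the raw data being in the log, which is exactly the metadata/data split the post leads with.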
- L0 is ingest-optimised, L1 is read-optimised; the Reconciler rewrites L0→L1 in the background. L0 files contain "data from many different partitions batched together" — fast and cheap to write, but "reading a single partition's history would require 'scattered reads' across many different files". The Reconciler "reads the L0 files and reorganizes the data, grouping messages that belong to the same partition and writing them into L1 (Level 1) Files." L1 files are "much larger", "co-located" (per-partition), and "sorted" (by offset). L0 becomes eligible for garbage collection once its data is successfully in L1. This is a compaction-style rewrite at the object-storage tier rather than at LSM-tree altitude, driven by read-pattern mismatch with the ingest layout. (Source: this post) See concepts/l0-l1-file-compaction-for-object-store-streaming.
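The L0→L1 rewrite reduces to a group-and-sort over the interleaved L0 contents. A minimal sketch, assuming an entry shape of (partition, offset, payload) invented for illustration:

```python
from collections import defaultdict

def reconcile(l0_files):
    """Sketch of the Reconciler's core transform: take L0 files whose
    entries interleave many partitions, and emit per-partition L1
    contents sorted by offset. Not Redpanda's actual file format."""
    by_partition = defaultdict(list)
    for l0 in l0_files:
        for partition, offset, payload in l0:
            by_partition[partition].append((offset, payload))

    l1_files = {}
    for partition, entries in by_partition.items():
        # Co-located and sorted: one partition per file, ordered by
        # offset, so a historical read becomes one sequential scan
        # instead of scattered reads across many L0 objects.
        entries.sort(key=lambda e: e[0])
        l1_files[partition] = entries
    # Once the data is durably in L1, the source L0 files become
    # eligible for garbage collection.
    return l1_files
```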
- The read router uses a per-partition Last Reconciled Offset watermark. "When a consumer requests data, Redpanda routes the request based on where the data currently lives in its lifecycle. Each partition tracks a Last Reconciled Offset."
- "Reads > Last Reconciled Offset: The system reads from L0. The system follows the pointers in the local Raft logs to find the specific batches in object storage if not found in the local cache."
- "Reads < Last Reconciled Offset: The system reads from L1. This is the highly optimized path for historical reads, allowing us to open large, sorted files and stream data efficiently without scattering." This watermark is the minimal coordination point between the write path (which only appends to L0) and the read path (which branches at this one boundary). See concepts/last-reconciled-offset. (Source: this post)
- Tailing consumers hit the memory cache; L0 object-storage reads are a cache-miss fallback. "L0 files are optimized for fast, cheap ingest. For tailing consumers, which represent the vast majority of streaming workloads, data is typically read directly from the memory cache, offering low latency. However, if a consumer falls behind and needs to read from storage (a cache miss), reading from L0 can be inefficient." This framing means the L0 scattered-read cost only affects consumers whose read offset has fallen behind the memory-cache horizon but is still above the Last Reconciled Offset; the Reconciler's cadence effectively bounds the window in which scattered reads can happen. (Source: this post)
- L1 metadata lives in a shared metadata tier, not the Raft log of each partition. "Metadata for L1 files are stored in a shared metadata tier that's backed by an internal topic and a key-value store. This ensures that the system maintains a robust, consistent view of where your optimized data resides. This includes updating metadata as the underlying data is rewritten by compaction, and removed as the retention policy kicks in." So the on-broker Raft log of each partition only tracks L0 placeholders; L1 is tracked separately — a sensible split, because L1 state changes (from compaction and retention) happen at background-process cadence, not per-produce cadence. (Source: this post)
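To make the metadata split concrete, a minimal sketch of the L1 side as a key-value mapping from (partition, base offset) to an object key, updated by compaction/retention rather than per produce. The key scheme and functions are entirely invented; the post says only that L1 metadata is backed by an internal topic plus a key-value store:

```python
def register_l1_file(kv, partition, base_offset, last_offset, object_key):
    """Record one L1 file in the shared metadata tier (hypothetical
    schema): the offset range [base_offset, last_offset] of `partition`
    lives at `object_key` in object storage."""
    kv[(partition, base_offset)] = {"last_offset": last_offset,
                                    "object": object_key}

def expire_l1_before(kv, partition, retention_offset):
    """Retention at background cadence: drop metadata entries whose whole
    offset range is below the retention horizon; the referenced objects
    then become garbage-collectable."""
    for key in [k for k in kv if k[0] == partition]:
        if kv[key]["last_offset"] < retention_offset:
            del kv[key]
```

The design point the sketch mirrors: these mutations happen at Reconciler/retention cadence, so they fit a shared tier, while the per-produce hot path touches only the partition's own Raft log.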
- Cloud Topics is now GA in Redpanda Streaming 26.1. "With the release of Redpanda Streaming 26.1, Cloud Topics has officially entered General Availability." Prior status on the system page (beta in 25.3 preview, 2025-11-06) needs updating. (Source: this post)
Operational numbers disclosed
- Batch window (example): 0.25 seconds or 4 MB (first-party Redpanda numbers; the post words them as "e.g.", so these are illustrative defaults rather than hard-coded limits). Time-trigger-vs-size-trigger composition — the same axis discussed for producer batching, now at the broker-to-object-storage layer.
- Multi-partition coalescing: all partitions and all topics batched simultaneously per write window (not per-partition batching). Explicit motivation: "by aggregating smaller writes into larger batches, we significantly reduce the number of PUT requests sent to S3."
- Reconciler output (L1 files): "much larger" (quantification not given); "co-located" (one partition per file, with "data for a specific partition range" grouped physically); "sorted" by offset.
- Read routing watermark granularity: per-partition (each partition has its own Last Reconciled Offset).
- Release vehicle: Redpanda Streaming 26.1 (GA release).
Caveats and gaps
- No absolute latency numbers — the post gives no P50/P99 for produce ack, no per-read latency for L0-miss vs L1-hit, no Reconciler cadence, no cache-hit-rate target. The batch window (0.25 s) places a floor on p99 produce latency for Cloud Topics but the post doesn't quantify the tail.
- No cost numbers — eliminating cross-AZ cost replaces it with PUT-request cost + background-compaction egress / storage cost. Net delta isn't disclosed.
- Reconciler placement not disclosed — is the Reconciler a partition-leader-local task? A pool of broker processes? A separate fleet? The post doesn't say.
- No metadata-tier scale numbers — L1 metadata lives in "an internal topic and a key-value store", but topic name, KV-store engine, and scale ceiling aren't disclosed.
- No schema-evolution / compaction-policy discussion — garbage-collection of L0 files after L1 ingest is named but not quantified; interaction with topic retention policy at the L1 level is named but not quantified.
- No cache-design disclosure — "memory cache" for tailing consumers is named but not architected. Is it per-broker? Follower-cache-aware? LRU? Tail-truncating?
- No failure-mode discussion — partial L0 upload failure, placeholder-without-data gap, Reconciler crash mid-rewrite, L1-metadata-tier unavailability — none are covered. The post is an architecture explainer, not a correctness / recovery writeup.
Source
- Original: https://www.redpanda.com/blog/cloud-topics-architecture
- Raw markdown:
raw/redpanda/2026-03-30-under-the-hood-redpanda-cloud-topics-architecture-ab76f366.md
Related
- systems/redpanda-cloud-topics — the system this post explains; page is updated from positioning-only to include the Cloud Topics Subsystem / L0 / Reconciler / L1 / Last Reconciled Offset architecture.
- systems/redpanda — the broker hosting Cloud Topics.
- concepts/placeholder-batch-metadata-in-raft — the mechanism that lets object-storage-backed data inherit Kafka's transactional/idempotency guarantees.
- concepts/l0-l1-file-compaction-for-object-store-streaming — the two-tier object-storage file layout.
- concepts/last-reconciled-offset — the per-partition watermark routing reads between L0 and L1.
- concepts/cross-az-replication-bandwidth-cost — the cost axis Cloud Topics attacks.
- concepts/small-file-problem-on-object-storage — avoided on the write path by multi-partition coalescing; solved on the read path by the L0→L1 Reconciler.
- concepts/batching-latency-tradeoff — the 0.25 s / 4 MB batch window is a concrete instance at the broker-to-object-storage layer.
- concepts/log-as-truth-database-as-cache — the Raft log is the source of truth for what bytes exist; object storage is the addressable data cache.
- concepts/latency-critical-vs-latency-tolerant-workload — the workload-class framing motivating per-topic storage tiering.
- patterns/object-store-batched-write-with-raft-metadata — the canonical pattern Cloud Topics instantiates.
- patterns/background-reconciler-for-read-path-optimization — the broader pattern family (L0 is write-optimised, L1 is read-optimised, background process bridges them).
- patterns/tiered-storage-to-object-store — broader family; Cloud Topics is the extreme point where object storage is the primary rather than cold tier.
- patterns/per-topic-storage-tier-within-one-cluster — Cloud Topics is one of the four topic classes in Redpanda's multimodal-streaming vocabulary.
- companies/redpanda — the company shipping the feature.