Redpanda Cloud Topics¶
Cloud Topics is a Redpanda 25.3 feature (2025-11-06 preview, beta at launch) that adds a new topic class whose data is written directly to object storage (S3 / ADLS / GCS) while topic metadata is managed in-broker and replicated via Raft. Cloud Topics coexist with traditional Redpanda topics (NVMe-backed, Raft-replicated data), Iceberg topics (Parquet-projected lakehouse sink), and write-caching topics in the same cluster — delivering "one system for all streams" rather than separate clusters per workload class.
Canonical verbatim¶
From the [[sources/2025-11-06-redpanda-253-delivers-near-instant-disaster-recovery-and-more|25.3 launch post]]:
"With Cloud Topics, each batch of messages is passed straight through and written to cost-effective object storage (S3/ADLS/GCS) while topic metadata is managed in-broker — replicated via Raft for high availability — so the cluster can do its job (partition leadership, replica placement, quotas, governance)."
"This approach virtually eliminates the cross-AZ network traffic associated with data replication. You keep millisecond performance where it matters, and pay object-store prices for replication where it doesn't, without sacrificing durability."
Architectural shape¶
Cloud Topics sits at a specific point in the streaming-broker storage-tiering design space:
| Shape | Hot data | Cold data | Metadata | Cross-AZ replication |
|---|---|---|---|---|
| Traditional Redpanda topic | NVMe | NVMe (no tiering) | Raft (in-broker) | Yes — RF-1 copies |
| Redpanda tiered topic | NVMe | Object store | Raft (in-broker) | Yes — hot tier only |
| Cloud Topic (25.3) | Object store | Object store | Raft (in-broker) | No — durability inherited from object store |
| WarpStream | Object store | Object store | External / metadata store | No |
| Confluent Kora (standard/dedicated) | Replica disks | Tiered storage | In-broker | Yes |
Cloud Topics is within-cluster tiering — topics pick their storage substrate independently. Customers can host mission-critical payments topics on NVMe and latency-tolerant observability streams on object storage in the same cluster, with one set of IAM policies, one GitOps workflow, one Kafka API endpoint.
Why per-topic, not per-cluster¶
The post canonicalises the latency-critical-vs-latency-tolerant workload distinction as the motivation:
"Some data sets are latency-critical (e.g., payments, trading, cybersecurity), and others are latency-tolerant (e.g., observability, model training, compliance reporting). Treating those workloads the same is inefficient."
Per-topic tiering means cost structure aligns with business value at the topic granularity rather than forcing a cluster-level commitment.
Cost framing¶
The load-bearing cost claim:
"This approach virtually eliminates the cross-AZ network traffic associated with data replication."
See concepts/cross-az-replication-bandwidth-cost for the cost axis Cloud Topics attacks. Multi-AZ Raft replication — the default HA shape on Redpanda traditional topics — writes each produced byte to RF-1 brokers in other AZs, each of which is a billed cross-AZ transfer. Cloud Topics route writes directly to object storage (which inherits its own multi-AZ durability from the cloud provider), so the per-write cross-AZ cost is replaced by object-store PUT cost.
Business-impact framing from the post:
- "Dramatically lower TCO: Sidestep steep cloud provider networking charges for compliance, security, training data, or batch analytics."
- "Architectural simplification: Stop using a separate platform or cluster just to handle latency-tolerant streaming workloads."
- "A single multimodal streaming engine: … all in one platform. That's less infrastructure to manage and a cleaner mental model for every team."
Position against Confluent¶
Explicit comparison from the post:
"Contrast this with Confluent, where you may need a mix of Kora-powered Confluent Cloud clusters (standard/dedicated or Freight) and the separate Confluent WarpStream engine (BYOC) to satisfy different requirements."
Redpanda's differentiator: one cluster, multi-tier storage vs. Confluent's multi-cluster, per-cluster-tier split.
Multimodal-streaming composition¶
"Run traditional Redpanda topics for low latency and data safety, use write caching for ultra-low latency, Iceberg Topics for push-button lakehouse ingestion, and Cloud Topics for cost-efficient, high-throughput streaming — all in one platform."
Cloud Topics is the fourth topic class in Redpanda's composition vocabulary alongside:
- Traditional topics — NVMe-backed, Raft-replicated, lowest latency.
- Write-caching topics — ack-on-memory with background flush, ultra-low-latency.
- Iceberg topics — simultaneously a Kafka topic and an Iceberg table.
- Cloud topics — object-storage-backed, cost-optimised.
Status (2026-03-30)¶
Generally Available in Redpanda Streaming 26.1 (per the 2026-03-30 architecture deep-dive: "with the release of Redpanda Streaming 26.1, Cloud Topics has officially entered General Availability."). Beta in the prior 25.3 preview (2025-11-06).
Architecture¶
The 2026-03-30 deep-dive is the first detailed public description of Cloud Topics' internals. The load-bearing architectural split:
"The Cloud Topics architecture separates where metadata is stored (each partition's Raft log) and where data is stored (object storage). Traditionally, the data and metadata for the records that are produced are written and replicated using the Raft consensus protocol. Since Cloud Topics writes data directly to object storage, we can bypass the Cross-AZ networking tax incurred when replicating via Raft."
Write path¶
- Kafka API entry — producer records enter the standard Kafka API layer.
- Cloud Topics Subsystem staging — instead of appending to the local Raft log's on-disk payload, records are routed to an in-memory multi-partition staging buffer.
- Batch trigger — the buffer flushes on time or size: "we batch incoming data in memory for a short window defined by time (e.g., 0.25 seconds) or size (e.g., 4MB). We collect this data across all partitions and topics simultaneously. We do this specifically to minimize the cost of object storage; by aggregating smaller writes into larger batches, we significantly reduce the number of PUT requests sent to S3."
- L0 file upload — buffer flushes to cloud object storage as a single file. "We flush this batch directly to cloud object storage. We call this an L0 (Level 0) File." See concepts/l0-l1-file-compaction-for-object-store-streaming.
- Placeholder batch replication — "once the L0 file is safely durable in the cloud, we replicate a placeholder batch containing the location of the data to the corresponding Raft log for each batch involved in the upload." See concepts/placeholder-batch-metadata-in-raft.
- Producer ack — "then we send an acknowledgement to the producer that the batch is safely persisted."
The batch window (0.25 s example) places a floor on produce p99 latency for Cloud Topics — the feature is positioned for latency-tolerant workloads where this is acceptable.
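The write-path steps above can be sketched as a small staging buffer. This is a minimal illustration under stated assumptions, not Redpanda's implementation: the class and field names (`StagingBuffer`, `Placeholder`, `l0_key`, `byte_offset`, `length`) are hypothetical, and only the 0.25 s / 4 MB triggers come from the post's examples.

```python
from dataclasses import dataclass


@dataclass
class Placeholder:
    """Metadata-only stand-in for one batch (field names are hypothetical)."""
    topic: str
    partition: int
    l0_key: str
    byte_offset: int
    length: int


class StagingBuffer:
    """Collects batches from all partitions and flushes on time or size."""

    def __init__(self, max_bytes=4 * 1024 * 1024, max_age_s=0.25):
        self.max_bytes = max_bytes    # size trigger (the post's 4MB example)
        self.max_age_s = max_age_s    # time trigger (the post's 0.25 s example)
        self._pending = []
        self._bytes = 0
        self._opened_at = None

    def append(self, topic, partition, payload, now):
        """Stage one batch; returns True once a flush trigger has fired."""
        if self._opened_at is None:
            self._opened_at = now
        self._pending.append((topic, partition, payload))
        self._bytes += len(payload)
        return self.should_flush(now)

    def should_flush(self, now):
        if not self._pending:
            return False
        too_big = self._bytes >= self.max_bytes
        too_old = (now - self._opened_at) >= self.max_age_s
        return too_big or too_old

    def flush(self, l0_key):
        """Build one L0 body plus one placeholder per staged batch."""
        body = bytearray()
        placeholders = []
        for topic, partition, payload in self._pending:
            placeholders.append(
                Placeholder(topic, partition, l0_key, len(body), len(payload))
            )
            body += payload
        self._pending, self._bytes, self._opened_at = [], 0, None
        return bytes(body), placeholders
```

In this sketch the caller would upload the returned body as a single object, then replicate each placeholder to its partition's Raft log, and only then acknowledge the producer, mirroring the ordering the post describes.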
Semantics preservation¶
"Because we still use the Raft log for this metadata, Cloud Topics inherit the same transaction and idempotency logic as our standard topics. The data payload lives in the cloud, but the guarantees live in Redpanda."
Kafka's transactional and idempotency protocols operate on record metadata (offsets, sequence numbers, transactional control records) — which still flow through Raft in order. The placeholder batch carrying an object-storage pointer is indistinguishable from a standard payload-carrying batch from the perspective of those protocols.
The Reconciler¶
L0 files are ingest-optimal but read-unfriendly: they contain data from many different partitions batched together, so reading a single partition's history would require "'scattered reads' across many different files."
A background process called the Reconciler continuously rewrites L0 into L1 files:
"The Reconciler continuously optimizes the storage layout. It reads the L0 files and reorganizes the data, grouping messages that belong to the same partition and writing them into L1 (Level 1) Files."
"L1 Files are: Much larger: Optimized for high-throughput object storage reading. Co-located: All data for a specific partition range is physically together. Sorted: Organized by offset."
"Once L0 data is successfully moved into L1, it's eligible for garbage collection and will eventually be removed."
L1 metadata lives in a separate shared metadata tier:
"Metadata for L1 files are stored in a shared metadata tier that's backed by an internal topic and a key-value store. This ensures that the system maintains a robust, consistent view of where your optimized data resides. This includes updating metadata as the underlying data is rewritten by compaction, and removed as the retention policy kicks in."
See patterns/background-reconciler-for-read-path-optimization.
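The regrouping the Reconciler performs can be illustrated with a short sketch. It assumes, hypothetically, that an L0 file is a list of `(partition, offset, payload)` tuples; the real file format, sizing, and metadata updates are not disclosed and are not modelled here.

```python
from collections import defaultdict


def reconcile(l0_files):
    """Regroup mixed-partition L0 records into per-partition, offset-sorted runs.

    l0_files: iterable of lists of (partition, offset, payload) tuples.
    Returns {partition: [(offset, payload), ...]} with each list sorted by offset.
    """
    by_partition = defaultdict(list)
    for l0 in l0_files:
        for partition, offset, payload in l0:
            by_partition[partition].append((offset, payload))
    # Sorting by offset gives the "co-located, sorted" property of L1 files.
    return {
        partition: sorted(records, key=lambda record: record[0])
        for partition, records in by_partition.items()
    }
```

A consumer reading one partition's history then touches a single sorted run instead of every L0 file that happened to contain a slice of that partition.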
Read path¶
The read router uses a per-partition watermark called the Last Reconciled Offset:
*"When a consumer requests data, Redpanda routes the request based on where the data currently lives in its lifecycle. Each partition tracks a Last Reconciled Offset.
- Reads > Last Reconciled Offset: The system reads from L0. The system follows the pointers in the local Raft logs to find the specific batches in object storage if not found in the local cache.
- Reads < Last Reconciled Offset: The system reads from L1. This is the highly optimized path for historical reads, allowing us to open large, sorted files and stream data efficiently without scattering."*
Tailing consumers "which represent the vast majority of streaming workloads" typically hit the memory cache first — the L0-scattered-read cost only appears for consumers that fall behind both the memory cache's and the Reconciler's trailing edges. Reconciler cadence effectively bounds that window.
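The routing rule can be written down as a small function. This is a sketch under assumptions: the function name and return labels are hypothetical, the cache is modelled as a plain set of offsets, and because the quoted rule only covers `>` and `<`, the sketch assumes an offset equal to the Last Reconciled Offset is already in L1.

```python
def route_read(requested_offset, last_reconciled_offset, cached_offsets):
    """Pick the tier that serves a fetch for one partition.

    Assumption: offsets at or below the Last Reconciled Offset are in L1.
    """
    if requested_offset <= last_reconciled_offset:
        return "L1"      # large, sorted, per-partition files
    if requested_offset in cached_offsets:
        return "cache"   # tailing consumers usually land here
    return "L0"          # follow the Raft placeholder to object storage
```

For example, with a Last Reconciled Offset of 100 and offsets 150 to 200 cached, a read at 180 is served from memory, a read at 120 falls through to L0, and a read at 40 goes to L1.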
Architecture-at-a-glance¶
Producer
│ produce()
▼
┌──────────────────────────────────────┐
│ Kafka API layer │
│ │ │
│ ▼ │
│ Cloud Topics Subsystem (in-memory) │
│ batch across ALL partitions/topics │
│ trigger: 0.25s or 4MB (example) │
└──────────────────────────────────────┘
│ PUT (single file)
▼
┌──────────────┐ ┌──────────────────────┐
│ Object store │───┐ │ Per-partition Raft │
│ L0 files │ │ │ log: placeholder │
│ (mixed │ │ │ batch (object-store │
│ partitions) │ │ │ pointer) │
└──────────────┘ │ └──────────────────────┘
│ │ │
│ │ Reconciler │ Producer ack
│ │ reads L0 │ after both
│ ▼ ▼
┌──────────────┐ Background rewrite
│ Object store │◀──per-partition, offset-sorted
│ L1 files │
└──────────────┘
│
│ Read router: requested_offset vs Last Reconciled Offset
│ > LRO → memory cache → L0 (via Raft placeholder)
│ < LRO → L1 (via shared L1 metadata tier)
▼
Consumer
What's not disclosed¶
The 2026-03-30 post stops short of several axes that will matter in production:
- Absolute latency — no P50/P99 for produce ack, no per-read latency for L0-miss vs L1-hit.
- Net cost — eliminating cross-AZ cost replaces it with PUT cost + background-compaction egress/storage cost; the net delta is not quantified.
- Reconciler placement — no disclosure of whether the Reconciler runs on partition leaders, a separate pool, or a detached fleet.
- Metadata-tier scale — the L1 metadata tier's internal topic name, KV-store engine, and scale ceiling are not disclosed.
- Failure-mode behaviour — partial L0 upload failure, placeholder-without-data gap, Reconciler crash mid-rewrite, and L1-metadata-tier unavailability aren't covered.
- Cache design — the "memory cache" that serves tailing consumers isn't architected (per-broker? LRU? tail-truncating?).
The Little's Law write-path retrofit (2026-05-05)¶
The [[sources/2026-05-05-redpanda-littles-law-in-practice-with-cloud-topics|2026-05-05
Little's Law post]] is the production-tuning sequel to the
2026-03-30 architecture deep-dive. It discloses a write-pipeline
issue that emerged during internal benchmarking: the Cloud Topics
write path as initially built had substituted a high-latency
upload phase for the prior NVMe replication phase, but had not
provisioned enough concurrency at that stage to absorb the latency
multiplier. Per-connection throughput was capped by
Little's Law at 1 / upload_latency —
roughly 1 RPS per producer connection at ~1 s worst-case object-
storage latency, vs the ~100 RPS per-connection ceiling of the
prior NVMe-replication path.
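The arithmetic behind those two ceilings can be checked directly. Little's Law in its standard form says the number of requests in flight equals throughput times latency, so with a fixed number in flight, throughput is concurrency divided by latency. The sketch below uses hypothetical function names; the ~10 ms figure for the NVMe path is inferred from the ~100 RPS ceiling rather than stated in the source.

```python
import math


def max_throughput_rps(concurrency, latency_s):
    """Little's Law rearranged: throughput = in-flight requests / latency."""
    return concurrency / latency_s


def required_concurrency(target_rps, latency_s):
    """In-flight requests needed to sustain target_rps at the given latency."""
    return math.ceil(target_rps * latency_s)


# One request in flight per connection, as in the initial write path.
upload_bound = max_throughput_rps(concurrency=1, latency_s=1.0)    # 1 RPS
nvme_bound = max_throughput_rps(concurrency=1, latency_s=0.01)     # 100 RPS

# Restoring ~100 RPS at ~1 s upload latency needs ~100 uploads in flight.
needed = required_concurrency(target_rps=100, latency_s=1.0)
```

The same relation shows why latency itself did not need to improve: raising the in-flight count by the latency multiplier recovers the original per-connection throughput.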
Verbatim framing of the bottleneck:
"the latency of the upload phase could be up to 100x slower than the replication phase. This makes it easy to feel the implication of Little's Law, which equates to Throughput = Latency * Concurrency."
The fix is a single extra queueing stage in the upload phase, sized to provide the in-flight concurrency the slow stage needs:
"To improve our throughput with the increased latency, we had to figure out how to increase concurrency. Just like last time, we addressed this with an additional layer of queuing. As shown in the image above, this extra queueing is introduced during the upload phase of the write path, before requests are processed by the replication layer, and helps hide the large latencies caused by cloud object storage."
"The additional queue becomes a mechanism for the producer and networking layers to release more batches into the system. And, with more requests queued in this new layer, we can upload more batches from a single producer in parallel."
Order is restored after the queue, not lost. The post is explicit:
"Once the data is uploaded, we preserve the ordering from the producer and release it into the replication layer, again waiting for the replication layer to ensure our position meets the needs of idempotent workloads. We still hold the producer acknowledgment until the metadata is fully replicated across the cluster. This means we preserve the correct ordering and data durability requirements at every stage while allowing more concurrency in the latency-bound part of request processing."
The producer ack is held until metadata is durable, so Kafka's acks=all and idempotent-producer contracts are preserved. The existing pre-Cloud-Topics pipelined-produce-with-position-guarantee technique provides the order-restoration mechanism; the new upload queue layers on top.
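A small simulation shows how uploads can complete out of order while batches are still released in producer order. This is a sketch using Python's `asyncio`, not the broker's Seastar-based code: `upload` is a stand-in for an object-store PUT with simulated latency, and `max_in_flight` plays the role of the extra queue's depth.

```python
import asyncio


async def upload(seq, latency_s):
    """Stand-in for an object-store PUT; the latency is simulated."""
    await asyncio.sleep(latency_s)
    return seq


async def produce_pipeline(latencies, max_in_flight):
    """Upload batches concurrently, then release them in producer order."""
    gate = asyncio.Semaphore(max_in_flight)   # bounds uploads in flight
    completed_order = []                      # order uploads actually finished
    released = []                             # order handed to replication

    async def bounded_upload(seq, latency_s):
        async with gate:
            result = await upload(seq, latency_s)
            completed_order.append(result)
            return result

    tasks = [
        asyncio.create_task(bounded_upload(seq, latency_s))
        for seq, latency_s in enumerate(latencies)
    ]
    # Awaiting in submission order is the order-restoration step: batch N is
    # never released before batch N-1, however early its upload finished.
    for task in tasks:
        released.append(await task)
    return completed_order, released
```

Running `asyncio.run(produce_pipeline([0.12, 0.06, 0.0], max_in_flight=3))` finishes the uploads in reverse order yet releases them as 0, 1, 2, which is the property the idempotent-producer contract depends on.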
Validation: OpenMessaging Benchmark hit GB/s scale "without needing to change any producer configurations." The fix is broker-internal — no Kafka client API surface change.
The architectural pattern this canonicalises: patterns/concurrency-buffer-stage-for-high-latency-io (the named pattern) and concepts/queue-depth-as-latency-hiding-mechanism (the conceptual underpinning). Cloud Topics is now the canonical wiki instance of this combined shape: patterns/object-store-batched-write-with-raft-metadata (batching across partitions to amortise PUT cost) + patterns/concurrency-buffer-stage-for-high-latency-io (in-flight concurrency to hide upload latency) + patterns/pipelined-produce-with-position-guarantee (producer-order restoration before metadata replication).
Retrospective lesson disclosed by Redpanda:
"When we discovered this bottleneck in our performance testing a few months ago, we realized we had been so focused on building functionality that we hadn't been focused on pushing real-world workloads through the system. Thanks to the entire Cloud Topics team, our carefully thought-out implementation readily accepted the fixes we needed, and what could have been a serious architectural oversight became an insightful process."
This is the candid post-hoc framing: the initial 2026-03-30 architecture would have been throughput-pinned without the upload-queue stage. Production-shape OMB testing surfaced it before GA shipped at 26.1.
Updated write path (post-Little's-Law retrofit)¶
Producer
│ produce()
▼
┌──────────────────────────────────────┐
│ Kafka API layer │
│ ingest checks (~<1ms) │
│ pipelined-produce position handoff │
│ (next request released after │
│ position guarantee, not after │
│ completion) │
└──────────────┬───────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Cloud Topics Subsystem (in-memory) │
│ batch across ALL partitions/topics │
│ trigger: 0.25s or 4MB (example) │
└──────────────┬───────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Upload concurrency-buffer queue │ ◄── retrofit
│ depth N ≈ T × upload_latency │
│ (Little's Law sizing) │
│ N concurrent uploads in flight │
│ per producer connection │
└──────────────┬───────────────────────┘
│ N parallel PUTs
▼
Object store (L0 files)
│
▼
┌──────────────────────────────────────┐
│ Order restoration │
│ merge concurrent uploads back into │
│ producer order before replication │
└──────────────┬───────────────────────┘
│
▼
Per-partition Raft log
(placeholder batch — metadata only)
│
▼
Producer ack (after metadata durable)
The retrofit's only structural change is the upload concurrency-buffer queue plus the explicit order-restoration step. Everything upstream and downstream is unchanged.
L0 garbage collection (2026-05-19)¶
The [[sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection|2026-05-19 post]] (Part 1 of 2) discloses how Cloud Topics decides when an L0 object is safe to delete — a question made non-trivial by the fact that L0 objects are temporary by design, contain data from many partitions, and live across a fleet of brokers without a central index of which partitions still depend on which objects.
Reference counting is rejected¶
The "obvious" framing — each chunk of unreconciled data is a reference, delete when the count reaches zero — is walked through and explicitly rejected:
"this framing belies an ocean of complexity. First, these reference counts must be durable. Partitions themselves are spread across the cluster to balance load, so we'll need to support updates from anywhere. Do they need to be linearizable? […] These are real questions about distributed systems design. […] Redpanda does not track L0 objects this way."
The chosen approach is structurally different: a coarse-grained
logical timestamp (cluster epoch)
embedded in every L0 object ID at creation time, with the GC
decision reduced to comparing the stamp to a clusterwide safe-to-GC
watermark M. Architectural property the post leans on: "No
central index, no shared state, and no coordinated updates."
Cluster epoch — the load-bearing primitive¶
"The cluster epoch is a monotonically increasing counter that we embed in every L0 object ID at creation time. Since the epoch is updated periodically and only ever increases, any given epoch E must eventually age out of the cluster. Once we have reconciled every object created in epoch E, it stands to reason that any L0 object with that epoch can be safely deleted."
The cluster epoch converts a per-object reference-counting question into a per-epoch monotonic-bound question. Canonicalised as concepts/cluster-epoch and patterns/epoch-stamp-on-object-id-for-gc.
Per-partition safe-to-GC watermark M(p)¶
Each Cloud Topic partition publishes a local watermark M(p) —
the highest epoch for which all of its dependencies on objects
from that epoch are resolved (i.e. the Reconciler has lifted them
to L1). The clusterwide aggregate is M = min(M(p)) over all
partitions.
The first attempt — one tracked epoch per partition with strict rejection of older — was abandoned on a leadership-change failure mode:
"If partition leadership moves to a node with a stale epoch cache, we'll reject every new write until cache expiry, which could be minutes away."
The fix: maintain a sliding window of active epochs in a dedicated replicated state machine embedded in each partition's Raft log. Three fields:
| Field | Advance |
|---|---|
| max_applied_epoch | when a strictly greater epoch is committed |
| previous_applied_epoch | when we apply a new max_applied_epoch |
| min_epoch_lower_bound | when reconciler catches up to max_applied_epoch |
Active range: [previous_applied_epoch, max_applied_epoch]. The window slides forward as new epochs are observed. The published per-partition safe-to-GC watermark M(p) is derived from the third field, min_epoch_lower_bound, which decouples window advance (fast — slides on new observation) from safe-watermark advance (slow — waits for Reconciler catch-up). Canonicalised as concepts/sliding-window-epoch-tracking and patterns/per-partition-rsm-for-gc-tracking.
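The three-field window can be sketched as a tiny state machine. Only the field names and their advance conditions come from the post; everything else is an assumption made for illustration, including the class name, the admission rule (accept any epoch at or above `previous_applied_epoch`), and the value `min_epoch_lower_bound` takes on catch-up. The derivation of M(p) is deliberately left out.

```python
class EpochWindow:
    """Per-partition sliding window of active epochs (illustrative only)."""

    def __init__(self, initial_epoch=0):
        self.max_applied_epoch = initial_epoch
        self.previous_applied_epoch = initial_epoch
        self.min_epoch_lower_bound = initial_epoch

    def admits(self, epoch):
        # Assumed rule: anything at or above the window's lower edge is
        # accepted, so a new leader with a slightly stale epoch still writes.
        return epoch >= self.previous_applied_epoch

    def apply(self, epoch):
        """Apply a committed write stamped with `epoch`; False if rejected."""
        if not self.admits(epoch):
            return False
        if epoch > self.max_applied_epoch:
            # Window slides: the old maximum becomes the previous epoch.
            self.previous_applied_epoch = self.max_applied_epoch
            self.max_applied_epoch = epoch
        return True

    def reconciler_caught_up(self):
        # Assumed rule: the lower bound jumps to the current maximum once
        # the Reconciler has lifted everything up to it into L1.
        self.min_epoch_lower_bound = self.max_applied_epoch
```

The point of the sketch is the decoupling: `apply` moves the window immediately, while `min_epoch_lower_bound` stays put until the slower catch-up signal arrives.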
Lazy global aggregation via metadata dissemination¶
"Now that every Cloud Topic partition p tracks an inactive epoch in its own Raft log, all that's left is to combine these into a single global M. Turns out we can piggyback this information on an existing periodic metadata-dissemination service internal to Redpanda."
No new gossip protocol, no new control-plane service, no new RPC
topology — M(p) rides existing metadata channels.
Monotonicity makes stale observations always conservative-safe:
"If a node is temporarily operating on stale metadata, that's fine. A nice side effect of epoch monotonicity is that once we prove some M is safe, it never becomes unsafe. Every epoch < M is gone forever. Or until int64 rollover."
Canonicalised as patterns/lazy-aggregate-from-monotonic-local-state.
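A short sketch shows why stale observations are safe. It assumes a hypothetical object-ID layout of `<epoch>-<suffix>` and uses the `≤ M` comparison drawn in this page's reclamation-sweep diagram; the function names are illustrative.

```python
def global_watermark(per_partition_watermarks):
    """M = min(M(p)) over every partition in this broker's current view."""
    return min(per_partition_watermarks.values())


def epoch_of(object_id):
    """Parse the epoch from a hypothetical '<epoch>-<suffix>' object ID."""
    return int(object_id.split("-", 1)[0])


def sweep(object_ids, watermark):
    """Split objects into (deletable, kept) by comparing stamp to watermark."""
    deletable = [o for o in object_ids if epoch_of(o) <= watermark]
    kept = [o for o in object_ids if epoch_of(o) > watermark]
    return deletable, kept


fresh_view = {"p0": 7, "p1": 5, "p2": 9}
stale_view = {"p0": 6, "p1": 4, "p2": 9}   # older values; M(p) only grows
objects = ["3-a", "5-b", "6-c"]

deletable_fresh, _ = sweep(objects, global_watermark(fresh_view))
deletable_stale, _ = sweep(objects, global_watermark(stale_view))
```

Because each M(p) only increases, a stale view can only produce a smaller M, so the stale broker deletes a subset of what an up-to-date broker would delete and never anything extra.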
Architecture-at-a-glance¶
Stamping (creation)
epoch = current_cluster_epoch()
obj_id = encode(⟨epoch, …⟩)
write_object(obj_id, payload)
│
▼
L0 object in circulation, depended on by partitions p0..pn
Per-partition state (in Raft log)
┌──────────────────────────────────────────┐
│ max_applied_epoch │
│ previous_applied_epoch │
│ min_epoch_lower_bound ─► M(p) = prev(…) │
│ (admission control reads same state) │
└──────────────────────────────────────────┘
│
▼ piggyback on existing dissemination
Per-broker view: {M(p0), M(p1), …, M(pn)}
│
▼
M = min(M(p)) ← lazy, computed at edge
│
▼
Reclamation sweep
for obj in list_objects():
epoch_of(obj.id) ≤ M ? delete(obj) : skip
What the post does NOT cover (Part 2 of 2 forward-references)¶
"Stay tuned for part 2, where we discuss how the garbage collector's design enables us to continually delete thousands of L0 objects without any locally persistent state, explicit coordination, or wasted work."
Specifically deferred:
- The actual deletion mechanism (DELETE RPC distribution, work partitioning, idempotency on broker crash mid-delete).
- Epoch-advancement protocol (who decides to bump the cluster epoch, on what schedule, how it propagates).
- Reconciler catch-up signal mechanism (how the RSM learns the Reconciler is done with epoch E).
- Behaviour under int64 rollover.
Seen in¶
- sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection — Part 1 of 2 on Cloud Topics' L0 garbage collection: the decision mechanism. Walks through the rejected reference-counting framing, then introduces the cluster-epoch primitive (monotonic counter embedded in every L0 object ID at creation), per-partition sliding-window epoch tracking in a Raft-replicated state machine ([max_applied_epoch, previous_applied_epoch, min_epoch_lower_bound]), and lazy clusterwide aggregate M = min(M(p)) piggybacked on Redpanda's existing metadata-dissemination service. Architectural slogan: "No central index, no shared state, and no coordinated updates." Canonicalises concepts/cluster-epoch, concepts/epoch-based-distributed-gc, concepts/sliding-window-epoch-tracking (concepts) + patterns/epoch-stamp-on-object-id-for-gc, patterns/per-partition-rsm-for-gc-tracking, patterns/lazy-aggregate-from-monotonic-local-state (patterns) for the wiki. Part 2 (deletion mechanism) forward-referenced but not yet ingested.
- sources/2026-05-05-redpanda-littles-law-in-practice-with-cloud-topics — production-tuning sequel to the architecture deep-dive. Discloses that the initial Cloud Topics write pipeline was throughput-pinned at ~1 RPS per producer connection because the upload phase's ~100× latency multiplier was not absorbed by enough in-flight concurrency. Fix: an extra queueing stage at the upload phase, sized via Little's Law (Concurrency = Throughput × Latency), with order restoration before metadata replication. Validated on OpenMessaging Benchmark at GB/s scale "without needing to change any producer configurations." Canonicalises concepts/littles-law, concepts/queue-depth-as-latency-hiding-mechanism, concepts/storage-bottleneck-migration, patterns/concurrency-buffer-stage-for-high-latency-io, and patterns/pipelined-produce-with-position-guarantee for the wiki.
- sources/2026-03-30-redpanda-under-the-hood-redpanda-cloud-topics-architecture — architecture deep-dive. First detailed public description of the Cloud Topics Subsystem / L0 files / placeholder batch / Reconciler / L1 files / Last Reconciled Offset mechanism. Confirms GA in Redpanda Streaming 26.1. Canonicalises concepts/placeholder-batch-metadata-in-raft, concepts/l0-l1-file-compaction-for-object-store-streaming, concepts/last-reconciled-offset, patterns/object-store-batched-write-with-raft-metadata, and patterns/background-reconciler-for-read-path-optimization.
- sources/2025-11-06-redpanda-253-delivers-near-instant-disaster-recovery-and-more — canonical wiki source introducing Cloud Topics as the 25.3 per-topic tiering feature. Motivates via the latency-critical-vs-latency-tolerant workload distinction and the cross-AZ replication cost axis.
- sources/2026-03-31-redpanda-261-delivers-the-industrys-first-adaptable-streaming-engine — Redpanda 26.1 GA launch post for Cloud Topics. Canonicalises the "disk-lite" architectural vocabulary and the >90% lower networking costs figure. "Cloud Topics use a pass-through write model that saves the bulk of your messages directly to object storage. We stream the heavy message payloads directly to S3 or GCS, but we keep the brains of the operation—the metadata and Raft consensus—on high-performance local NVMe." Frames Cloud Topics as the core of the pass-through-write + disk-lite shape explicitly positioned against WarpStream-class diskless architectures ("diskless isn't riskless"). Confirms Cloud Topics semantics preservation: "No broken transactions. No metadata lag. No external control plane dependencies."
Related¶
- systems/redpanda — the broker that hosts Cloud Topics.
- systems/aws-s3, systems/google-cloud-storage — two of the three backing object stores (Azure Blob is the third).
- systems/warpstream — the extreme point on the same design axis (everything on S3; Redpanda's Cloud Topic is a per-topic-selectable subset of that shape within a traditionally-deployed cluster).
- systems/confluent-kora — foil named in the post.
- systems/kafka — the wire protocol Cloud Topics still exposes to producers/consumers.
- concepts/cross-az-replication-bandwidth-cost — the cost axis the feature targets.
- concepts/latency-critical-vs-latency-tolerant-workload — the workload-class distinction that motivates per-topic tiering.
- concepts/tiered-storage-as-primary-fallback — the prior Redpanda tiering model (NVMe primary, object-storage secondary). Cloud Topics inverts this for selected topics.
- concepts/broker-write-caching — the parallel axis for ultra-low-latency topics.
- concepts/placeholder-batch-metadata-in-raft — the metadata mechanism that lets Cloud Topics preserve Kafka transactional/idempotency semantics while payload bytes live in object storage. Canonicalised from this feature.
- concepts/l0-l1-file-compaction-for-object-store-streaming — the two-tier object-storage file layout. Canonicalised from this feature.
- concepts/last-reconciled-offset — the per-partition watermark routing reads between L0 and L1. Canonicalised from this feature.
- concepts/small-file-problem-on-object-storage — the write-side of this is avoided by cross-partition batching into L0 files; the read-side is avoided by the Reconciler's L0→L1 rewrite.
- concepts/batching-latency-tradeoff — the 0.25 s / 4 MB batch window is a concrete instance at the broker-to-object-storage layer.
- concepts/log-as-truth-database-as-cache — the broader tenet the placeholder-batch-plus-object-store shape instantiates.
- concepts/stateless-compute — the WarpStream-style shape Cloud Topics approximates per-topic within a stateful-primary cluster.
- patterns/object-store-batched-write-with-raft-metadata — the canonical write-path pattern Cloud Topics instantiates.
- patterns/background-reconciler-for-read-path-optimization — the canonical read-path companion pattern.
- patterns/per-topic-storage-tier-within-one-cluster — the canonical deployment pattern Cloud Topics instantiates.
- systems/redpanda-cloud-topics-metastore — the Raft-replicated LSM key-value store that maps offsets to L1 object-storage positions. Canonicalised in the 2026-06-09 post.
- concepts/pluggable-persistence-layer — the metastore's core architecture: each LSM layer (WAL, SSTables, manifest) backed by a different substrate.
- patterns/raft-log-as-lsm-wal — the metastore's WAL is the Raft log itself; no separate durability layer.
- patterns/sstable-to-object-store-with-write-through-cache — metastore SSTables go to object storage + local write-through cache.
- patterns/manifest-via-raft-for-fast-failover — manifest replicated via Raft for instant leader takeover.
- patterns/metastore-bootstrap-from-object-storage — DR and read-replica bootstrap directly from object-storage metastore state.
- patterns/tiered-storage-to-object-store — the broader pattern family.
- patterns/concurrency-buffer-stage-for-high-latency-io — the upload-stage Little's-Law retrofit pattern. Canonicalised here.
- patterns/pipelined-produce-with-position-guarantee — the pre-existing per-connection pipelining technique that the upload-queue retrofit composes with.
- concepts/littles-law — the algebraic justification for the upload-queue retrofit. Cloud Topics is the wiki's canonical application instance.
- concepts/queue-depth-as-latency-hiding-mechanism — the conceptual underpinning of the upload-queue retrofit.
- concepts/storage-bottleneck-migration — the meta-context framing why the bottleneck recurred at the object-storage substrate.
- concepts/cluster-epoch — the load-bearing primitive of L0 GC: a monotonically increasing counter embedded in every L0 object ID at creation time. Canonicalised here from the 2026-05-19 post.
- concepts/epoch-based-distributed-gc — the GC technique Cloud Topics uses for L0 reclamation, distinct from reference counting. Canonicalised here.
- concepts/sliding-window-epoch-tracking — the per-partition state-machine relaxation that gives leadership-change tolerance for the GC admission-control rules. Canonicalised here.
- concepts/garbage-collection — the parent concept. Cloud Topics' L0 GC is now a third canonical instance alongside Magic Pocket blob GC and LSM tombstone GC.
- patterns/epoch-stamp-on-object-id-for-gc — the stamping half of the L0 GC mechanism. Canonicalised here.
- patterns/per-partition-rsm-for-gc-tracking — the per-partition replicated-state-machine half. Canonicalised here.
- patterns/lazy-aggregate-from-monotonic-local-state — the global-aggregation half: clusterwide M = min(M(p)) computed from per-partition watermarks via existing metadata dissemination. Canonicalised here.
- systems/openmessaging-benchmark — validation substrate.
- companies/redpanda — the company shipping the feature.
Write-request scheduler (adaptive upload parallelism)¶
The component that uploads L0 objects to S3 is the batcher. It aggregates write requests from many partitions into a single L0 object and issues one S3 PUT. The write-request scheduler (2026-06-18) controls how many upload streams run concurrently, adapting dynamically using a buddy-allocator algorithm:
- At startup, all shards form one group (maximum batching, 1× parallelism).
- Under heavy load, groups split in half (doubling upload streams) until the batcher keeps up.
- Under light load, groups merge back — converging to a single stream for cost efficiency.
Decisions are local: each group leader inspects its own backlog and its buddy's via cache-line-padded atomic counters. No global coordinator exists. This fits Seastar's thread-per-core constraint.
Cost impact: Per-shard batching (32 shards) costs ~$120K/year for a 5-broker cluster in PUT requests alone; the adaptive scheduler approaches single-shard costs (~$3,750/year) at low load while scaling throughput at high load (Source: sources/2026-06-18-redpanda-adaptive-write-request-scheduling).
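The split and merge mechanics can be sketched with power-of-two shard groups. This is an illustration under assumptions, not the scheduler's code: groups are modelled as `(start, size)` tuples, the backlog thresholds in `decide` are invented, and the linear cost model is only anchored at the two endpoints the source gives (1 stream at ~$3,750/year, 32 streams at ~$120K/year).

```python
def split(group):
    """Halve a (start, size) shard group into two buddies."""
    start, size = group
    if size < 2:
        raise ValueError("a single-shard group cannot split further")
    half = size // 2
    return (start, half), (start + half, half)


def buddy_of(group):
    """Buddy-allocator rule: the sibling group is found by XOR with the size."""
    start, size = group
    return (start ^ size, size)


def merge(group):
    """Recombine a group with its buddy into the parent group."""
    start, size = group
    return (min(start, start ^ size), size * 2)


def decide(own_backlog, buddy_backlog, split_above=100, merge_below=10):
    """Local decision from two backlog counters (thresholds are hypothetical)."""
    if own_backlog > split_above:
        return "split"
    if own_backlog < merge_below and buddy_backlog < merge_below:
        return "merge"
    return "hold"


def yearly_put_cost(upload_streams, single_stream_cost=3750):
    """Assumed linear model: PUT cost scales with concurrent upload streams."""
    return upload_streams * single_stream_cost
```

Starting from `(0, 32)`, one split yields `(0, 16)` and `(16, 16)`; each half can find the other with `buddy_of` and recombine with `merge`, which is what lets every decision stay local to a group leader and its buddy.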