

Object-store batched write with Raft metadata

Pattern

On the write path of a streaming broker, accept producer records into an in-memory multi-partition staging buffer; on a batch trigger (time or size), PUT the buffered bytes as a single object-storage file; then replicate a placeholder record through each involved partition's Raft log carrying only the object-storage location — and ack the producer once the placeholder is Raft-committed.

This splits the write into two streams:

  • Data bytes → object storage (S3 / GCS / ADLS), amortising per-PUT cost across many partitions and topics, and inheriting the storage tier's multi-AZ durability guarantee from the cloud provider.
  • Record metadata / ordering / transactional coordination → Raft log (per-partition), preserving Kafka transactional and idempotency semantics unchanged from standard topics.

Problem

Standard Kafka / Redpanda topics replicate every produced byte through the broker's consensus layer to RF−1 peer brokers in other AZs. Cross-AZ replication bandwidth is the cost axis this pattern attacks: cross-AZ bandwidth charges often dominate broker operational cost on high-throughput, latency-tolerant workloads (observability, compliance, model-training data).

Naively writing every record to object storage instead eliminates cross-AZ cost but creates two new problems:

  1. Small-file problem — many brokers × many partitions × frequent flushes = a PUT-request storm (concepts/small-file-problem-on-object-storage).
  2. Loss of Kafka guarantees — external-metadata architectures (see systems/warpstream) typically re-implement transactional/idempotent produce semantics with different trade-offs.
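A back-of-envelope calculation (all numbers hypothetical, not from the source) shows why per-partition flushing becomes a PUT storm while per-broker coalescing collapses it:

```python
# Hypothetical cluster: 10 brokers, 1,000 partitions each, flushing 4x/sec.
brokers, partitions_per_broker, flushes_per_sec = 10, 1_000, 4

# Naive approach: every partition flushes its own object per window.
naive_puts = brokers * partitions_per_broker * flushes_per_sec  # 40,000 PUT/s

# This pattern: one coalesced L0 file per broker per flush window.
coalesced_puts = brokers * flushes_per_sec  # 40 PUT/s

amortisation = naive_puts // coalesced_puts  # 1,000x fewer PUT requests
```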

Forces

  • Per-PUT cost dominates on many cloud-object-storage tiers — amortising PUTs across partitions is a major economic lever.
  • Kafka producer API expects synchronous ack with strong ordering/deduplication semantics — clients must not see a different failure envelope.
  • Durability must be established before ack — both the object-storage PUT and the metadata replication must succeed.
  • Cross-AZ bandwidth is on the critical path for traditional Raft-replicated payload; moving payload off this path is the economic prize.
  • Small batch windows (e.g., 250 ms) put a floor on produce p99 latency — the pattern is unsuitable for latency-critical workloads (which keep traditional NVMe-backed topics).
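The time-or-size batch trigger above can be sketched as a small predicate: flush when either bound is hit, whichever comes first. The 4 MB / 250 ms defaults mirror the figures quoted later in the Solution section; the class and method names are illustrative, not Redpanda internals.

```python
import time

class FlushTrigger:
    """Decides when the staging buffer flushes: size OR age, whichever
    trips first. Defaults echo the '0.25 seconds or 4 MB' window quoted
    in the Solution section; everything else here is a sketch."""

    def __init__(self, max_bytes=4 * 1024 * 1024, max_age_s=0.25):
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.buffered = 0
        self.opened_at = None  # timestamp of first record in this batch

    def add(self, nbytes, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            self.opened_at = now
        self.buffered += nbytes

    def should_flush(self, now=None):
        if self.opened_at is None:
            return False  # empty buffer never flushes
        now = time.monotonic() if now is None else now
        return (self.buffered >= self.max_bytes
                or now - self.opened_at >= self.max_age_s)

    def reset(self):
        self.buffered = 0
        self.opened_at = None
```

The age bound is what puts the floor on produce p99: a record arriving at an empty buffer waits up to `max_age_s` before its PUT even begins.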

Solution

From the canonical Redpanda Cloud Topics implementation (2026-03-30 architecture deep-dive):

  1. In-memory cross-partition batching. Producer records enter the broker's Kafka API layer as usual but are routed to a Cloud Topics Subsystem buffer that aggregates records across all partitions and all topics for a short window ("e.g., 0.25 seconds or 4 MB").
  2. Object-storage PUT as an L0 file. The buffer flushes as a single file to cloud object storage. "We flush this batch directly to cloud object storage. We call this an L0 (Level 0) File."
  3. Placeholder replication via Raft. "Once the L0 file is safely durable in the cloud, we replicate a placeholder batch containing the location of the data to the corresponding Raft log for each batch involved in the upload." See concepts/placeholder-batch-metadata-in-raft.
  4. Producer ack. "Then we send an acknowledgement to the producer that the batch is safely persisted."
  5. Semantics preservation. "Because we still use the Raft log for this metadata, Cloud Topics inherit the same transaction and idempotency logic as our standard topics. The data payload lives in the cloud, but the guarantees live in Redpanda."
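The five steps above can be sketched as one flush cycle. The dict-backed `object_store` and `raft_logs` are stand-ins for real clients, and the object-key scheme is invented for illustration; only the ordering (PUT first, then per-partition placeholders, then ack) is taken from the source.

```python
from collections import defaultdict

def flush_and_ack(staged, object_store, raft_logs):
    """One flush cycle of the pattern:
      1-2. coalesce records from many partitions and PUT one L0 object,
      3.   replicate a placeholder (location + byte range) into each
           involved partition's Raft log,
      4.   ack only after the PUT and every placeholder are durable.
    `staged` maps partition -> buffered payload bytes."""
    # Steps 1-2: one PUT for all partitions; remember each slice's range.
    blob, ranges = bytearray(), {}
    for partition, payload in staged.items():
        ranges[partition] = (len(blob), len(payload))
        blob += payload
    key = f"l0/{len(object_store)}"   # illustrative key scheme, not Redpanda's
    object_store[key] = bytes(blob)   # must succeed before any placeholder

    # Step 3: placeholder per involved partition — location only, no payload.
    for partition, (offset, length) in ranges.items():
        raft_logs[partition].append(
            {"object": key, "offset": offset, "length": length})

    # Step 4: returning here models the producer ack.
    return key
```

A usage example with two partitions of one topic coalesced into a single object:

```python
store, logs = {}, defaultdict(list)
key = flush_and_ack({("t", 0): b"aaaa", ("t", 1): b"bb"}, store, logs)
# store[key] == b"aaaabb"; each partition's Raft log holds one placeholder.
```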

The critical design insight: Kafka's transactional/idempotency protocol only needs record metadata (offsets, sequence numbers, transactional control records) to flow through consensus in order — the record payload bytes can live anywhere.
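To make that insight concrete, here is a guess at the minimum a placeholder batch must carry for ordering and idempotency to survive in consensus. The field names are hypothetical — Redpanda's actual placeholder schema is not given in the source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlaceholderBatch:
    """What flows through Raft in place of the payload. Field names are
    illustrative guesses at the minimum needed to preserve Kafka
    ordering and idempotency semantics; not Redpanda's real schema."""
    object_key: str        # where the L0 file lives in object storage
    byte_offset: int       # start of this partition's slice in the L0 file
    byte_length: int       # length of the slice
    base_offset: int       # first Kafka offset assigned to the slice
    record_count: int      # slice covers base_offset .. last_offset()
    producer_id: int = -1  # idempotency metadata rides along unchanged
    first_sequence: int = -1

    def last_offset(self):
        return self.base_offset + self.record_count - 1
```

Everything payload-sized is reduced to a key plus a byte range; everything the transactional/idempotency protocol inspects still commits through Raft in partition order.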

Consequences

Positive

  • Cross-AZ cost collapses to object-store cost — the amortised PUT rate replaces per-byte cross-AZ egress. On volume-heavy streams the economics can shift dramatically.
  • Small-file problem avoided on the write path by cross-partition coalescing.
  • Kafka API compatibility retained — producers and consumers see identical protocol surface. Transactional producers, idempotent producers, exactly-once consumers all work.
  • Per-topic opt-in — the pattern is applied at the topic level (see patterns/per-topic-storage-tier-within-one-cluster), allowing mixed latency-critical + latency-tolerant topics in the same cluster.
  • Durability inherited from cloud storage tier — S3 / GCS / ADLS provide multi-AZ durability as a property of the service itself; the broker does not need to replicate the payload.

Negative

  • Batch-window latency floor — e.g., 250 ms makes this pattern unsuitable for latency-critical topics (payments, trading, cybersecurity — see concepts/latency-critical-vs-latency-tolerant-workload).
  • Dual-durability requirement — must wait for both PUT success and Raft commit before ack; either failure blocks acknowledgement.
  • Historical-read cost — L0 files contain mixed-partition data; historical reads require either a cache hit or a background rewrite into per-partition files (see patterns/background-reconciler-for-read-path-optimization).
  • Partial-failure recovery complexity — if the PUT succeeds but placeholder replication fails, the bytes are orphaned in object storage; the opposite failure leaves a dangling placeholder (neither case is explicitly addressed in the 2026-03-30 post).
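The historical-read cost above can be sketched as placeholder resolution: a fetch scans the partition's placeholder entries for the requested offset, then does a ranged read of only that partition's slice of the mixed-partition L0 object. The entry shape matches the flush sketch in the Solution section and is illustrative, not Redpanda's on-disk format; a real implementation would also consult the cache or reconciled per-partition files first.

```python
def read_partition_range(raft_log, object_store, start_offset, max_batches):
    """Resolve a consumer fetch against placeholder entries. Each entry is
    a dict {"object", "offset", "length", "base_offset", "count"} — an
    assumed shape, not a documented one. Returns (base_offset, bytes)
    pairs for batches at or after start_offset."""
    out = []
    for entry in raft_log:
        last = entry["base_offset"] + entry["count"] - 1
        if last < start_offset:
            continue  # batch lies entirely before the requested range
        blob = object_store[entry["object"]]
        # Ranged read: only this partition's slice of the L0 file.
        slice_ = blob[entry["offset"]:entry["offset"] + entry["length"]]
        out.append((entry["base_offset"], slice_))
        if len(out) >= max_batches:
            break
    return out
```

Note that every cache miss still fetches from an object holding other partitions' data interleaved, which is exactly the pressure the background rewrite into per-partition files relieves.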

Known uses

  • Redpanda Cloud Topics — canonical instance (GA in Redpanda Streaming 26.1). The only publicly-documented implementation using Raft-replicated placeholder batches for object-store-backed Kafka topic semantics at the time of writing.
  • WarpStream — related but distinct: metadata lives in an external store rather than an in-broker Raft log. Eliminates the Raft-commit step of this pattern in exchange for external-metadata dependency.