Background Reconciler for read-path optimization¶
Pattern¶
When the write path is optimised for ingest cost (e.g., cross-partition coalescing into object-storage PUTs), the resulting on-storage layout is rarely read-optimal. Rather than force the write path to serve both needs, run a background process — the Reconciler — that continuously rewrites ingest-layout files into read-layout files, and expose the boundary to readers as a single per-partition watermark so the read path can branch cheaply.
The canonical shape is:
- L0 files (ingest-layout): cross-partition, many small batches, quickly durable.
- L1 files (read-layout): per-partition-colocated, offset-sorted, larger.
- Reconciler: streams L0 → L1 in the background.
- Watermark (e.g., concepts/last-reconciled-offset): per-partition scalar marking the L0/L1 boundary; read path uses a single integer comparison to choose the appropriate path.
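As a rough data-model sketch of the canonical shape above (class and field names are illustrative assumptions, not Redpanda's types), the key difference is what a single object may contain:

```python
from dataclasses import dataclass

@dataclass
class L0File:
    """Ingest layout: one object may interleave small batches from many partitions."""
    batches: list  # (partition_id, offset, payload_bytes) tuples, in arrival order

@dataclass
class L1File:
    """Read layout: one partition's contiguous, offset-sorted range in one large object."""
    partition_id: int
    base_offset: int
    last_offset: int
    payload: bytes
```

An L0 object is cheap to produce but forces a historical reader to sift out one partition's records; an L1 object lets that reader seek once and stream.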
Problem¶
Object-storage-primary streaming brokers face a fundamental tension:
- Write-optimal — coalesce writes across partitions to minimise per-PUT cost (concepts/small-file-problem-on-object-storage).
- Read-optimal — colocate a partition's data into few, large, offset-sorted files so a historical consumer can seek once and stream contiguously.
Trying to satisfy both in one layout fails: write-side coalescing mixes partitions; read-side colocation requires per-partition files. If the write path adopts the read-optimal layout, PUT costs balloon; if it adopts the ingest-optimal layout, historical reads suffer a scattered-read problem (see concepts/l0-l1-file-compaction-for-object-store-streaming).
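To make the write-side tension concrete, a back-of-envelope calculation (all numbers are illustrative assumptions; the per-request price is typical S3 standard PUT pricing, not a figure from the source):

```python
# Illustrative arithmetic: how coalescing changes PUT count and cost.
partitions = 1_000
flush_interval_s = 0.1            # flush every 100 ms for low produce latency
put_price_per_1k = 0.005          # typical S3 standard PUT pricing, USD

# Read-optimal layout: one object per partition per flush.
puts_per_s_per_partition = partitions / flush_interval_s   # 10,000 PUT/s
# Ingest-optimal layout: one coalesced object per flush, partitions mixed.
puts_per_s_coalesced = 1 / flush_interval_s                # 10 PUT/s

def monthly_put_cost(puts_per_s: float) -> float:
    return puts_per_s * 86_400 * 30 / 1_000 * put_price_per_1k

# per-partition layout: ~$129,600/month; coalesced layout: ~$129.60/month
```

The three-orders-of-magnitude gap is why the write path coalesces, and why a separate process has to repair the layout for readers afterwards.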
Forces¶
- Tailing consumers hit memory/cache, not object storage — so the write path's on-storage layout doesn't affect the hot read path.
- Cache-miss / catch-up consumers hit object storage — so the write-path layout becomes the consumer-visible layout exactly for this class of reads.
- Background rewrite costs egress + compute — the rewrite isn't free; it shifts cost from per-read scatter cost to per-byte one-time compaction cost.
- Garbage-collection latency — L0 files must live until L1 is durable + visible; GC cadence affects storage cost.
Solution¶
From the canonical Redpanda Cloud Topics implementation (2026-03-30 architecture deep-dive):
"The Reconciler continuously optimizes the storage layout. It reads the L0 files and reorganizes the data, grouping messages that belong to the same partition and writing them into L1 (Level 1) Files."
"L1 Files are: Much larger: Optimized for high-throughput object storage reading. Co-located: All data for a specific partition range is physically together. Sorted: Organized by offset."
"Once L0 data is successfully moved into L1, it's eligible for garbage collection and will eventually be removed."
Write path stays simple and cheap — it only knows about L0.
Reconciler runs continuously:

1. Reads L0 files.
2. Regroups records by partition.
3. Writes per-partition offset-sorted L1 files.
4. Commits the new L1 metadata to a shared metadata tier ("an internal topic and a key-value store").
5. Advances the per-partition Last Reconciled Offset.
6. Eventually garbage-collects the L0 files whose data is now covered by L1.
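The steps above can be sketched as a single reconciliation pass. All names and structures here are hypothetical, not Redpanda's implementation; L0 batches are modeled as `(partition_id, offset, payload)` tuples and the metadata tier as a plain dict:

```python
def reconcile_once(l0_files, metadata_store, watermarks):
    """One pass of a hypothetical Reconciler: L0 in, L1 out, watermark advanced."""
    # Steps 1-2: read L0 files and regroup records by partition.
    by_partition = {}
    for f in l0_files:
        for pid, offset, payload in f.batches:
            by_partition.setdefault(pid, []).append((offset, payload))
    # Step 3: write per-partition, offset-sorted L1 files.
    for pid, records in by_partition.items():
        records.sort()  # order by offset
        l1 = {
            "partition": pid,
            "base_offset": records[0][0],
            "last_offset": records[-1][0],
            "payload": b"".join(p for _, p in records),
        }
        # Step 4: commit the new L1 metadata to the shared metadata tier.
        metadata_store.setdefault(pid, []).append(l1)
        # Step 5: advance the per-partition Last Reconciled Offset.
        watermarks[pid] = l1["last_offset"] + 1
    # Step 6: every input L0 file is now covered by L1 and GC-eligible.
    return list(l0_files)
```

Note that the watermark is advanced only after the L1 metadata commit, so a crash mid-pass leaves readers on the still-correct L0 path rather than pointing at L1 files that don't exist yet.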
Read path uses the watermark:
"When a consumer requests data, Redpanda routes the request based on where the data currently lives in its lifecycle. Each partition tracks a Last Reconciled Offset. Reads > Last Reconciled Offset: The system reads from L0 … Reads < Last Reconciled Offset: The system reads from L1."
Consequences¶
Positive
- Write path stays ingest-optimal — no in-broker sort, minimal PUT cost.
- Historical reads are sequential — hit L1 files which are per-partition and offset-sorted.
- Scattered-read window is bounded — the Reconciler's trailing edge defines the narrow window where L0-scatter reads can happen; cache usually covers it.
- Minimal read-path coordination — a single integer comparison against the watermark per fetch.
- Metadata separated by cadence — L0 metadata lives on the per-produce Raft path (placeholder batches); L1 metadata lives in a shared tier updated at compaction cadence.
Negative
- Background egress cost — Reconciler reads L0 and writes L1; both are object-storage operations. Every byte gets rewritten once. This shows up as storage-service read/write bandwidth cost.
- Storage amplification window — L0 and L1 both exist for the bytes in flight between them; garbage-collection lag directly inflates storage cost.
- Reconciler is a cluster-critical background process — if it falls far behind the producer rate, the scattered-read window grows and cache-miss reads degrade.
- Compaction-induced compute cost — see concepts/compression-compaction-cpu-cost.
- Metadata-tier coupling — L1 metadata lives in a shared tier; its availability bounds the availability of the L1 read path.
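The storage-amplification window admits a simple estimate (numbers below are illustrative assumptions, not from the source): the bytes double-stored at any moment are roughly ingest rate times the lag between L1 durability and L0 deletion.

```python
# Illustrative: storage amplification ~= ingest rate * GC lag.
ingest_mb_per_s = 500
gc_lag_s = 15 * 60                                # 15 min from L1-durable to L0-deleted
extra_gb = ingest_mb_per_s * gc_lag_s / 1_000     # bytes held in both L0 and L1
# -> 450 GB of double-stored data at any moment at this ingest rate
```

This is why GC cadence appears as a force above: the Reconciler's throughput bounds the scattered-read window, but GC lag independently bounds the storage bill.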
Known uses¶
- Redpanda Cloud Topics — canonical wiki instance (GA in Redpanda Streaming 26.1). The 2026-03-30 architecture deep-dive is the first detailed public description.
- Iceberg snapshot expiry / compaction — analogous on the lakehouse side: snapshot expiry and data-file compaction are also background processes rewriting small ingest-time files into read-friendly larger files. Cloud Topics' Reconciler is a streaming-broker altitude analogue of the same shape.
- LSM-tree compaction — same shape at a different altitude: write-path append to L0 memtables/sstables, background compaction into larger/older levels optimised for read amortisation.
Related¶
- systems/redpanda-cloud-topics — canonical production instance.
- systems/redpanda — the broker.
- systems/aws-s3 — the object-storage substrate; per-GET overhead motivates the L1 larger-file shape.
- concepts/l0-l1-file-compaction-for-object-store-streaming — the before/after file layout.
- concepts/last-reconciled-offset — the coordination watermark exposed to readers.
- concepts/small-file-problem-on-object-storage — the L1 rewrite is a read-side mitigation.
- concepts/iceberg-snapshot-expiry — lakehouse-side analogue.
- concepts/compression-compaction-cpu-cost — the compute-side cost framing.
- concepts/placeholder-batch-metadata-in-raft — the write-path metadata L1 eventually supersedes.
- patterns/object-store-batched-write-with-raft-metadata — the write-path companion pattern that produces L0.
- patterns/tiered-storage-to-object-store — broader family.