

Tiered storage to object store

Pattern

Split a stateful-broker system's storage into two tiers: a hot local tier (disk, pagecache) holding the most recent data, and a cold remote tier (object storage — S3, GCS, Azure Blob) holding historical data. The broker moves segments from hot to cold according to a policy; both leader and follower brokers can then serve historical reads directly from the object store, bypassing local disk.
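The tiering policy can be sketched as a loop over closed log segments: anything older than a retention window (or beyond a local-size budget) is copied to the object store and dropped from local disk. A minimal sketch — all names, thresholds, and the `object_store.put` interface are illustrative assumptions, not KIP-405 APIs:

```python
import time

HOT_RETENTION_S = 6 * 3600      # hypothetical policy knob: keep ~6h locally
HOT_SIZE_BUDGET = 50 * 2**30    # ...or cap local disk usage at 50 GiB

def tier_segments(segments, object_store, now=None):
    """Upload closed segments that fall outside the hot-tier policy,
    then drop them from local disk. `segments` is oldest-first."""
    now = now if now is not None else time.time()
    local_bytes = sum(s["size"] for s in segments)
    kept = []
    for seg in segments:
        too_old = now - seg["closed_at"] > HOT_RETENTION_S
        over_budget = local_bytes > HOT_SIZE_BUDGET
        if seg["closed"] and (too_old or over_budget):
            object_store.put(seg["name"], seg["data"])  # copy to cold tier
            local_bytes -= seg["size"]                  # local disk freed
        else:
            kept.append(seg)
    return kept  # segments remaining on the hot tier
```

Only the leader runs this in Kafka's design; followers learn of tiered segments via metadata rather than re-uploading.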

Canonical production instance: Apache Kafka's Tiered Storage feature (Early Access at time of Kozlovski's 2024-05-09 Kafka-101 explainer). Upstream: KIP-405.

Structural problem this solves

Kozlovski, sources/2024-05-09-highscalability-kafka-101:

"One of the architectural choices that become glaringly nonoptimal once Kafka took off was its decision to colocate the storage with the broker. Brokers host all of the data on their local disk, which brings a few challenges with it, especially at scale."

Four structural walls appear at roughly 10TB of local disk per broker:

  1. Log recovery after ungraceful shutdown takes "hours if not days" rebuilding local log-index files over the whole disk. See concepts/log-recovery-time.
  2. Historical reads exhaust HDD IOPS. HDDs top out at ~120 IOPS/drive (cross-ref concepts/hdd-sequential-io-optimization). Tail consumers hit pagecache and are fine; historical consumers seek back into the disk and compete with producer writes for the 120-IOPS budget — performance tanks.
  3. Hard disk failure triggers a full 10TB re-replication from the leader. That replication is itself a historical read at the source, which amplifies problem 2 across the cluster; an availability-zone-wide failure amplifies it further.
  4. Partition reassignment copies whole replicas byte-for-byte, eating IOPS + wall-clock — one reason rebalancing a non-tiered Kafka cluster at scale is painful.
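Wall 2 is simple arithmetic on the ~120 IOPS/drive figure above. A back-of-envelope sketch — the one-seek-per-fetch cost model is an illustrative assumption:

```python
DRIVE_IOPS = 120  # ~what a single HDD sustains for random I/O

def producer_iops_left(historical_fetches_per_s):
    """IOPS remaining for producer writes once historical reads
    (assumed here to cost one random seek each) take their share
    of the single-drive budget."""
    return max(0, DRIVE_IOPS - historical_fetches_per_s)

# Tail consumers served from pagecache cost ~0 disk IOPS, so:
print(producer_iops_left(0))    # no historical load -> full budget for writes
print(producer_iops_left(100))  # a historical burst starves the producers
```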

Shape of the fix

Tiered Storage breaks the 10TB-per-broker assumption:

"In this new mode, leader brokers are responsible for tiering the data into the object store. Once tiered, both leader and follower brokers can read from the object store to serve historical data." (Source: sources/2024-05-09-highscalability-kafka-101)

Consequences:

  • Local disk stays small. Only hot-tail segments live locally; log-recovery time is bounded by hot-tier size, not 10TB.
  • Historical reads don't exhaust local IOPS. They go to S3, which has its own (vast) IOPS budget.
  • Hard disk failure is less amplifying — the broker doesn't need to pull 10TB of historical data from a peer; most of it already lives in S3.
  • Rebalancing copies only the hot tier; cold tier doesn't move.
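The read path this implies is a simple routing decision by offset: fetches in the hot tail go to local disk/pagecache, older offsets go to the object store. A sketch with hypothetical helper names (not Kafka's internal API):

```python
def serve_fetch(offset, hot_start_offset, local_log, object_store):
    """Route a consumer fetch to the hot or cold tier by offset.
    `hot_start_offset` is the first offset still held on local disk."""
    if offset >= hot_start_offset:
        return local_log.read(offset)          # hot tier: disk/pagecache
    return object_store.fetch_segment(offset)  # cold tier: S3/GCS/Azure Blob
```

This is what keeps historical consumers off the local IOPS budget: the cold branch never touches the drive the producers are writing to.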

Kozlovski cites a 43% producer-performance improvement in development tests when historical consumers were present — i.e., the IOPS contention between producers and historical consumers is exactly the antagonism Tiered Storage eliminates.

Cost angle: "Depending on the object store, this can result in saving cost too as you're outsourcing the replication and durability guarantees." Object stores include their own replication and erasure coding; broker-level replication of cold data becomes redundant.
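The cost argument reduces to: cold data held at replication factor 3 on broker disks versus one logical copy in an object store that provides its own durability. Illustrative arithmetic — the prices below are made-up placeholders, not real cloud pricing:

```python
RF = 3                        # typical Kafka replication factor
DISK_PRICE_GB_MONTH = 0.08    # placeholder local block-storage price
S3_PRICE_GB_MONTH = 0.023     # placeholder object-store price

def monthly_storage_cost(cold_gb, tiered):
    """Cost of holding `cold_gb` of historical data for a month."""
    if tiered:
        return cold_gb * S3_PRICE_GB_MONTH      # one copy; durability outsourced
    return cold_gb * RF * DISK_PRICE_GB_MONTH   # RF broker-local copies
```

With these placeholder numbers, tiering 1TB of cold data is roughly a 10x storage-cost reduction — before counting egress, which cuts the other way (see trade-offs below).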

Design-axis extrapolation → WarpStream

WarpStream takes this pattern to its logical endpoint: all data in S3, stateless brokers, no broker-to-broker replication. Kozlovski's framing of Kafka-ecosystem trajectory:

"WarpStream […] innovated with a new architecture that leverages S3 heavily, completely avoiding replication and broker statefulness." (Source: sources/2024-05-09-highscalability-kafka-101)

Tiered Storage is the moderate version of that idea inside upstream Kafka.

Trade-offs

  • Latency for historical reads is object-store latency (milliseconds, not microseconds) — fine for cold consumers, potentially problematic for consumers that need low-latency replay.
  • Egress cost on public-cloud object stores at scale.
  • Operational complexity — two tiers, two failure modes, tiering-policy tuning.
  • Not a panacea for re-replication — brokers are still stateful on the hot tier (log segments, indexes); Cruise Control-style rebalancing is still needed for the hot tier.

Seen in

  • sources/2024-05-09-highscalability-kafka-101 — canonical wiki statement of Kafka Tiered Storage, the four structural walls it fixes, the 43% producer-improvement datapoint, and the extrapolation to WarpStream's S3-heavy design endpoint.