

Tiered storage to object store

Pattern

Split a stateful-broker system's storage into two tiers: a hot local tier (disk, pagecache) holding the most recent data, and a cold remote tier (object storage — S3, GCS, Azure Blob) holding historical data. The broker moves segments from hot to cold according to a policy; both leader and follower brokers can then serve historical reads directly from the object store, bypassing local disk.
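The tiering policy can be sketched as a loop over closed log segments: anything older than a retention window (or beyond a local-size budget) is copied to the object store and dropped from local disk. A minimal sketch — all names, thresholds, and the `object_store.put` interface are illustrative assumptions, not KIP-405 APIs:

```python
import time

HOT_RETENTION_S = 6 * 3600      # hypothetical policy knob: keep ~6h locally
HOT_SIZE_BUDGET = 50 * 2**30    # ...or cap local disk usage at 50 GiB

def tier_segments(segments, object_store, now=None):
    """Upload closed segments that fall outside the hot-tier policy,
    then drop them from local disk. `segments` is oldest-first."""
    now = now if now is not None else time.time()
    local_bytes = sum(s["size"] for s in segments)
    kept = []
    for seg in segments:
        too_old = now - seg["closed_at"] > HOT_RETENTION_S
        over_budget = local_bytes > HOT_SIZE_BUDGET
        if seg["closed"] and (too_old or over_budget):
            object_store.put(seg["name"], seg["data"])  # copy to cold tier
            local_bytes -= seg["size"]                  # local disk freed
        else:
            kept.append(seg)
    return kept  # segments remaining on the hot tier
```

Only the leader runs this in Kafka's design; followers learn of tiered segments via metadata rather than re-uploading.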

Canonical production instance: Apache Kafka's Tiered Storage feature (Early Access at time of Kozlovski's 2024-05-09 Kafka-101 explainer). Upstream: KIP-405.

Structural problem this solves

Kozlovski, sources/2024-05-09-highscalability-kafka-101:

"One of the architectural choices that become glaringly nonoptimal once Kafka took off was its decision to colocate the storage with the broker. Brokers host all of the data on their local disk, which brings a few challenges with it, especially at scale."

Four structural walls appear at roughly 10TB of local disk per broker:

  1. Log recovery after ungraceful shutdown takes "hours if not days" rebuilding local log-index files over the whole disk. See concepts/log-recovery-time.
  2. Historical reads exhaust HDD IOPS. HDDs top out at ~120 IOPS/drive (cross-ref concepts/hdd-sequential-io-optimization). Tail consumers hit pagecache and are fine; historical consumers seek back into the disk and compete with producer writes for the 120-IOPS budget — performance tanks.
  3. Hard disk failure triggers a full 10TB re-replication from the leader. That replication is itself a historical read at the source, which amplifies problem 2 across the cluster; an availability-zone-wide failure amplifies it further.
  4. Partition reassignment copies whole replicas byte-for-byte, eating IOPS + wall-clock — one reason rebalancing a non-tiered Kafka cluster at scale is painful.
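Wall 2 is simple arithmetic on the ~120 IOPS/drive figure above. A back-of-envelope sketch — the one-seek-per-fetch cost model is an illustrative assumption:

```python
DRIVE_IOPS = 120  # ~what a single HDD sustains for random I/O

def producer_iops_left(historical_fetches_per_s):
    """IOPS remaining for producer writes once historical reads
    (assumed here to cost one random seek each) take their share
    of the single-drive budget."""
    return max(0, DRIVE_IOPS - historical_fetches_per_s)

# Tail consumers served from pagecache cost ~0 disk IOPS, so:
print(producer_iops_left(0))    # no historical load -> full budget for writes
print(producer_iops_left(100))  # a historical burst starves the producers
```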

Shape of the fix

Tiered Storage breaks the 10TB-per-broker assumption:

"In this new mode, leader brokers are responsible for tiering the data into the object store. Once tiered, both leader and follower brokers can read from the object store to serve historical data." (Source: sources/2024-05-09-highscalability-kafka-101)

Consequences:

  • Local disk stays small. Only hot-tail segments live locally; log-recovery time is bounded by hot-tier size, not 10TB.
  • Historical reads don't exhaust local IOPS. They go to S3, which has its own (vast) IOPS budget.
  • Hard disk failure is less amplifying — the broker doesn't need to pull 10TB of historical data from a peer; most of it already lives in S3.
  • Rebalancing copies only the hot tier; cold tier doesn't move.
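The read path this implies is a simple routing decision by offset: fetches in the hot tail go to local disk/pagecache, older offsets go to the object store. A sketch with hypothetical helper names (not Kafka's internal API):

```python
def serve_fetch(offset, hot_start_offset, local_log, object_store):
    """Route a consumer fetch to the hot or cold tier by offset.
    `hot_start_offset` is the first offset still held on local disk."""
    if offset >= hot_start_offset:
        return local_log.read(offset)          # hot tier: disk/pagecache
    return object_store.fetch_segment(offset)  # cold tier: S3/GCS/Azure Blob
```

This is what keeps historical consumers off the local IOPS budget: the cold branch never touches the drive the producers are writing to.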

Kozlovski cites a 43% producer-performance improvement in development tests when historical consumers were present — i.e., the IOPS contention between producers and historical consumers is exactly the antagonism Tiered Storage eliminates.

Cost angle: "Depending on the object store, this can result in saving cost too as you're outsourcing the replication and durability guarantees." Object stores include their own replication and erasure coding; broker-level replication of cold data becomes redundant.
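The cost argument reduces to: cold data held at replication factor 3 on broker disks versus one logical copy in an object store that provides its own durability. Illustrative arithmetic — the prices below are made-up placeholders, not real cloud pricing:

```python
RF = 3                        # typical Kafka replication factor
DISK_PRICE_GB_MONTH = 0.08    # placeholder local block-storage price
S3_PRICE_GB_MONTH = 0.023     # placeholder object-store price

def monthly_storage_cost(cold_gb, tiered):
    """Cost of holding `cold_gb` of historical data for a month."""
    if tiered:
        return cold_gb * S3_PRICE_GB_MONTH      # one copy; durability outsourced
    return cold_gb * RF * DISK_PRICE_GB_MONTH   # RF broker-local copies
```

With these placeholder numbers, tiering 1TB of cold data is roughly a 10x storage-cost reduction — before counting egress, which cuts the other way (see trade-offs below).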

Design-axis extrapolation → WarpStream

WarpStream takes this pattern to its logical endpoint: all data in S3, stateless brokers, no broker-to-broker replication. Kozlovski's framing of Kafka-ecosystem trajectory:

"WarpStream […] innovated with a new architecture that leverages S3 heavily, completely avoiding replication and broker statefulness." (Source: sources/2024-05-09-highscalability-kafka-101)

Tiered Storage is the moderate version of that idea inside upstream Kafka.

Trade-offs

  • Latency for historical reads is object-store latency (milliseconds, not microseconds) — fine for cold consumers, potentially problematic for consumers that need low-latency replay.
  • Egress cost on public-cloud object stores at scale.
  • Operational complexity — two tiers, two failure modes, tiering-policy tuning.
  • Not a panacea for re-replication — brokers are still stateful on the hot tier (log segments, indexes); Cruise Control-style rebalancing is still needed for the hot tier.

Seen in

  • sources/2024-05-09-highscalability-kafka-101 — canonical wiki statement of Kafka Tiered Storage, the four structural walls it fixes, the 43% producer-improvement datapoint, and the extrapolation to WarpStream's S3-heavy design endpoint.