Tiered storage as primary fallback

Definition

Tiered storage as primary fallback is the architectural property where a streaming broker / database stores primary (write-path + recent-read-path) data on local NVMe and offloads older data asynchronously to cheaper object storage — such that primary-path availability does not depend on the tiered-storage layer. An outage or error spike in the object-storage layer degrades historical-read UX (slower / failed reads of cold data) but does not block writes or recent reads.

The canonical statement from the Redpanda post (Source: sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage):

"Redpanda stores the primary data on local NVMe disks and sends older data to tiered storage, asynchronously."

Why this matters during cloud-provider outages

Object-storage outages (S3, GCS, Azure Blob) are a canonical cross-service-correlation failure shape: many services in a cloud region depend on the object store, and its outage cascades across them. A streaming broker that requires object storage on the write path has availability coupled to the object store; a broker that offloads asynchronously can remain writable.

The 2025-06-12 GCP outage validated this property for Redpanda: "we noticed an increase in tiered storage errors, which is not Redpanda's primary storage. We didn't get high disk utilization alerts, which we typically receive when the tiered storage subsystem has been experiencing issues for an extended period (days)." The error-rate spike was visible but customer-invisible because the write path did not depend on the failing tier.

The inversion: tiered storage as primary write path

The architectural opposite is "centralized metadata and a diskless architecture" — a streaming broker where the primary data path writes to object storage. This design has different trade-offs:

| Property | Local NVMe primary | Object-store primary (diskless) |
|---|---|---|
| Write latency | Low (local disk) | Higher (network + object-store write) |
| Per-byte storage cost | Higher (NVMe) | Lower (object storage) |
| Write-path availability | Independent of object store | Coupled to object store |
| Scale-to-zero | Bounded by local disk | Unbounded (stateless brokers) |
| Recovery time | Fast (local data) | Depends on re-hydration from object store |
| Correlated-failure risk | Low (local) | High (shared object-store substrate) |

Redpanda's 2025-06-20 retrospective contrasts the two explicitly: "In contrast, other products boasting centralized metadata and a diskless architecture likely experienced the full weight of this global outage."

Redpanda's three-tier storage hierarchy

  1. Local NVMe — write path + recent reads (leader + follower replicas).
  2. Used-but-reclaimable disk space — NVMe space used for caching that can be reclaimed on demand (see concepts/unused-reclaimable-disk-buffer).
  3. Tiered storage (object storage) — older data flushed asynchronously; cold-path reads only.
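The hierarchy above can be sketched as a tier-by-tier read path. The classes and names below (`Tier`, `read`) are illustrative stand-ins under stated assumptions, not Redpanda's actual internals:

```python
class Tier:
    """Minimal in-memory stand-in for one storage tier."""

    def __init__(self):
        self.data = {}

    def holds(self, offset):
        return offset in self.data

    def read(self, offset):
        return self.data[offset]

    def put(self, offset, value):
        self.data[offset] = value


def read(offset, nvme, cache, object_store):
    """Serve a read from the hottest tier that still holds the offset."""
    if nvme.holds(offset):           # tier 1: local NVMe (recent data)
        return nvme.read(offset)
    if cache.holds(offset):          # tier 2: reclaimable disk cache
        return cache.read(offset)
    value = object_store.read(offset)  # tier 3: cold path; slower, error-prone
    cache.put(offset, value)           # warm the cache for subsequent reads
    return value
```

Note that only tier 3 touches the object store, which is why an object-store outage degrades cold reads without affecting tiers 1 and 2.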

The async flush is the load-bearing property: writes ack on local NVMe replication, not on object-store confirmation. Object-store errors back up the flush queue rather than blocking writes.
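A minimal sketch of that decoupling, assuming a toy `Broker` where producer acks depend only on the local log and a background loop drains the flush queue (all names illustrative, not Redpanda internals):

```python
from collections import deque


class Broker:
    """Sketch: writes ack on local replication; object-store upload is async."""

    def __init__(self):
        self.local_log = []         # stands in for the replicated NVMe log
        self.flush_queue = deque()  # segments awaiting upload to object storage

    def produce(self, record):
        self.local_log.append(record)    # "replicate" on local NVMe
        self.flush_queue.append(record)  # schedule the async upload
        return "ack"                     # ack does NOT wait on the object store

    def flush_once(self, object_store_up):
        # Background loop: during an object-store outage the queue simply
        # grows; producers are unaffected.
        if object_store_up and self.flush_queue:
            self.flush_queue.popleft()
```

During an outage `flush_once` makes no progress and the queue lengthens, which is exactly the "errors back up the flush queue" behavior described above.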

Downstream effect of the 2025-06-12 GCP outage

Per the post, the tiered-storage error spike manifested as:

  • Elevated PUT error rates on GCS (object store) during the outage window.
  • Some latency impact on certain API calls (unspecified percentile).
  • No high-disk-utilization alerts — the flush backlog didn't grow long enough to saturate the NVMe buffer.
  • No write-path impact — brokers kept acknowledging producer writes on local NVMe replication.

The absence of disk-utilization alerts ("which we typically receive when the tiered storage subsystem has been experiencing issues for an extended period (days)") is the quantitative boundary: the outage was short enough that the buffer absorbed the backlog.
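The post gives no concrete figures, but the boundary can be sketched with back-of-envelope arithmetic. Both numbers below are hypothetical, chosen only to show the shape of the calculation:

```python
# How long can the reserve absorb a stalled flush backlog?
# Hypothetical figures; not from the Redpanda post.
reserve_bytes = 500 * 1024**3   # 500 GiB of unused + reclaimable disk space
ingest_rate = 100 * 1024**2     # 100 MiB/s sustained produce rate

absorb_seconds = reserve_bytes / ingest_rate
print(f"{absorb_seconds / 3600:.1f} hours")  # → 1.4 hours
```

An outage shorter than that window is customer-invisible; a longer one starts to pressure the write path.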

Caveats

  • Applies to streaming / append-heavy workloads. A random-access OLTP database can't use this shape as easily — the local disk must fit the working set.
  • Async flush means bounded latency to cold reads. A consumer that falls behind by longer than the NVMe retention window has to read from the (cheaper but slower + error-prone) tiered storage. Outages amplify this.
  • Tier promotion / demotion policy matters. Aggressive promotion-to-tiered = more dependency on object store; conservative = larger NVMe footprint.
  • Object-store errors can eventually matter. If the outage runs long enough to fill the unused + reclaimable buffer, the write path does get impacted. The Redpanda post is explicit: "Additionally, as a reliability measure, we leave disk space unused and used-but-reclaimable (for caching), which we can reclaim if the situation warrants it. This outage was not that situation." — i.e., the reserve was sufficient this time.
  • Not a substitute for retries / backpressure. Elevated object-store error rates still require retry discipline in the flush loop; the architecture absorbs the delay, not the failure.
  • BYOC deployment layer matters. In BYOC the object store is the customer's own bucket, which partially shields Redpanda-managed cluster availability from cross-customer object-store outages.
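The retry-discipline caveat can be sketched as a jittered exponential backoff around each upload attempt. This is a generic pattern, not Redpanda's flush loop; `upload` is a hypothetical callable that raises `IOError` on object-store errors:

```python
import random
import time


def flush_with_retry(upload, segment, max_attempts=5, base_delay=0.5):
    """Retry one segment upload with jittered exponential backoff.

    Illustrative only: `upload` stands in for an object-store PUT.
    """
    for attempt in range(max_attempts):
        try:
            return upload(segment)
        except IOError:
            if attempt == max_attempts - 1:
                break
            # Exponential backoff with jitter, capped at 30 s.
            delay = min(base_delay * 2 ** attempt, 30.0)
            time.sleep(delay * random.uniform(0.5, 1.5))
    # Give up for now: the segment stays queued on local NVMe and is
    # retried later, consuming reclaimable buffer rather than blocking writes.
    raise IOError(f"upload of {segment!r} failed after {max_attempts} attempts")
```

The architecture absorbs the delay between attempts; the backoff keeps a failing object store from being hammered in the meantime.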
