SYSTEM Cited by 5 sources

Redpanda Cloud

What it is

Redpanda Cloud is Redpanda's managed offering, the operator-run counterpart to Redpanda BYOC. It runs both the broker cluster and the control plane inside Redpanda's own infrastructure on the customer's chosen cloud (AWS / GCP / Azure) in a Dedicated tier: each customer gets their own isolated cluster (not shared multi-tenant), provisioned and operated by Redpanda.

This differs from BYOC, where the data plane runs in the customer's own cloud account / VPC and Redpanda runs only the control plane; Redpanda Cloud Dedicated runs both planes.

Canonical properties

  • Multi-cloud, multi-region. Redpanda Cloud clusters are available on AWS / GCP / Azure. Multi-AZ deployments within a region are the default HA shape; multi-region stretch clusters (concepts/multi-region-stretch-cluster) are available for RPO=0 regional-outage tolerance.
  • 99.99% availability SLA / ≥99.999% measured SLO on multi-AZ Redpanda Cloud clusters (concepts/customer-facing-sla). The numbers were disclosed specifically for GCP in the 2025-06-20 retrospective; the same numbers apply broadly. Reaching this level took ~2 years after launch.
  • Replication factor ≥3 enforced. Customers can only increase the replication factor, not lower it below 3.
  • Local-NVMe primary + object-storage tiered secondary. Recent data lives on local NVMe disks on broker VMs; older segments tier to GCS / S3 / Azure Blob asynchronously. Tiered storage is not in the primary write/read path.
  • All core services are redundant. Kafka API, Schema Registry, and Kafka HTTP Proxy each run redundantly within the cluster.
  • Continuous chaos + load testing. Redpanda Cloud tiers' configurations are under continuous exercise (chaos engineering discipline).
  • Strict release-engineering with per-cloud tier certification. Throughput advertised per tier is certified per cloud provider.
  • Feedback-control-loop-monitored phased rollouts. Redpanda operations issue upgrades + cloud-infrastructure changes under feedback control loops and staged rollout discipline, "stopping when user-facing issues are detected."
  • Cell-based architecture at the cluster granularity (concepts/cell-based-architecture) — each cluster is a self-contained cell with no external-metadata critical-path dependencies.
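
To make the SLA and SLO figures above concrete, the arithmetic below converts an availability percentage into a yearly downtime budget. This is only a sketch: the function name and the choice of a 365-day year are assumptions made for illustration, and the source does not state the measurement window Redpanda actually uses.

```python
# Assumption: a 365-day year; the real SLA measurement window is not given in the source.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime (in minutes) permitted over the period
    at the given availability percentage."""
    return period_minutes * (1 - availability_pct / 100)


sla_budget = downtime_budget_minutes(99.99)    # contractual SLA figure
slo_budget = downtime_budget_minutes(99.999)   # measured SLO figure

print(f"99.99%  -> {sla_budget:.2f} min/year")
print(f"99.999% -> {slo_budget:.2f} min/year")
```

Under that assumption, the 99.99% SLA allows roughly 52.6 minutes of downtime per year, and the 99.999% measured SLO corresponds to roughly 5.3 minutes.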

Absent-externalisation property

Redpanda Cloud clusters have no additional critical-path dependencies other than the customer's VPC, compute nodes, and locally-attached disks (Source: sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage). The one customer-elected exception is when GCP Private Service Connect (or AWS PrivateLink-equivalent) is enabled for private client-to-broker connectivity — in that case the private-networking primitive becomes part of the critical path. See systems/gcp-private-service-connect.

This is the same Data Plane Atomicity property canonicalised on BYOC, applied at the Redpanda-operator fleet level.
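
The property can be pictured as a small set computation. The sketch below is purely illustrative: `ClusterConfig`, `private_connectivity`, and `critical_path` are hypothetical names, not part of any Redpanda API, and the dependency labels are taken from the description above.

```python
from dataclasses import dataclass

# Baseline critical-path dependencies as listed in the text above.
BASELINE = frozenset({"customer VPC", "compute nodes", "locally attached disks"})


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical illustration type, not a Redpanda API object."""
    private_connectivity: bool = False  # e.g. GCP Private Service Connect enabled


def critical_path(cfg: ClusterConfig) -> frozenset[str]:
    """Return the set of dependencies on the cluster's critical path."""
    deps = set(BASELINE)
    if cfg.private_connectivity:
        # The one customer-elected exception described in the text.
        deps.add("private-networking primitive")
    return frozenset(deps)


default_deps = critical_path(ClusterConfig())
private_deps = critical_path(ClusterConfig(private_connectivity=True))
print(sorted(private_deps - default_deps))
```

Note that object storage does not appear in either set, which matches the statement that tiered storage is not in the primary write/read path.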

2025-06-12 GCP outage behaviour

On 2025-06-12, GCP experienced a global outage triggered by an automated quota update to its API management system. Redpanda Cloud GCP clusters stayed stable across the disclosed "hundreds of clusters" in the fleet. Timeline and structural insulation analysis is canonicalised in the 2025-06-20 retrospective (sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage).

Key observed behaviour:

  • One cluster materially affected (a staging cluster in us-central1): it lost one node, and the replacement VM didn't return for ~2 hours (matching the regional-outage duration). The cluster survived without customer-visible impact thanks to replication factor ≥3 + AZ spread.
  • Tiered-storage error rate elevated across the GCP fleet (increased PUT-request error rate to GCS observed from ~20:26 UTC on); absorbed without customer-visible impact because tiered storage is not the primary path + disk-reserve headroom was available.
  • Alerting vendor degraded but metrics stack healthy. Self-managed observability (metrics, logging) retained fleet-wide log-search; third-party dashboarding/alerting vendor was impacted (cascade) so notifications arrived delayed.
  • Preemptive SEV4 opened at 19:08 UTC per patterns/preemptive-low-sev-incident-for-potential-impact.
  • Non-critical marketplace-vendor cascade at 19:23 UTC (vendor impacted by Cloudflare's outage, which was in turn connected to GCP's) — classified and deferred.
  • Incident declared mitigated at 21:38 UTC with SEV unchanged; no negative customer impact on the disclosed production fleet.
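
The single-node loss described in the first item can be checked with quorum arithmetic. The sketch below assumes majority-quorum replication (Redpanda replicates partitions with Raft, though this page does not spell that out); the function names are hypothetical and exist only for this example.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a majority quorum."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority remains."""
    return replicas - majority(replicas)


def partition_available(replicas: int, failed: int) -> bool:
    """True if the surviving replicas can still form a majority."""
    return replicas - failed >= majority(replicas)


# With the enforced floor of 3 replicas, losing one node keeps a quorum of 2.
print(tolerated_failures(3))
print(partition_available(3, failed=1))
print(partition_available(3, failed=2))
```

With a replication factor of 3, one lost node still leaves two of three replicas, so a partition keeps its quorum for as long as the replacement VM takes to return; a second simultaneous loss would not be tolerated.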

Seen in

  • sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage — Canonical wiki introduction of Redpanda Cloud (the system page didn't exist before this ingest despite prior passing references). The 2025-06-20 retrospective is the first substantive Redpanda Cloud production-incident retro, documenting the fleet's behaviour through the 2025-06-12 global GCP outage. Canonicalises the system's SLA/SLO numbers, replication-factor floor, storage-tier split, absent-externalisation property, and operational-substrate disciplines (chaos testing, load testing, release engineering per-tier certification, feedback-control-loop-monitored phased rollouts). Also discloses the one customer-elected critical-path exception: GCP Private Service Connect.
  • sources/2025-05-13-redpanda-getting-started-with-iceberg-topics-on-redpanda-byoc — positions Redpanda Cloud Dedicated vs Redpanda Cloud BYOC; Iceberg Topics GA on Dedicated (2025-04-07) → beta on BYOC (2025-05-13).
  • sources/2025-05-20-redpanda-implementing-fips-compliance-in-redpanda — FIPS compliance scope at publication excluded Redpanda Cloud (on roadmap); canonicalises one of the few capability-surface gaps between self-managed and Cloud deployments.