REDPANDA 2025-06-21

Redpanda — Behind the scenes: Redpanda Cloud's response to the GCP outage

Summary

Redpanda (unsigned, 2025-06-21) publishes a production-incident retrospective of the 2025-06-12 Google Cloud Platform global outage — triggered by "an automated quota update to their API management system" — from Redpanda Cloud's perspective, across the ~3-hour incident window 18:41–21:38 UTC. Framed explicitly as a "Redpanda-Cloud-emerged-fine" story while disclosing the one affected cluster (a staging cluster in us-central-1 that lost a node with no replacement coming back during the outage, ~2 hours to replace). Load-bearing framing: "This post provides a brief timeline of events from our own experience, our response, previously untold details about Redpanda Cloud, and closing thoughts on safety and reliability practices in our industry."

Canonicalises four load-bearing substrate primitives the wiki had been using implicitly:

  1. Cell-based architecture as Redpanda Cloud's explicit product principle — distinct from but complementary to Data Plane Atomicity (which canonicalises the invariant; cell-based architecture canonicalises the deployment shape). Canonical verbatim: "Redpanda Cloud clusters do not externalize their metadata or any other critical services. All the services needed to write and read data, manage topics, ACLs, and other Kafka entities are co-located, with Redpanda core leading the way with its single-binary architecture."
  2. Butterfly effect as a first-class system-design primitive from chaos theory — "GCP's seemingly innocuous automated quota update triggered a butterfly effect that no human could have predicted." Names the non-linearity of complex systems (output change not proportional to input change) as the axiom that forces the named reliability practices: closing feedback control loops, phasing change rollouts, shedding load, applying backpressure, randomizing retries, defining incident response processes.
  3. Feedback control loops in change management — the control-theory framing for phased rollouts verbatim: "As operations are issued, such as Redpanda or cloud infrastructure upgrades, we try to close our feedback control loops by watching Redpanda metrics as the phased rollout progresses and stopping when user-facing issues are detected."
  4. Third-party dashboarding/alerting as partial-failure-mode observability dependency — Redpanda self-hosts observability data but uses a third-party for dashboarding + alerting. During the outage the third-party was partially affected, yielding degraded but usable observability: metrics queryable via self-managed substrate, but alert notifications lost. "We deemed the loss of alert notifications not critical since we were still able to assess the impact through other means."
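
The feedback-control-loop framing in (3) can be sketched as a simple gate: upgrade a cohort, observe, and stop the rollout when a user-facing regression appears. This is a minimal illustration, not Redpanda's implementation — the post does not disclose gate metrics, cohort sizing, or pause thresholds, so every name below is hypothetical:

```python
def user_facing_error_rate(cohort):
    """Hypothetical stand-in for querying Redpanda metrics for a cohort."""
    return 0.0  # placeholder: pretend the rollout is healthy

def upgrade(cohort):
    """Hypothetical stand-in for applying a Redpanda or infra upgrade."""
    pass

def phased_rollout(clusters, phase_sizes=(1, 10, 100), max_error_rate=0.01):
    """Close the loop: upgrade a phase, watch metrics, pause on regression."""
    done = 0
    for size in phase_sizes:
        cohort = clusters[done:done + size]
        if not cohort:
            break
        upgrade(cohort)
        # The observation step is what "closes" the control loop.
        if user_facing_error_rate(cohort) > max_error_rate:
            return ("paused", done)  # a human investigates before resuming
        done += len(cohort)
    return ("complete", done)
```

The essential property is that each phase's output (observed error rate) feeds back into the decision to continue, rather than the rollout proceeding open-loop.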

Plus four patterns that this post canonicalises:

  1. Preemptive low-severity incident on elevated-risk signal — at 19:08 UTC Redpanda declared SEV4 "in preparation for the worst" before any customer impact was observed. Incident declaration as risk-management bookkeeping, not as a response to confirmed user-facing harm.
  2. Proactive customer outreach on elevated error rate — at 20:56 UTC Redpanda reached out to "customers with the highest tiered storage error rates to ensure we were not missing anything, and also to show our support" — despite having full visibility into the BYOC clusters it manages on behalf of customers ("we know the answers to the questions, but we ask anyway").
  3. Cell-based architecture as blast-radius reduction — the AWS Well-Architected-named "architectural pattern aimed at reducing the impact radius of failures, which also improves security" instantiated as Redpanda's single-binary + co-located-services shape. Explicit contrast: "other products boasting centralized metadata and a diskless architecture likely experienced the full weight of this global outage."
  4. Hedged observability stack via self-hosted data + third-party UI — splitting the observability tier so one vendor's outage only degrades (dashboarding/alerting) rather than blinds (metric/log collection). Load-bearing self-disclosure verbatim: "Had we kept our entire observability stack on that service, we would have lost all our fleet-wide log searching capabilities, forcing us to fail over to another vendor with exponentially bigger cost ramifications given our scale."

Two secondary substrate primitives canonicalised:

  1. Tiered storage as fallback, not primary: "Redpanda stores the primary data on local NVMe disks and sends older data to tiered storage, asynchronously." Local NVMe is the availability-critical path; tiered storage errors degrade historical-read UX but cannot block writes.
  2. Unused + used-but-reclaimable disk space as reliability buffer: "as a reliability measure, we leave disk space unused and used-but-reclaimable (for caching), which we can reclaim if the situation warrants it." The reserve is deliberate, not incidental.
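
The fallback relationship in (1) — writes acknowledged from local storage, with tiered upload decoupled and asynchronous — can be sketched minimally. All names here are hypothetical and the actual segment/upload machinery is far richer; the point is only that a tiered-storage error cannot sit on the write path:

```python
import queue

class TieredLog:
    """Sketch: local appends ack immediately; tiered upload is queued
    and best-effort, so tiered-storage errors cannot block writes."""

    def __init__(self):
        self.local = []                  # stand-in for local NVMe segments
        self.upload_queue = queue.Queue()  # drained by a background uploader

    def append(self, record):
        self.local.append(record)        # availability-critical path
        self.upload_queue.put(record)    # asynchronous, failure-tolerant
        return len(self.local) - 1       # offset acked from local state only
```

An upload failure here would surface as elevated tiered-storage error metrics (as in the incident) while `append` continues to succeed.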

Key takeaways

  • Zero customer impact from a global cloud outage on a managed streaming product. Hundreds of Redpanda Cloud GCP clusters remained healthy through GCP's ~3-hour global API-management outage. The single affected cluster was a staging cluster in us-central-1 — the same region that saw the longest GCP recovery. That customer's production cluster was unaffected. Per the post: "Out of hundreds of clusters, we were lucky that only one cluster was affected."
  • Cell-based architecture is the causal mechanism. Per the post's explicit framing, Redpanda's single-binary + co-located-services shape is the cell-based architecture instantiation that kept the blast radius bounded — "Redpanda Cloud clusters do not externalize their metadata or any other critical services." Load-bearing contrast with "centralized-metadata + diskless" streaming-broker designs.
  • SLA design trades visible cost for invisible reliability. The four-nines availability SLA (99.99% headline, with a five-nines ≥99.999% design target) is enforced by six concrete substrate choices: (a) replication factor ≥ 3 on all topics; customers cannot lower it, only raise it. (b) Local NVMe primary + asynchronous tiered storage for older segments. (c) Redundant Kafka API / Schema Registry / Kafka HTTP Proxy services. (d) No critical-path dependencies beyond the VPC, compute nodes, and locally attached disks — except when Private Service Connect (PSC) is enabled, in which case PSC becomes critical-path. (e) Continuous chaos-testing + load-testing of each cluster tier's configurations. (f) Release-engineering discipline that certifies advertised tier throughput in each cloud provider, with feedback-control-loop-guarded phased rollouts.
  • Incident timeline: 18:41 → 21:38 UTC. Notified by GCP TAM 18:41. Began impact assessment 18:42. Noticed degraded monitoring (third-party dashboarding/alerting partially affected) 18:43. Preemptive SEV4 declared 19:08 "in preparation for the worst" despite no customer impact and no customer tickets — classic preemptive-SEV pattern instance. Cloud-marketplace vendor reported issues 19:23 (the Cloudflare 2025-06-12 outage cascade; Redpanda "put it on the waiting list" — non-critical degraded dependency). Google identified + mitigated triggering cause 19:41. Delayed tiered-storage-error alerts arrived 20:26. Proactive customer outreach at 20:56. Incident considered mitigated 21:38 with SEV unchanged (SEV4) and no evidence of negative customer impact.
  • Redpanda customer context changed the urgency dial. "For some of them, GCP's Pub/Sub served as the data source for their Redpanda BYOC clusters, so they needed to recover that first. While this meant Redpanda's operational status was less critical in those cases, it was still one less element for them to worry about." Redpanda's position in the customer's stack (downstream of GCP-native sources) meant upstream outages limited even a counterfactually-affected Redpanda's urgency.
  • Observability-vendor split paid dividends. The 2024 decision to move observability data + collection to self-hosted infrastructure (keeping the third-party only for dashboarding + alerting) meant the Cloudflare-triggered dashboarding/alerting outage only degraded Redpanda's observability posture — metric queries still worked, just with delayed alerts. If the full observability stack had remained on the affected third-party, "we would have lost all our fleet-wide log searching capabilities, forcing us to fail over to another vendor with exponentially bigger cost ramifications given our scale." Canonicalises hedged observability stack as substrate-level reliability primitive distinct from application-level redundancy.
  • Butterfly effect named explicitly as design axiom. "Modern computer systems are complex systems — and complex systems are characterized by their non-linearity, which means that observed changes in an output are not proportional to the change in the input." Post frames GCP's automated-quota-update as a canonical complex-systems non-linearity, citing chaos-theory / systems-thinking directly. The named mitigations are the wiki's broader reliability corpus: closing feedback control loops, phasing change rollouts, shedding load, applying backpressure, randomizing retries, defining incident response processes.
  • Single-node-loss mechanism disclosed. "During its incident analysis, we found evidence that the GCP outage was a contributing factor in losing one node and having no replacement coming back. However, this event was isolated to us-central-1 and an uncommon interaction between internal infrastructure components of the cluster. Out of hundreds of clusters, we were lucky that only one cluster was affected. It took GCP around two hours to launch the replacement node, roughly the duration of the outage in us-central-1." The "uncommon interaction between internal infrastructure components" is not walked. The mitigation (staging vs production cluster) was customer-side, not architectural.
  • Industry framing: CrowdStrike parallel. Closing thoughts name the 2024 global CrowdStrike outage as evidence that the gap ("similar controls were missing to enable safer global rollouts") is industry-wide. Argument: safer global rollouts require control theory in change management tools plus systems-thinking awareness of non-linearity as an engineering default.
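
Of the mitigations the post names (feedback loops, phased rollouts, load shedding, backpressure, randomized retries, incident process), randomized retries are the easiest to show concretely. A common full-jitter exponential-backoff scheme — illustrative only, not taken from the post:

```python
import random

def retry_delay(attempt, base=0.1, cap=30.0):
    """Full-jitter exponential backoff: each retry waits a uniformly random
    time up to min(cap, base * 2**attempt), spreading out clients that are
    all recovering from the same shared outage so they do not retry in
    lockstep and re-overload the recovering dependency."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Without the jitter, every client that failed at the same instant retries at the same instant — exactly the kind of input-synchronisation that non-linear systems amplify.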

Operational numbers

  • Hundreds of GCP clusters healthy through the outage (no specific count disclosed).
  • One affected cluster — staging, us-central-1, single node loss + failed replacement.
  • ~2 hours to replace the lost node (matching GCP's us-central-1 recovery duration).
  • Zero customer support tickets received for the duration.
  • 99.99% availability SLA (four nines) headline.
  • ≥99.999% availability SLO (five nines) design target.
  • Replication factor ≥ 3 enforced on all topics; cannot be lowered by customers.
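
For scale, the gap between the headline SLA and the design target is roughly a 10× difference in annual downtime budget — a quick arithmetic check:

```python
def downtime_budget_minutes(availability, days=365):
    """Allowed downtime per period (minutes) for an availability target."""
    return (1 - availability) * days * 24 * 60

# 99.99%  (four nines) -> ~52.6 minutes/year
# 99.999% (five nines) -> ~5.3 minutes/year
```

By either budget, a single ~2-hour node-replacement delay on one staging cluster is material — which is why the post leans on the fact that the cluster stayed otherwise operational.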

Caveats

  • Vendor-voice retrospective. Unsigned, published 9 days post-incident; no third-party review.
  • Hindsight-bias acknowledged in-post: "Acknowledging the risk of hindsight bias" — but still vendor-framed.
  • Single-affected-cluster mechanism underspecified. "An uncommon interaction between internal infrastructure components of the cluster" is the closest to a root cause disclosed for the node loss + failed replacement.
  • No quantitative metrics on the tiered-storage error rate elevation — post names "an increase in tiered storage errors, which is not Redpanda's primary storage" and "PUT requests were dominant" but does not disclose error-rate percentile or duration.
  • PSC exception to no-critical-path-dependencies is undersized. The footnote ("Except when Private Service Connect (PSC) is enabled, in this case, the PSC becomes part of the critical path for reading and writing data to Redpanda") is load-bearing but not walked — it's the one deployment shape where the "no external dependencies" claim does not hold.
  • Third-party observability vendor unnamed. The post says "a third-party provider for dashboarding and alerting needs" and later says "the vendor we use for managing cloud marketplaces" was affected by the Cloudflare outage — neither vendor named.
  • Disk-reserve percentage undisclosed: "as a reliability measure, we leave disk space unused and used-but-reclaimable (for caching)" states the invariant but not the sizing policy.
  • Phased-rollout-with-feedback-control-loop mechanism underspecified. The sentence canonicalises the intent but not the implementation (gate metrics, cohort sizing, pause thresholds, rollback procedure).
  • Cell-boundary mechanism underspecified at the Cloud level. Single-binary co-location is well-canonicalised on systems/redpanda; the question of whether different customers' clusters share any cell-level infrastructure (network, control plane, BYOC deployer) is not addressed.
  • CrowdStrike + control-theory closing thoughts are aspirational: "it seems valuable and timely to reconsider our current mindset, and I cannot think of anything better than a systems thinking mindset... which should also result in increased adoption of control theory in our change management tools" — no specific product, protocol, or standard is named.
