PATTERN Cited by 1 source
Cell-based architecture for blast-radius reduction¶
The pattern¶
Partition the service into independent cells (self-contained deployable units with isolated data, compute, and control paths) so that any fault's impact is bounded to at most one cell's worth of users. Customers are routed to cells; no cell depends on another for serving customer traffic.
The pattern is a direct realisation of the cell-based architecture concept — the engineering activity that instantiates the principle in a concrete service. It is canonicalised in AWS's Reducing scope of impact with cell-based architecture whitepaper.
The three minimum components¶
- Cell boundary — the physical/logical unit. Can be a region, an availability zone, a customer VPC, a Kubernetes namespace, or an entire standalone deployment. The boundary must be drawn so that no shared infrastructure crosses cells at the data plane.
- Cell router — the layer that maps requests to cells. Thin, stateless, independently deployed, versioned separately from cell code. The router is the single shared piece; its own reliability must be higher than any single cell.
- Per-cell deployment / config / release — each cell is operated as an independent unit. Rollouts advance cell-by-cell with feedback gating between cells.
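The router component can be sketched as a pure function from customer ID to cell: stateless, deterministic, and identical across router replicas. This is a minimal illustration, not Redpanda's or AWS's implementation; the names (`CELLS`, `route`) are hypothetical.

```python
import hashlib

# Hypothetical cell list; in practice this is rarely-changing config
# deployed to the router independently of cell code.
CELLS = ["cell-a", "cell-b", "cell-c"]

def route(customer_id: str) -> str:
    """Sticky customer-to-cell mapping via a stable hash.

    The router holds no per-request state: its only inputs are the
    customer ID and the cell list, so any replica gives the same answer
    and the router itself can be replicated for higher uptime than any
    single cell.
    """
    digest = hashlib.sha256(customer_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(CELLS)
    return CELLS[index]
```

Because the mapping is a deterministic function of the customer ID, the same customer always lands in the same cell, and no request ever fans out across cells.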
Redpanda's instantiation¶
Canonical verbatim (Source: sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage):
"Redpanda Cloud clusters do not externalize their metadata or any other critical services. All the services needed to write and read data, manage topics, ACLs, and other Kafka entities are co-located, with Redpanda core leading the way with its single-binary architecture. This follows a well-known architectural pattern aimed at reducing the impact radius of failures, which also improves security. We have taken this pattern further and made it a product principle."
Redpanda's cells:
- Intra-cluster: single-binary broker co-locates Kafka API + Schema Registry + Kafka HTTP Proxy + metadata. No per-service RPC fan-out.
- Per-customer cluster: each Redpanda Cloud cluster (especially in BYOC) runs in the customer's own VPC with independent infrastructure.
The payoff during the 2025-06-12 GCP outage: hundreds of customers' clusters were unaffected because they were structurally independent of one another and of any shared Redpanda-cloud metadata service.
Explicit anti-pattern contrast¶
The post names the opposite architectural choice explicitly:
"In contrast, other products boasting centralized metadata and a diskless architecture likely experienced the full weight of this global outage."
The "centralized metadata and diskless" shape trades low per-customer infrastructure footprint for high correlated-failure exposure. A single object-store outage or metadata-service outage affects every customer simultaneously. Cell-based architectures pay higher steady-state cost (N cells to operate, each with its own overhead) for lower correlated-failure risk.
Design decisions for the pattern¶
- Cell sizing. Too small → high operational overhead; too large → insufficient blast-radius reduction. Typical sizing: each cell large enough that no single customer can saturate it, and few enough cells that the team can operate them all.
- Customer-to-cell mapping. Options: sticky (by customer ID hash), load-balanced (by capacity), explicit (by contract tier). Sticky is the most common because it preserves cell-local state (caches, sessions, shard-aware routing).
- Router reliability. The router is the new single-point-of-failure; designing it for higher uptime than the cells themselves is load-bearing. Common tactics: stateless router, DNS-based failover, in-front-of-everything load balancer.
- Cross-cell operations. Aggregate queries, billing, analytics may need cross-cell data. These must be explicitly designed as async / eventually-consistent / tolerant of per-cell failure — otherwise they re-introduce correlation.
- Cell version skew. Rollouts advance cell-by-cell, so at any time cells run different versions. API contracts between cells and their clients must be versioned to handle the skew.
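The rollout discipline behind the last point can be sketched as a gating loop: deploy to one cell, check its health, and only then advance. This is a minimal illustration under assumed interfaces — `deploy_to_cell` and `healthy` are hypothetical stand-ins for real deployment tooling and telemetry.

```python
from typing import Callable

def staged_rollout(cells: list[str],
                   deploy_to_cell: Callable[[str], None],
                   healthy: Callable[[str], bool]) -> list[str]:
    """Advance a release cell-by-cell, gated on per-cell feedback.

    Returns the list of cells that actually received the new version.
    If a cell fails its health check, the rollout stops there, bounding
    the blast radius of a bad build to that one cell.
    """
    deployed: list[str] = []
    for cell in cells:
        deploy_to_cell(cell)
        deployed.append(cell)
        if not healthy(cell):
            # Gate failed: do not advance; later cells keep the old version.
            break
    return deployed
```

Note that during a rollout (and after a halted one) cells run different versions, which is exactly why the API contracts between cells and their clients must tolerate skew.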
Composes with related patterns¶
- patterns/shuffle-sharding — an additional-isolation tactic at the router level: each customer is assigned a different random subset of back-end resources, so one customer's fault can't take down any specific other customer's requests.
- concepts/data-plane-atomicity — the Redpanda invariant (no runtime dependencies) that cell-based architecture operationalises.
- concepts/blast-radius — the reliability property the pattern targets.
- concepts/sharded-failure-domain-isolation — the PlanetScale framing of the same idea at database-sharding granularity.
- concepts/feedback-control-loop-for-rollouts — per-cell canary rollouts use feedback control gating between cells.
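Of these, shuffle sharding lends itself to a compact sketch: assign each customer a deterministic pseudo-random combination of backends, so two customers rarely share their entire shard. The names (`BACKENDS`, `SHARD_SIZE`, `shard_for`) are illustrative assumptions, not from the source.

```python
import hashlib
import itertools

BACKENDS = [f"backend-{i}" for i in range(8)]
SHARD_SIZE = 2  # each customer sees only this many backends

def shard_for(customer_id: str) -> list[str]:
    """Deterministically pick one combination of SHARD_SIZE backends.

    With 8 backends and shards of 2 there are C(8,2) = 28 possible
    shards, so a poison request from one customer exhausts at most its
    own 2 backends, and most other customers still have at least one
    healthy backend in their shard.
    """
    combos = list(itertools.combinations(BACKENDS, SHARD_SIZE))
    digest = hashlib.sha256(customer_id.encode()).digest()
    return list(combos[int.from_bytes(digest[:8], "big") % len(combos)])
```

This composes with the cell pattern: cells bound correlated failure across customers, shuffle sharding bounds it within the shared routing layer.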
Variants¶
- Regional cells: cells map 1:1 with cloud regions. Classic multi-region deployment.
- AZ cells: finer-grain, cells per AZ. Smaller failure domain, higher operational overhead.
- Tenant cells: cells map 1:1 with large enterprise tenants. Common in managed-service products (Redpanda BYOC).
- Sub-tenant cells: small customers share cells sized by aggregate traffic. Common for self-service tiers.
- Functional cells: cells map to service functions (OLTP cell, OLAP cell, batch cell). Used when workloads differ by resource profile.
Anti-patterns¶
- Cross-cell shared state. Any shared database, queue, or cache defeats the isolation guarantee.
- Global configuration flag that affects all cells simultaneously. Undoes cell-by-cell rollout gating.
- Router that does too much. A router that adds business logic adds failure modes; keep it thin.
- Manual cell routing. Customers hard-coded to a specific cell's endpoint create a flag-day cost when cells are rebalanced.
- Cell count that grows with customer count. If cell count is unbounded, operational overhead scales with customers — defeating the per-cell efficiency. Most architectures cap cell count and multi-tenant within each cell.
Caveats¶
- Not a substitute for in-cell reliability. Each cell must itself be fault-tolerant (redundant, replicated, rollback-capable). Cell-based architecture bounds correlated failures across cells, not per-cell failures.
- Steady-state cost is real. Multiplying the number of clusters, control planes, and rollouts is expensive. The trade is bought for blast-radius reduction.
- Operational discipline required. Rollouts must advance cell-by-cell; if operators get impatient and do fleet-wide pushes, the pattern's value evaporates.
- Cell-count disclosure is sparse. Most public vendor posts don't disclose cell count, sizing, or boundaries. Redpanda's post names the pattern but not specific numbers.
- BYOC is a natural cell boundary but not the only one. Non-BYOC Redpanda Cloud Dedicated / Serverless deployments have different cell topologies that the post doesn't explore.
Seen in¶
- sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage — canonical instance: Redpanda Cloud's single-binary + co-located-services cell-per-customer-cluster design credited as the reason the 2025-06-12 GCP outage had zero customer impact across hundreds of clusters. Framed as an AWS-WA-named pattern that Redpanda has "taken further and made a product principle."