Cell-based architecture¶
Definition¶
Cell-based architecture is the design pattern of partitioning a service into multiple independently deployable units ("cells"), each with its own compute, storage, and control path, such that a fault in one cell cannot escape to another. Each cell serves a subset of the workload, and the system routes customer traffic to cells via a thin router. The goal is blast-radius reduction: a software bug, configuration change, dependency outage, or hot spot can affect at most one cell's worth of customers.
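The thin-router idea can be sketched minimally. This is an illustrative assumption, not anything from the source: the cell names, the hash choice, and the fixed fleet size are all made up.

```python
import hashlib

# Illustrative fleet of four cells; a real router would load this from config.
CELLS = ["cell-0", "cell-1", "cell-2", "cell-3"]

def route(customer_id: str) -> str:
    """Deterministically pin a customer to one cell.

    A stable hash keeps the mapping sticky across router restarts, so a
    cell-scoped fault affects the same, bounded 1/N slice of customers.
    """
    digest = hashlib.sha256(customer_id.encode()).digest()
    return CELLS[int.from_bytes(digest[:8], "big") % len(CELLS)]
```

Keeping the router this thin (stateless, deterministic, no per-request cell state) is what stops it from becoming a correlated failure point in its own right.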
Named verbatim in the AWS Well-Architected literature — "Reducing the scope of impact with cell-based architecture" — defined there as:
"A cell-based architecture uses multiple isolated instances of a workload, where each instance is known as a cell. Each cell is independent, does not share state with other cells, and handles a subset of the overall workload requests."
Why the pattern matters¶
- Software bug blast radius. A latent defect in a new version takes down at most one cell. Canary rollout across cells lets you detect it before fleet-wide exposure.
- Cloud-provider dependency outage. If a shared external dependency (object store, managed DB, DNS, metadata service) has a regional outage, only the cells in that region are affected.
- Noisy-neighbor / hot-spot containment. A single customer's runaway workload saturates only its cell — the rest of the fleet is unaffected.
- Capacity-planning unit. Cells are sized + capacity-planned as a unit rather than as the entire service; operators know what a "full cell" means.
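The canary-across-cells mechanic above can be sketched as a loop over the cell fleet. `deploy`, `healthy`, and `rollback` are assumed callbacks for illustration, not a real deployment API:

```python
from typing import Callable, Sequence

def canary_rollout(cells: Sequence[str],
                   deploy: Callable[[str], None],
                   healthy: Callable[[str], bool],
                   rollback: Callable[[str], None]) -> bool:
    """Roll a new version out one cell at a time.

    Because each cell is an independent failure domain, a latent defect
    is detected after exposing at most one cell's worth of customers.
    Returns True on a clean fleet-wide rollout, False if a cell failed
    health checks and was rolled back.
    """
    for cell in cells:
        deploy(cell)
        if not healthy(cell):
            rollback(cell)  # blast radius: this one cell only
            return False
    return True
```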
Canonical Redpanda instance¶
Redpanda's 2025-06-20 GCP-outage retrospective names cell-based architecture explicitly as a Redpanda Cloud design principle, distinct from but complementary to Data Plane Atomicity. Canonical verbatim (Source: sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage):
"Redpanda Cloud clusters do not externalize their metadata or any other critical services. All the services needed to write and read data, manage topics, ACLs, and other Kafka entities are co-located, with Redpanda core leading the way with its single-binary architecture. This follows a well-known architectural pattern aimed at reducing the impact radius of failures, which also improves security. We have taken this pattern further and made it a product principle."
The Redpanda instance has two layers of cell isolation:
- Intra-cluster cell: single-binary Redpanda broker co-locates the Kafka API, Schema Registry, Kafka HTTP Proxy, and metadata — no per-service RPC fan-out within the cluster.
- Per-customer cell: each Redpanda Cloud cluster (especially in BYOC) runs in its own VPC with independent infrastructure — so a GCP outage affecting a subset of regions cannot cascade across customer boundaries.
The retrospective draws an explicit contrast with "other products boasting centralized metadata and a diskless architecture" that "likely experienced the full weight of this global outage."
Relationship to neighboring concepts¶
- concepts/data-plane-atomicity — Data Plane Atomicity is the invariant (no runtime dependencies); cell-based architecture is the deployment shape that makes the invariant achievable. A cell that co-locates all services is structurally incapable of having cross-cell runtime dependencies.
- concepts/blast-radius — cells are a quantised blast-radius unit: an architecture with N cells has a maximum blast radius of 1/N of the fleet per cell-scoped fault.
- concepts/sharded-failure-domain-isolation — the PlanetScale framing of the same idea at database-sharding granularity; cells generalise from DB shards to whole-service shards.
- concepts/isolation-as-fault-tolerance-principle — cell-based architecture is one of the canonical realisations of the principle "small isolated units fail independently."
- patterns/shuffle-sharding — an additional-isolation tactic that can compose with cell-based architecture at the routing layer.
- concepts/static-stability — each cell is designed to operate statically-stable under the failure of its dependencies; cell-based architecture multiplies the property across the fleet.
- concepts/control-plane-data-plane-separation — a cell's data-plane is typically self-contained; the control-plane may be shared across cells or per-cell.
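As a toy illustration of how shuffle sharding composes with cells at the routing layer, the router can assign each customer a deterministic *subset* of cells instead of a single cell. The fleet size and shard size here are arbitrary assumptions:

```python
import hashlib
import itertools

CELLS = list(range(8))   # 8 cells (illustrative)
SHARD_SIZE = 2           # each customer served by a 2-cell shuffle shard

def shuffle_shard(customer_id: str) -> tuple[int, ...]:
    """Hash a customer into one of the C(8, 2) = 28 possible 2-cell shards.

    Most customer pairs share at most one cell, so a poison-pill request
    pattern degrades only the few customers assigned the identical shard.
    """
    shards = list(itertools.combinations(CELLS, SHARD_SIZE))
    digest = hashlib.sha256(customer_id.encode()).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]
```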
Contrast with monolithic / centralised architectures¶
Cell-based architecture is the structural opposite of the "shared centralised metadata" deployment shape typical of early managed-service designs:
| Axis | Cell-based | Centralised-metadata |
|---|---|---|
| Blast radius | 1/N of fleet per cell fault | Fleet-wide for metadata fault |
| Failure-mode correlation | Low (cells are independent) | High (shared metadata is SPOF) |
| Operational complexity | Higher (N cells to manage) | Lower (one service to manage) |
| Routing complexity | Requires cell router | Direct addressing |
| Upgrade risk | Canary cell-by-cell | Fleet-wide on every release |
The trade-off: cell-based designs pay higher steady-state operational complexity to buy a smaller blast radius per failure.
Deployment axes¶
Cell boundaries can map onto one or more of:
- Region / availability zone — the canonical multi-AZ deployment.
- Tenant / customer — each large customer gets its own cell.
- Sub-tenant grouping — small customers bucketed into shared cells; large customers to dedicated cells.
- Functional domain — the service is split into functional cells (e.g. read-cell vs write-cell, OLTP-cell vs OLAP-cell).
- Release channel — separate cells for stable / beta / experimental rollouts.
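The sub-tenant-grouping axis above can be sketched as a placement function. The QPS threshold and cell-naming scheme are purely illustrative assumptions:

```python
import hashlib

LARGE_TENANT_QPS = 50_000  # assumed threshold for granting a dedicated cell

def place_tenant(tenant_id: str, expected_qps: int,
                 shared_cells: list[str]) -> str:
    """Bucket small tenants into shared cells; isolate large ones.

    Large tenants get a dedicated cell (noisy-neighbor containment);
    small tenants hash deterministically into one of the shared cells.
    """
    if expected_qps >= LARGE_TENANT_QPS:
        return f"cell-dedicated-{tenant_id}"
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return shared_cells[int.from_bytes(digest[:8], "big") % len(shared_cells)]
```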
Caveats¶
- Router is the new SPOF. Cell-based architectures centralise risk in the cell router — a routing bug that misroutes traffic can break the pattern's guarantees. Router design (stateless, versioned, independently-deployed) is load-bearing.
- Cross-cell operations break the model. Aggregate queries, cross-tenant features, and global admin actions must reach multiple cells, and each such path re-introduces shared fate.
- Sizing is an art. Cells too small = high operational cost; cells too large = insufficient blast-radius reduction.
- Cell-level feature uniformity is required. When different cells run different software versions, cross-cell compatibility constraints arise that can bite customers who span cells.
- Capacity planning is harder. Capacity stranded in one cell is not trivially reusable by another, so headroom must be provisioned per cell rather than fleet-wide.
Seen in¶
- sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage — canonical Redpanda instance: single-binary + co-located-services cell per customer cluster, cited verbatim as the reason for zero customer impact during the 2025-06-12 GCP global outage.
Related¶
- systems/redpanda
- systems/redpanda-byoc
- concepts/data-plane-atomicity
- concepts/blast-radius
- concepts/sharded-failure-domain-isolation
- concepts/static-stability
- concepts/control-plane-data-plane-separation
- concepts/isolation-as-fault-tolerance-principle
- patterns/cell-based-architecture-for-blast-radius-reduction
- patterns/shuffle-sharding