
CONCEPT Cited by 3 sources

Cell-based architecture

Definition

Cell-based architecture is the design pattern of partitioning a service into multiple independently deployable units ("cells") — each with its own compute, storage, and control path — such that a fault in one cell cannot escape to another. Each cell serves a subset of the workload; the system routes customer traffic to cells via a thin router. The goal is blast-radius reduction: a software bug, configuration change, dependency outage, or hot-spot can affect at most one cell's worth of customers.

Named verbatim in the AWS Well-Architected literature — "Reducing the scope of impact with cell-based architecture" — defined there as:

"A cell-based architecture uses multiple isolated instances of a workload, where each instance is known as a cell. Each cell is independent, does not share state with other cells, and handles a subset of the overall workload requests."
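
The "thin router" in the definition can be sketched as a pure function from customer to cell. This is a minimal illustration, not any vendor's router: the cell names, the `cell_for` and `customers_affected` helpers, and the choice of a SHA-256 modulo mapping are all assumptions made for the example.

```python
import hashlib

# Hypothetical cell names; a real fleet would load these from configuration.
CELLS = ["cell-0", "cell-1", "cell-2", "cell-3"]


def cell_for(customer_id: str, cells: list[str] = CELLS) -> str:
    """Map a customer to exactly one cell using a stable hash.

    hashlib is used instead of the built-in hash() because the built-in
    is randomised per process and would not give a stable mapping.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def customers_affected(failed_cell: str, customers: list[str]) -> list[str]:
    """Customers whose traffic lands on the failed cell; nobody else is hit."""
    return [c for c in customers if cell_for(c) == failed_cell]
```

Modulo hashing remaps customers whenever the cell list changes, so a production router would more plausibly consult an explicit customer-to-cell assignment table. The sketch only shows the property that matters here: every customer resolves to exactly one cell.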

Why the pattern matters

  • Software bug blast radius. A latent defect in a new version blows up at most one cell. Canary rollout across cells lets you detect the defect before fleet-wide exposure.
  • Cloud-provider dependency outage. If a shared external dependency (object store, managed DB, DNS, metadata service) has a regional outage, only the cells in that region are affected.
  • Noisy-neighbor / hot-spot containment. A single customer's runaway workload saturates only its cell — the rest of the fleet is unaffected.
  • Capacity-planning unit. Cells are sized and capacity-planned as a unit rather than as the entire service; operators know what a "full cell" means.
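
The canary rollout mentioned in the first bullet can be sketched as a loop that upgrades one cell at a time and halts at the first unhealthy cell. The `canary_rollout` function and its `deploy` and `healthy` callbacks are hypothetical names for this example; a real pipeline would add bake time, metrics checks, and rollback.

```python
from typing import Callable, Optional


def canary_rollout(
    cells: list[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> tuple[list[str], Optional[str]]:
    """Upgrade cells one at a time and stop at the first unhealthy cell.

    Returns (cells upgraded successfully, the cell that failed or None).
    Cells after the failing one are never touched, so they keep running
    the previous version.
    """
    upgraded: list[str] = []
    for cell in cells:
        deploy(cell)
        if not healthy(cell):
            return upgraded, cell
        upgraded.append(cell)
    return upgraded, None
```

With three cells where the second fails its health check, the third cell never receives the new version, which is the blast-radius bound the bullet describes.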

Canonical Redpanda instance

Redpanda's 2025-06-20 GCP-outage retrospective names cell-based architecture explicitly as its Redpanda Cloud design principle, distinct from but complementary to Data Plane Atomicity. Canonical verbatim (Source: sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage):

"Redpanda Cloud clusters do not externalize their metadata or any other critical services. All the services needed to write and read data, manage topics, ACLs, and other Kafka entities are co-located, with Redpanda core leading the way with its single-binary architecture. This follows a well-known architectural pattern aimed at reducing the impact radius of failures, which also improves security. We have taken this pattern further and made it a product principle."

The Redpanda instance has two layers of cell isolation:

  1. Intra-cluster cell: single-binary Redpanda broker co-locates the Kafka API, Schema Registry, Kafka HTTP Proxy, and metadata — no per-service RPC fan-out within the cluster.
  2. Per-customer cell: each Redpanda Cloud cluster (especially in BYOC) runs in its own VPC with independent infrastructure — so a GCP outage affecting a subset of regions cannot cascade across customer boundaries.

The retrospective explicitly contrasts this design with "other products boasting centralized metadata and a diskless architecture" that "likely experienced the full weight of this global outage."

Relationship to neighboring concepts

  • concepts/data-plane-atomicity — Data Plane Atomicity is the invariant (no runtime dependencies); cell-based architecture is the deployment shape that makes the invariant achievable. A cell that co-locates all services is structurally incapable of having cross-cell runtime dependencies.
  • concepts/blast-radius — cells are a quantised blast-radius unit: an architecture with N equally sized cells has a maximum blast radius of 1/N of the fleet per cell-scoped fault.
  • concepts/sharded-failure-domain-isolation — the PlanetScale framing of the same idea at database-sharding granularity; cells generalise from DB shards to whole-service shards.
  • concepts/isolation-as-fault-tolerance-principle — cell-based architecture is one of the canonical realisations of the principle "small isolated units fail independently."
  • patterns/shuffle-sharding — an additional-isolation tactic that can compose with cell-based architecture at the routing layer.
  • concepts/static-stability — each cell is designed to operate statically-stable under the failure of its dependencies; cell-based architecture multiplies the property across the fleet.
  • concepts/control-plane-data-plane-separation — a cell's data-plane is typically self-contained; the control-plane may be shared across cells or per-cell.
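
The 1/N figure in the blast-radius bullet holds only when cells carry equal shares of the fleet. A small sketch makes the general case explicit; `max_blast_radius` and the idea of measuring cell size as a customer count are assumptions for illustration.

```python
def max_blast_radius(cell_sizes: list[int]) -> float:
    """Worst-case fraction of the fleet hit by a single cell-scoped fault.

    cell_sizes holds the number of customers (or any load unit) per cell.
    The worst case is the largest cell failing.
    """
    total = sum(cell_sizes)
    if total == 0:
        raise ValueError("fleet has no customers")
    return max(cell_sizes) / total
```

Eight equal cells give 1/8 of the fleet; three cells holding 50, 25, and 25 customers give one half, because the largest cell dominates the bound.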

Contrast with monolithic / centralised architectures

Cell-based architecture is the structural opposite of the "shared centralised metadata" deployment shape typical of early managed-service designs:

Axis                     | Cell-based                   | Centralised-metadata
Blast radius             | 1/N of fleet per cell fault  | Fleet-wide for metadata fault
Failure-mode correlation | Low (cells are independent)  | High (shared metadata is SPOF)
Operational complexity   | Higher (N cells to manage)   | Lower (one service to manage)
Routing complexity       | Requires cell router         | Direct addressing
Upgrade risk             | Canary cell-by-cell          | Fleet-wide on every release

The trade-off: cell-based designs pay higher steady-state operational complexity to buy lower blast-radius failures.

Deployment axes

Cell boundaries can map onto one or more of:

  • Region / availability zone — the canonical multi-AZ deployment.
  • Tenant / customer — each large customer gets its own cell.
  • Sub-tenant grouping — small customers bucketed into shared cells; large customers to dedicated cells.
  • Functional domain — the service is split into functional cells (e.g. read-cell vs write-cell, OLTP-cell vs OLAP-cell).
  • Release channel — separate cells for stable / beta / experimental rollouts.
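
The tenant and sub-tenant axes above can be combined in one placement routine: large customers get dedicated cells and small ones are packed into shared cells. This is a sketch under stated assumptions. The `assign_cells` function, the cell naming scheme, the load units, and the first-fit packing are all illustrative, and it assumes `dedicated_threshold` is no larger than `shared_capacity`.

```python
def assign_cells(
    loads: dict[str, int],
    dedicated_threshold: int,
    shared_capacity: int,
) -> dict[str, list[str]]:
    """Place customers into dedicated or shared cells.

    Customers at or above dedicated_threshold get their own cell.
    The rest are packed first-fit, largest first, into shared cells
    that each hold at most shared_capacity units of load.
    """
    cells: dict[str, list[str]] = {}
    shared_used: list[int] = []  # load already placed in each shared cell
    # Sort by descending load, then name, so placement is deterministic.
    for name, load in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        if load >= dedicated_threshold:
            cells[f"dedicated-{name}"] = [name]
            continue
        for i, used in enumerate(shared_used):
            if used + load <= shared_capacity:
                shared_used[i] += load
                cells[f"shared-{i}"].append(name)
                break
        else:
            # No existing shared cell has room: open a new one.
            shared_used.append(load)
            cells[f"shared-{len(shared_used) - 1}"] = [name]
    return cells
```

The threshold is the knob that trades isolation for cost: lowering it gives more customers a private blast radius and raises the number of cells to operate.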

Caveats

  • Router is the new SPOF. Cell-based architectures centralise risk in the cell router — a routing bug that misroutes traffic can break the pattern's guarantees. Router design (stateless, versioned, independently-deployed) is load-bearing.
  • Cross-cell operations break the model. Aggregate queries, cross-tenant features, global admin actions require reaching multiple cells — each reach is a re-introduction of shared-fate.
  • Sizing is an art. Cells too small = high operational cost; cells too large = insufficient blast-radius reduction.
  • Cell-level feature uniformity is required. Cells running different software versions create cross-cell compatibility constraints that can bite customers who span cells.
  • Capacity planning is harder. Spare capacity stranded in one cell is not trivially reusable by other cells.
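
The stranded-capacity caveat can be shown with a few lines of arithmetic: free capacity adds up across cells, but a workload has to fit inside one cell. The helper names `can_place` and `stranded_for` and the abstract capacity units are assumptions for this sketch.

```python
def can_place(free_per_cell: list[int], demand: int) -> bool:
    """A workload fits only if a single cell has enough free capacity."""
    return any(free >= demand for free in free_per_cell)


def stranded_for(free_per_cell: list[int], demand: int) -> int:
    """Free capacity sitting in cells too small to host this demand."""
    return sum(free for free in free_per_cell if free < demand)
```

Three cells with 10 free units each hold 30 units in total, yet a tenant needing 25 units cannot be placed in any of them; all 30 units are stranded with respect to that request.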

Seen in

  • sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services — canonical wiki instance of cell-based architecture with AWS accounts as cells in a three-level hierarchy (tier → cell → infra group). Each cell is an AWS account; cells compose via Route 53 weighted routing into tiers. First wiki instance of the cell-as-AWS-account pattern composed with in-account cluster-per-tenant isolation. The cell boundary here is bounded by AWS account-level quotas (ENIs, VPC endpoints) — different quota regime from Redpanda's per-customer-cluster cells. "AWS account limits on Elastic Network Interfaces (ENIs) and VPC endpoints constrain how many load balancers fit in a single account. This three-level hierarchy gives you two independent scaling levers to address each constraint — add infra groups to scale within an account and add cells to scale across accounts." See patterns/tier-cell-infra-group-hierarchy.
  • sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage — canonical Redpanda instance: single-binary + co-located-services cell per customer cluster, cited verbatim as the reason for zero-customer-impact during the 2025-06-12 GCP global outage.

  • sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures — canonical wiki production-AZ-outage instance for cells as a serverless-database regional composition unit. Verbatim: "Rather than running a single monolithic regional deployment, Lakebase composes a region from one or more identically shaped cells. A cell is a complete, self-contained slice of the Neon and Lakebase stack: Kubernetes, control plane, compute, and storage." The cell composition has two simultaneous purposes: (1) scaling unit: "To grow a region, we add another cell. When an existing Cell approaches scalability limits of Kubernetes and control plane, new project creation is routed to a freshly provisioned Cell" — bounds the per-cell scaling pressure on Kubernetes + control-plane substrates; (2) blast-radius containment unit: "Even with thorough testing and built-in protections, things still go wrong in production - Kubernetes control plane/system services trouble, code or config regressions, DoS situations etc. The cell boundary isolates faults and prevents the situation from spreading, leaving the other Cells in the region serving traffic normally." Production AZ-outage validation on 2026-05-08: "During an incident on May 8, 2026, when AWS experienced issues with an Availability Zone in us-east-1, one of the cells had issues failing over to healthy nodes. The impact was contained to that cell. The other seven cells in the region failed over correctly, so the incident affected only ~13% of databases in the region. In this case, the cell-based architecture reduced the impact by roughly an order of magnitude." The canonical-instance datums: 8 cells in us-east-1 (~13% impact = ~1/8 = the cell ratio); 7 cells failed-over correctly + 1 imperfectly during the AZ event; ~order-of-magnitude blast-radius reduction vs the monolithic-regional alternative.
    New composability with the scaling axis distinct from Redpanda's customer-cluster cells (which are scaling-bounded by AWS-account quotas, not by Kubernetes control-plane) and from AWS Hybrid Multi-Tenant cells (where AWS-account is the cell). Lakebase cells are at the intra-Kubernetes / intra-cloud-account altitude — multiple cells live in a single regional AWS account; the cell boundary is the Kubernetes cluster + control-plane + compute + storage stack. Three new wiki canonicalisations: (a) regional cell composition with cells as scaling-unit-and-blast-radius-unit jointly; (b) production AZ-outage-attainment-test of cell isolation with quantified ~13% / ~order-of-magnitude impact reduction; (c) integration with whole-AZ network-partition drill regime: concepts/whole-az-network-partition-simulation + patterns/whole-az-network-partition-drill operate on one cell at a time so the drill blast-radius is itself cell-bounded.
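
The Lakebase scaling rule quoted above, where new project creation goes to a freshly provisioned cell once existing cells approach their limits, can be sketched as follows. This is not Lakebase's implementation: the `Cell` and `Region` classes, the `soft_limit` project ceiling, and the first-cell-with-room policy are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    name: str
    projects: int = 0


@dataclass
class Region:
    soft_limit: int  # hypothetical per-cell project ceiling
    cells: list[Cell] = field(default_factory=list)

    def place_new_project(self) -> Cell:
        """Route a new project to a cell with headroom, else add a cell.

        Existing projects are never moved; only new placements are
        steered away from cells that have reached the limit.
        """
        for cell in self.cells:
            if cell.projects < self.soft_limit:
                cell.projects += 1
                return cell
        fresh = Cell(name=f"cell-{len(self.cells)}")
        self.cells.append(fresh)
        fresh.projects += 1
        return fresh
```

Because growth adds cells rather than enlarging them, the same mechanism that scales the region also keeps each cell's share of the region's databases bounded.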
