
PATTERN

Horizontally Scale Stateful Tier via Pairs

Shape

When you already have a stateful active/standby pair that works well but is hitting its storage or capacity ceiling, scale horizontally by running N pairs rather than redesigning the pair. Each pair keeps its internal active/standby replication (DRBD, semi-sync MySQL, etc.) exactly as before; you add a routing layer in front that assigns each unit of data (site, shard, tenant) to one pair and keeps track of that assignment.

routing tier (stateless) ─── assigns data-unit → pair
        ├── pair 1 (active + standby, same as before)
        ├── pair 2 (active + standby, same as before)
        ├── pair 3 (active + standby, same as before)
        └── ...
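
A minimal sketch of the routing tier's data model, assuming an in-memory routing table for illustration (the Pair/RoutingTable names are hypothetical; a real deployment keeps the table in a central DB, per "What it assumes" below):

    from dataclasses import dataclass, field

    @dataclass
    class Pair:
        # One active/standby pair; its internal replication and failover
        # (DRBD, semi-sync MySQL, ...) are unchanged by this pattern.
        name: str
        active: str   # address of the active node
        standby: str  # address of the standby node

    @dataclass
    class RoutingTable:
        # Maps each data unit (site, shard, tenant) to exactly one pair.
        pairs: dict[str, Pair] = field(default_factory=dict)
        assignment: dict[str, str] = field(default_factory=dict)  # unit -> pair name

        def lookup(self, unit: str) -> Pair:
            # Every request for `unit` goes to its assigned pair's active node.
            return self.pairs[self.assignment[unit]]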

Why it works

  • Preserve known-good internals. The active/standby pair is something you've already debugged in production — failover, replication, tooling, runbooks. Keeping it intact means the change is additive (a routing layer in front) rather than a replacement (swapping out the storage layer). See GitHub's explicit report: "we were even able to reuse large parts of our configuration and tooling for the old Pages infrastructure on these new fileserver pairs due to this similarity".
  • Capacity scales per-pair. Adding a pair adds storage + throughput linearly. The single-machine ceiling is gone.
  • Blast radius stays bounded per-pair. A pair failure is still an active/standby failover, impacting only the data units assigned to that pair. You haven't traded clean fault-isolation for scale.

What it assumes

  • The workload partitions cleanly — each unit of data lives on exactly one pair. Cross-pair transactions and queries aren't in the design. Static-site hosting, user-scoped data, tenant-scoped data, and any sharded OLTP workload fit this shape.
  • You have a place to store the routing table — typically a central DB or config service that the routing tier consults. See patterns/db-routed-request-proxy.
  • You have a place to store the pair-assignment decision — when a new data unit appears, something has to pick a pair. Usually a placement service or a simple "least-full pair" heuristic; a sketch of that heuristic follows this list.
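
A hedged sketch of the "least-full pair" heuristic, reusing the RoutingTable sketch above. The place function and used_bytes map are illustrative assumptions, not from the source:

    def place(unit: str, table: RoutingTable, used_bytes: dict[str, int]) -> Pair:
        # Pick the least-full pair and record the decision. `used_bytes`
        # (pair name -> current utilization) would come from a placement
        # service or monitoring; the assignment would be written to the
        # central routing DB rather than an in-memory dict.
        target = min(table.pairs, key=lambda name: used_bytes.get(name, 0))
        table.assignment[unit] = target  # the durable placement decision
        return table.pairs[target]

Because rebalancing is expensive (see below), this first placement is the one you want to get right.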

Key operational properties

  • Capacity planning = pair count × per-pair capacity. Predictable; you rehearse failure of one pair and you know the blast radius.
  • Rebalancing is hard. Moving a data unit between pairs requires data migration + routing-table update + cutover (see the migration sketch after this list). This is the operational cost of the pattern — and the reason you want placement to be right on first assignment.
  • Heterogeneous pairs are fine. Some pairs can be on newer hardware than others; the routing tier doesn't care.
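
A sketch of what "data migration + routing-table update + cutover" means as a sequence, reusing the RoutingTable sketch above. Every helper here is an illustrative stub (rsync jobs, write fences in the routing tier, storage cleanup), not a real API:

    # Illustrative stubs for the underlying operations.
    def copy_unit_data(unit, src, dst): ...   # bulk copy, e.g. an rsync job
    def copy_unit_delta(unit, src, dst): ...  # catch-up copy of recent writes
    def freeze_writes(unit): ...              # write fence at the routing tier
    def unfreeze_writes(unit): ...
    def delete_unit_data(unit, node): ...     # reclaim space on the old pair

    def rebalance(unit: str, table: RoutingTable, dst_pair: str) -> None:
        # Move one data unit between pairs. Each step carries operational
        # cost, and a failure mid-sequence needs careful cleanup — which is
        # why the pattern pushes you to place units correctly up front.
        src = table.lookup(unit)
        dst = table.pairs[dst_pair]
        copy_unit_data(unit, src.active, dst.active)   # 1. bulk migration
        freeze_writes(unit)                            # 2. brief write pause
        copy_unit_delta(unit, src.active, dst.active)  # 3. catch-up copy
        table.assignment[unit] = dst_pair              # 4. routing-table update
        unfreeze_writes(unit)                          # 5. cutover complete
        delete_unit_data(unit, src.active)             # 6. reclaim old copy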

Canonical wiki instance

GitHub Pages — 2015 rearchitecture. The pre-2015 single active/standby pair served "thousands of requests per second to over half a million sites" and was fine except for the storage ceiling. The rewrite added a DB-routed request proxy layer in front and kept the per-pair design unchanged — Dell R720s in active/standby with DRBD synchronously replicating 8 partitions, nginx resolving each site's document root from the X-GitHub-Pages-Root header, and the same tooling GitHub used on the pre-2015 pair. Source: "each pair is largely similar to the single pair of machines that the old Pages infrastructure ran on." Source: sources/2025-09-02-github-rearchitecting-github-pages.
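
The proxy's routing step, sketched under the same assumptions: the function name and routing_db shape are hypothetical, and in GitHub's design the lookup hits MySQL rather than a dict; only the X-GitHub-Pages-Root header name comes from the source.

    def route_pages_request(host: str,
                            routing_db: dict[str, tuple[str, str]]) -> tuple[str, dict[str, str]]:
        # routing_db: site host -> (active fileserver address, on-disk root).
        # The proxy forwards the request to the pair's active node and passes
        # the per-site document root in a header nginx uses to serve the site.
        upstream, pages_root = routing_db[host]
        return upstream, {"X-GitHub-Pages-Root": pages_root}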

Trade-offs vs. alternatives

  • vs. replace with a distributed store (S3, Ceph, etc.) — distributed stores scale further but introduce a new, more complex failure mode (consistency protocols, repair queues, erasure-coding semantics). Pair-based horizontal scaling is the right choice when your per-pair design is good and your workload partitions cleanly; you keep the operational model you already know.
  • vs. one big machine — the path you just left; no longer viable once you hit the single-machine ceiling.
  • vs. active/active with per-site routing — trades standby idleness for full capacity, but introduces conflict-resolution complexity. Pair-based is simpler at the cost of burning half the fleet on standby.