PATTERN Cited by 1 source
Horizontally Scale Stateful Tier via Pairs¶
Shape¶
When you already have a stateful active/standby pair that works well but is hitting a single-pair storage / capacity ceiling, scale horizontally by running N pairs rather than redesigning the pair. Each pair keeps its internal active/standby replication (DRBD, semi-sync MySQL, etc.) exactly as before; you add a routing layer in front that assigns each unit of data (site, shard, tenant) to one pair and keeps track of the assignment.
routing tier (stateless) ─── assigns data-unit → pair
├── pair 1 (active + standby, same as before)
├── pair 2 (active + standby, same as before)
├── pair 3 (active + standby, same as before)
└── ...
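The routing tier's job can be sketched in a few lines: a stateless lookup that maps a data unit to the pair currently serving it. This is an illustrative sketch, not the source's implementation; the routing table would live in a central DB or config service in production, and a plain dict plus the names below are assumptions.

```python
# Hypothetical routing-table contents: data unit -> assigned pair.
# In production this mapping lives in a central DB; a dict stands in here.
ROUTING_TABLE = {
    "alice.example.io": "pair-1",
    "bob.example.io": "pair-2",
}

def route(data_unit: str) -> str:
    """Return the active/standby pair assigned to this data unit.

    The routing tier itself holds no state beyond this lookup, which is
    why it can be scaled and replaced independently of the pairs.
    """
    try:
        return ROUTING_TABLE[data_unit]
    except KeyError:
        # A unit with no assignment needs placement first (see below).
        raise LookupError(f"no pair assigned for {data_unit!r}")
```

Because the lookup is the only coupling between the tiers, the routing layer can be restarted or scaled out without touching any pair.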
Why it works¶
- Preserve known-good internals. The active/standby pair is something you've already debugged in production — failover, replication, tooling, runbooks. Keeping it intact means the change is additive (a routing layer in front) rather than a replacement of the storage layer itself. See GitHub's explicit report: "we were even able to reuse large parts of our configuration and tooling for the old Pages infrastructure on these new fileserver pairs due to this similarity".
- Capacity scales per-pair. Adding a pair adds storage + throughput linearly. The single-machine ceiling is gone.
- Blast radius stays bounded per-pair. A pair failure is still an active/standby failover, impacting only the data units assigned to that pair. You haven't traded clean fault-isolation for scale.
What it assumes¶
- The workload partitions cleanly — each unit of data lives on exactly one pair. Cross-pair transactions / queries aren't in the design. Static-site hosting, user-scoped data, tenant-scoped data, and any sharded OLTP workload fit this shape.
- You have a place to store the routing table — typically a central DB or config service that the routing tier consults. See patterns/db-routed-request-proxy.
- You have a place to store the pair-assignment decision — when a new data unit appears, something has to pick a pair. Usually a placement service or a simple "least-full pair" heuristic.
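The placement decision itself can be as simple as the "least-full pair" heuristic mentioned above. A minimal sketch, assuming a map of per-pair usage and a fixed per-pair ceiling; the function name and interface are hypothetical.

```python
def place(pairs: dict[str, int], capacity: int) -> str:
    """Pick the least-full pair with headroom for a new data unit.

    `pairs` maps pair name -> storage currently used (e.g. bytes);
    `capacity` is the per-pair ceiling. Hypothetical interface.
    """
    # Only consider pairs that still have headroom.
    candidates = {name: used for name, used in pairs.items() if used < capacity}
    if not candidates:
        # The pattern's answer to running out of room: add another pair.
        raise RuntimeError("all pairs at capacity; add a new pair")
    # Least-full wins; getting this right up front matters because
    # rebalancing later is expensive (see operational properties).
    return min(candidates, key=candidates.get)
```

A real placement service might also weigh hardware generation or expected growth of the unit, but the shape is the same: one decision, recorded once, consulted on every request.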
Key operational properties¶
- Capacity planning = pair count × per-pair capacity. Predictable; you rehearse failure of one pair and you know the blast radius.
- Rebalancing is hard. Moving a data unit between pairs requires data migration + routing-table update + cutover. This is the operational cost of the pattern — and the reason you want placement to be right on first assignment.
- Heterogeneous pairs are fine. Some pairs can be on newer hardware than others; the routing tier doesn't care.
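The rebalancing cost called out above comes from the three-step dance it implies: copy the data, flip the routing-table entry, verify the cutover. A sketch of that sequence, with the migration and verification steps stubbed out as callables; all names here are illustrative, not from the source.

```python
def rebalance(unit: str, dst: str, routing_table: dict,
              copy_data, verify) -> None:
    """Move one data unit to pair `dst`: copy, cut over, verify.

    `copy_data(unit, src, dst)` performs the bulk migration while the
    source pair still serves traffic; `verify(unit, dst)` checks that
    the destination can serve the unit. Both are hypothetical hooks.
    """
    src = routing_table[unit]
    copy_data(unit, src, dst)      # 1. bulk copy; src keeps serving
    routing_table[unit] = dst      # 2. cutover: one routing-table write
    if not verify(unit, dst):      # 3. confirm dst serves before cleanup
        routing_table[unit] = src  # roll back the cutover on failure
        raise RuntimeError(f"cutover verification failed for {unit!r}")
    # Only after verification would src's copy be garbage-collected.
```

A real system would also need a brief write-freeze or delta sync between steps 1 and 2 so the copy is consistent at cutover — which is exactly why first-time placement is worth getting right.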
Canonical wiki instance¶
GitHub Pages — 2015 rearchitecture.
The pre-2015 single active/standby pair served "thousands of requests per second to over half a million sites" and was fine except for the storage ceiling. The rewrite added a DB-routed request proxy layer in front and kept the per-pair design unchanged — Dell R720s in active/standby with DRBD synchronously replicating 8 partitions, nginx document root = X-GitHub-Pages-Root, the same tooling GitHub used on the pre-2015 pair. Source: "each pair is largely similar to the single pair of machines that the old Pages infrastructure ran on."
Source: sources/2025-09-02-github-rearchitecting-github-pages.
Trade-offs vs. alternatives¶
- vs. replace with a distributed store (S3, Ceph, etc.) — distributed stores scale further but introduce a new, more complex failure mode (consistency protocols, repair queues, erasure-coding semantics). Pair-based horizontal scaling is the right choice when your per-pair design is good and your workload partitions cleanly; you keep the operational model you already know.
- vs. one big machine — the path you just left; no longer viable once you hit the single-machine ceiling.
- vs. active/active with per-site routing — trades standby idleness for full capacity, but introduces conflict-resolution complexity. Pair-based is simpler at the cost of burning half the fleet on standby.
Related¶
- concepts/active-standby-replication — the pair shape this pattern scales out.
- concepts/synchronous-block-replication — replication semantic inside the pair (DRBD, canonically).
- systems/drbd, systems/github-pages — canonical stack.
- patterns/db-routed-request-proxy — required partner pattern for the routing layer.