Rearchitecting GitHub Pages¶
Summary¶
Around early 2015, GitHub Pages outgrew its
single-machine active/standby origin. The original design ran the
entire service on a single pair of machines with user data spread
across 8 DRBD-backed partitions and an
nginx map file — hostname → on-disk path — that was
regenerated by a cron every 30 minutes. Problems: (a) 30-minute
publish latency for new sites, (b) long nginx cold-start while loading
the map from disk, (c) storage capped at what fit in one machine.
The 2015 rearchitecture keeps the design philosophy — simple
components we understand, don't prematurely solve problems that aren't
yet problems — but splits the stack into two tiers: a stateless
nginx-based routing tier on Dell C5220s that looks
up the destination fileserver pair in MySQL per
request via an ngx_lua script, then proxy_passes
to a stateful fileserver tier of Dell R720 pairs (still
active/standby, still DRBD-replicated). Publishes are
now instant because the routing table is live. Cold-restart problem
gone because nginx isn't loading a giant pre-generated map. Storage
scales horizontally by adding fileserver pairs.
The MySQL dependency is a deliberate availability trade-off, mitigated
four ways: (1) per-request retries against different read replicas on
query error; (2) 30-second in-memory cache of routing lookups in
nginx shared memory to tolerate MySQL blips; (3) reads go to replicas
so master failovers don't affect Pages; (4) Fastly
sits in front caching all 200 responses, so even a total router
outage leaves cached sites online. Performance: < 3 ms in Lua per
request at the 98th percentile across millions of requests/hour,
including external network calls. Production since January 2015 at the
time of writing.
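The routing tier's nginx core can be reconstructed from the fragments quoted in the post. The variable names, the `access_by_lua_file` path, the `X-GitHub-Pages-Root` header, and the `proxy_pass` line match the post; the surrounding `server`/`location` scaffolding is assumed:

```nginx
server {
    listen 80;

    location / {
        # filled in by the Lua router on every request
        set $gh_pages_host "";
        set $gh_pages_path "";

        # per-request routing decision: look up which fileserver pair
        # hosts this site in a MySQL read replica
        access_by_lua_file /data/pages-lua/router.lua;

        # tell the fileserver which on-disk root to serve from
        proxy_set_header X-GitHub-Pages-Root $gh_pages_path;

        # hand off to nginx's stock proxy machinery
        proxy_pass http://$gh_pages_host$request_uri;
    }
}
```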
Key takeaways¶
- Pre-state — single pair, 8 DRBD partitions, 30-minute nginx map regen — was the original architecture. "Even as Pages grew to serve thousands of requests per second to over half a million sites" the simple design held up, until the storage ceiling, publish latency, and cold-restart cost forced a rewrite.
- Routing decision moved from static map to per-request DB lookup. The new frontend queries one of GitHub's MySQL read replicas via an `ngx_lua` script to look up which backend fileserver pair hosts a given Pages site, then uses stock `proxy_pass` to forward. Canonical instance of patterns/db-routed-request-proxy.
- nginx config is strikingly small. The production config core is ~8 lines: set `$gh_pages_host` + `$gh_pages_path`, run the Lua router in `access_by_lua_file`, set `X-GitHub-Pages-Root`, `proxy_pass http://$gh_pages_host$request_uri` (verbatim from post). ngx_lua's integration with nginx means they "reuse nginx's rock-solid proxy functionality rather than reinventing that particular wheel on our own."
- Availability cost of the MySQL dependency is explicit and mitigated four ways. (a) query retries reconnect to a different read replica on error; (b) nginx shared memory zones cache routing lookups on the `pages-fe` node for 30 seconds to reduce MySQL load and absorb blips; (c) reads go to replicas so master failovers don't take Pages down — "existing Pages will remain online even during database maintenance windows where we have to take the rest of the site down"; (d) Fastly in front caches all `200` responses so cached sites survive a total router outage — canonical instance of patterns/cdn-in-front-for-availability-fallback.
- Fileserver tier is unchanged in shape, only sharded. Each pair of Dell R720s runs the same active/standby DRBD-synchronous-replication setup as the old single pair, and GitHub was "able to reuse large parts of our configuration and tooling" for it. The migration is structurally patterns/horizontally-scale-stateful-tier-via-pairs — horizontal scaling achieved by adding pairs, not by rearchitecting the pair.
- Fileserver nginx is trivially simple. The backend sets document root to `$http_x_github_pages_root` (after "a little bit of validation to thwart any path traversal attempts"). All the routing state lives at the frontend.
- Performance numbers disclosed. Less than 3 ms of each request is spent in Lua at the 98th percentile, including time spent in external network calls, across millions of HTTP requests per hour. The measurement boundary is the Lua pipeline as a whole — MySQL lookup included — which makes the < 3 ms a load-bearing datum for DB-routed proxy viability at scale.
- Instant publishes + no cold-restart + horizontal storage scaling are the three operational wins. New Pages sites publish as soon as the MySQL row is written; nginx no longer loads a giant map on restart; adding fileserver pairs extends storage capacity without rebuilding the routing frontend.
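The cache-plus-retry mitigations can be sketched in ngx_lua terms. Only the shared-memory-zone mechanism, the 30-second TTL, and the different-replica retry come from the post; the real `router.lua` was never published, and `pages_routing`, `replicas`, and `lookup_route` below are invented for illustration:

```nginx
http {
    # shared memory zone for routing lookups (name and size are invented)
    lua_shared_dict pages_routing 32m;

    server {
        location / {
            set $gh_pages_host "";
            set $gh_pages_path "";

            access_by_lua_block {
                local cache = ngx.shared.pages_routing
                local key   = ngx.var.host

                -- (b) 30-second cache absorbs MySQL blips and cuts query load
                local hit = cache:get(key)
                if hit then
                    ngx.var.gh_pages_host, ngx.var.gh_pages_path =
                        hit:match("([^|]+)|(.+)")
                    return
                end

                -- (a) on query error, reconnect to a different read replica;
                -- `replicas` and `lookup_route` are hypothetical helpers
                local host, path
                for _, replica in ipairs(replicas) do
                    host, path = lookup_route(replica, key)
                    if host then break end
                end
                if not host then
                    return ngx.exit(ngx.HTTP_SERVICE_UNAVAILABLE)
                end

                cache:set(key, host .. "|" .. path, 30)  -- 30 s TTL
                ngx.var.gh_pages_host = host
                ngx.var.gh_pages_path = path
            }

            proxy_set_header X-GitHub-Pages-Root $gh_pages_path;
            proxy_pass http://$gh_pages_host$request_uri;
        }
    }
}
```

Mitigations (c) and (d) live outside this config: replica-only reads are a property of which MySQL hosts the router connects to, and the Fastly fallback sits in front of the whole tier.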
Architectural shape¶
client → Fastly (CDN, caches all 200s)
→ LB → pages-fe (nginx + ngx_lua router on C5220)
│ access_by_lua_file /data/pages-lua/router.lua
│ (queries MySQL read replica, caches in shared
│ memory zone for 30s, retries on different
│ replica on error)
│ sets $gh_pages_host + $gh_pages_path
│ sets X-GitHub-Pages-Root = $gh_pages_path
↓ proxy_pass http://$gh_pages_host$request_uri
pages-fs pair (Dell R720, active/standby,
DRBD sync replication across 8 partitions,
nginx document root = X-GitHub-Pages-Root,
path-traversal validation)
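The fileserver side of the diagram reduces to a few lines of nginx config. The header name comes from the post; the `/data/pages` prefix and the validation regexes are assumptions standing in for the "little bit of validation" the post mentions without specifying:

```nginx
server {
    listen 80;

    # reject roots that are not plain absolute paths under the pages
    # storage mount (illustrative check; the post does not specify one)
    if ($http_x_github_pages_root !~ "^/data/pages/[A-Za-z0-9._/-]+$") {
        return 400;
    }
    # belt and braces: no parent-directory traversal
    if ($http_x_github_pages_root ~ "\.\.") {
        return 400;
    }

    # serve static files from the root chosen by the routing tier
    root $http_x_github_pages_root;
}
```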
Key properties of the pre/post split:
| Dimension | 2014 (pre) | 2015 (post) |
|---|---|---|
| Routing primitive | 30-min regenerated nginx map file | per-request MySQL read-replica lookup, 30 s cache |
| Storage ceiling | SSDs that fit in one machine | N × fileserver pair |
| Publish latency | up to 30 min | instant |
| nginx restart cost | high (load entire map) | normal |
| Availability dependency | local disk only | MySQL read replicas + Fastly front cache |
| Blast radius on router outage | total | Fastly-cached 200s survive |
Numbers disclosed¶
- < 3 ms in Lua per request at the p98, including external network calls (MySQL).
- Millions of HTTP requests per hour fleet-wide at time of post.
- 30-second shared-memory cache TTL on routing lookups at `pages-fe` nodes.
- 30-minute map regeneration cadence on the pre-2015 architecture (the main operational gripe).
- Thousands of requests per second to over half a million sites served by the pre-2015 single-pair deployment.
- 8 DRBD-backed partitions per fileserver pair (unchanged pre → post).
- Hardware: Dell C5220 for frontend routers, Dell R720 for fileserver pairs (2015 hardware vintage).
- Production since January 2015; post dated 2025-09-02 (text is an adaptation of a 2015 post; the 10-year-later republish on github.blog is why `published` is 2025).
Numbers NOT disclosed¶
- Per-site origin-hit rate vs. Fastly cache-hit rate (no absolute offload ratio).
- Number of fileserver pairs in production at the time of the post.
- Routing-table cardinality (user count × site count — "over half a million sites" pre-2015, but not post-2015).
- MySQL query rate at the pages-fe tier after the 30 s cache.
- MySQL read-replica count + topology.
- Failure rates during MySQL blips + frequency those occur.
- Fastly cache hit-ratio under a router-outage scenario — the mitigation is structural, not measured.
- Tail latency above p98.
Caveats¶
- Post is an adaptation of GitHub's 2015 piece republished on github.blog in 2025; the architecture described is 10 years old. GitHub Pages has almost certainly evolved since — newer Fastly / routing / storage generations are not in scope. Ingest it as a historical-architecture datum, not current state.
- No load-test methodology disclosed behind the < 3 ms p98 number.
- No rollout-discipline details — how did GitHub migrate existing sites off the pre-2015 pair onto fileserver pairs? Blue-green / per-site cutover? Not covered.
- DRBD's synchronous-replication trade-offs (latency ceiling on writes, split-brain handling, quorum semantics) not discussed.
- Path-traversal validation on the fileserver ("a little bit of validation") is gestured at, not specified — no security posture statement.
- Fastly-as-fallback property only holds for previously cached pages and `200` responses; freshly published or non-200 paths aren't covered by the outage tolerance.
- "Availability dependency on MySQL" is framed as accepted; the actual observed Pages availability impact from MySQL incidents is not quantified.
Relationship to existing wiki¶
This post seeds the wiki's first canonical GitHub Pages entry + the first nginx / ngx_lua / DRBD / Fastly system pages. Complements existing MySQL coverage — previously framed as the start-small RDBMS that outgrew its workload shape (Canva case); now a second shape canonicalised: read-replica lookup inside a per-request routing decision, cache-softened and CDN-fronted. The post also connects to the wiki's broader routing corpus (Fly's fly-proxy, RIB/FIB ideas) at the DB-routed-proxy layer — GitHub Pages chose "read DB per request" where Fly chose "gossip state into every node's RIB then FIB-cache" — two different points on the routing-state-distribution axis.
No contradictions with existing wiki claims. Extends GitHub with a new pre-Engineering-blog-era deep-dive rooted in 2015 infrastructure.
Source¶
- Original: https://github.blog/news-insights/rearchitecting-github-pages/
- Raw markdown:
raw/github/2025-09-02-rearchitecting-github-pages-2015-a7a7020e.md
Related¶
- systems/github-pages — the system this post rearchitects.
- systems/nginx — the web server at both tiers of the new stack.
- systems/ngx-lua — the embedded Lua runtime that hosts the routing decision inside nginx's request lifecycle.
- systems/drbd — synchronous block replication between the active and standby fileserver in each pair.
- systems/fastly — CDN fronting Pages; availability fallback for cached `200`s during a router outage.
- systems/mysql — routing table; read replicas queried per request.
- concepts/active-standby-replication — fileserver HA shape, preserved pre → post.
- concepts/synchronous-block-replication — DRBD's replication semantic.
- patterns/db-routed-request-proxy — per-request destination lookup in a database instead of a static map.
- patterns/cdn-in-front-for-availability-fallback — cached 200s survive an origin-routing outage.
- patterns/cached-lookup-with-short-ttl — 30 s shared-memory cache on hot routing lookups.
- patterns/horizontally-scale-stateful-tier-via-pairs — scale the active/standby fileserver tier by adding pairs, not by redesigning the pair.
- companies/github — GitHub company page; this post lives under the platform-infrastructure lineage.