
GITHUB 2025-09-02 Tier 2


Rearchitecting GitHub Pages

Summary

Around early 2015, GitHub Pages outgrew its single-machine active/standby origin. The original design ran the entire service on a single pair of machines with user data spread across 8 DRBD-backed partitions and an nginx map file — hostname → on-disk path — that was regenerated by a cron every 30 minutes. Problems: (a) 30-minute publish latency for new sites, (b) long nginx cold-start while loading the map from disk, (c) storage capped at what fit in one machine.

The 2015 rearchitecture keeps the design philosophy — simple components we understand, don't prematurely solve problems that aren't yet problems — but splits the stack into two tiers: a stateless nginx-based routing tier on Dell C5220s that looks up the destination fileserver pair in MySQL per request via an ngx_lua script, then proxy_passes to a stateful fileserver tier of Dell R720 pairs (still active/standby, still DRBD-replicated). Publishes are now instant because the routing table is live. Cold-restart problem gone because nginx isn't loading a giant pre-generated map. Storage scales horizontally by adding fileserver pairs.

The MySQL dependency is a deliberate availability trade-off, mitigated four ways: (1) per-request retries against different read replicas on query error; (2) 30-second in-memory cache of routing lookups in nginx shared memory to tolerate MySQL blips; (3) reads go to replicas so master failovers don't affect Pages; (4) Fastly sits in front caching all 200 responses, so even a total router outage leaves cached sites online. Performance: < 3 ms spent in Lua per request at the 98th percentile across millions of requests/hour, including external network calls. In production since January 2015 as of the time of writing.
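Mitigations (1) and (2) — check a short-TTL cache first, then fall through to a replica lookup with retry against a different replica on error — can be sketched in Python. This is purely illustrative: the names, data shapes, and `Replica`/`Router` classes are hypothetical stand-ins; the real logic lives in ngx_lua with an nginx shared-memory zone, not a Python dict.

```python
import time

CACHE_TTL = 30.0  # seconds, per the post's shared-memory cache TTL


class Replica:
    """Hypothetical stand-in for one MySQL read replica holding the routing table."""

    def __init__(self, routes, healthy=True):
        self.routes = routes
        self.healthy = healthy

    def lookup(self, host):
        if not self.healthy:
            raise ConnectionError("replica unavailable")
        return self.routes[host]  # (fileserver_pair, on_disk_path)


class Router:
    """Cache-then-lookup routing decision, retrying across replicas on error."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.cache = {}  # host -> (expires_at, route)

    def route(self, host, now=None):
        now = time.monotonic() if now is None else now
        hit = self.cache.get(host)
        if hit and hit[0] > now:      # mitigation 2: cache absorbs MySQL blips
            return hit[1]
        last_err = None
        for replica in self.replicas:  # mitigation 1: retry a different replica
            try:
                route = replica.lookup(host)
            except ConnectionError as err:
                last_err = err
                continue
            self.cache[host] = (now + CACHE_TTL, route)
            return route
        raise last_err
```

The key property is visible in the control flow: a total replica outage within the 30-second window is invisible to already-routed hosts, because the cache answers before any replica is consulted.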

Key takeaways

  1. Pre-state — single pair, 8 DRBD partitions, 30-minute nginx map regen was the original architecture. "Even as Pages grew to serve thousands of requests per second to over half a million sites" the simple design held up.
  2. The storage ceiling, publish latency, and cold-restart cost together forced the rewrite.
  3. Routing decision moved from static map to per-request DB lookup. The new frontend queries one of GitHub's MySQL read replicas via an ngx_lua script to look up which backend fileserver pair hosts a given Pages site, then uses stock proxy_pass to forward. Canonical instance of patterns/db-routed-request-proxy.
  4. nginx config is strikingly small. The production config core is ~8 lines: set $gh_pages_host + $gh_pages_path, run the Lua router in access_by_lua_file, set X-GitHub-Pages-Root, proxy_pass http://$gh_pages_host$request_uri (verbatim from post). ngx_lua's integration with nginx means they "reuse nginx's rock-solid proxy functionality rather than reinventing that particular wheel on our own."
  5. Availability cost of the MySQL dependency is explicit and mitigated four ways. (a) query retries reconnect to a different read replica on error; (b) nginx shared memory zones cache routing lookups on the pages-fe node for 30 seconds to reduce MySQL load + absorb blips; (c) reads go to replicas so master failovers don't take Pages down — "existing Pages will remain online even during database maintenance windows where we have to take the rest of the site down"; (d) Fastly in front caches all 200 responses so cached sites survive a total router outage — canonical instance of patterns/cdn-in-front-for-availability-fallback.
  6. Fileserver tier is unchanged in shape, only sharded. Each pair of Dell R720s runs the same active/standby DRBD-synchronous-replication setup as the old single pair, and GitHub was "able to reuse large parts of our configuration and tooling" for it. The migration is structurally patterns/horizontally-scale-stateful-tier-via-pairs — horizontal scaling achieved by adding pairs, not by rearchitecting the pair.
  7. Fileserver nginx is trivially simple. The backend sets document root to $http_x_github_pages_root (after a little bit of validation to thwart any path traversal attempts). All the routing state lives at the frontend.
  8. Performance numbers disclosed. Less than 3 ms of each request in Lua at the 98th percentile including time spent in external network calls across millions of HTTP requests per hour. The measurement boundary is the Lua pipeline as a whole — MySQL lookup included — which makes the < 3 ms a load-bearing datum for DB-routed proxy viability at scale.
  9. Instant publishes + no cold-restart + horizontal storage scaling are the three operational wins. New Pages sites publish as soon as the MySQL row is written; nginx no longer loads a giant map on restart; adding fileserver pairs extends storage capacity without rebuilding the routing frontend.
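The ~8-line config core that takeaway 4 describes can be reconstructed roughly as the following nginx sketch. It is assembled from the elements the post names (the two variables, the Lua router hook, the header, the proxy_pass line); treat it as illustrative, not a verbatim copy of production config.

```nginx
location / {
    # the Lua router fills these two variables from the MySQL lookup
    set $gh_pages_host "";
    set $gh_pages_path "";

    access_by_lua_file /data/pages-lua/router.lua;

    # tell the fileserver which on-disk root to serve
    proxy_set_header X-GitHub-Pages-Root $gh_pages_path;

    # stock nginx proxying does the rest
    proxy_pass http://$gh_pages_host$request_uri;
}
```

Note that because `proxy_pass` uses a variable, nginx resolves the destination per request rather than at config-load time — which is exactly what removes the giant pre-generated map and its cold-start cost.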

Architectural shape

client → Fastly (CDN, caches all 200s)
       → LB → pages-fe (nginx + ngx_lua router on C5220)
                │  access_by_lua_file /data/pages-lua/router.lua
                │  (queries MySQL read replica, caches in shared
                │   memory zone for 30s, retries on different
                │   replica on error)
                │  sets $gh_pages_host + $gh_pages_path
                │  sets X-GitHub-Pages-Root = $gh_pages_path
                ↓ proxy_pass http://$gh_pages_host$request_uri
              pages-fs pair (Dell R720, active/standby,
                             DRBD sync replication across 8 partitions,
                             nginx document root = X-GitHub-Pages-Root,
                             path-traversal validation)
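The post only says the fileserver applies "a little bit of validation" to X-GitHub-Pages-Root before using it as the document root. The check below is one plausible shape for that guard — a containment check after path normalisation — not GitHub's actual code; `PAGES_ROOT` and the function name are hypothetical.

```python
import posixpath

PAGES_ROOT = "/data/pages"  # hypothetical base directory for all partitions


def safe_pages_root(header_value):
    """Accept the header only if it normalises to a path under PAGES_ROOT."""
    if not header_value.startswith(PAGES_ROOT + "/"):
        return None
    # collapse any ../ and ./ segments, then re-check containment
    normalised = posixpath.normpath(header_value)
    if not normalised.startswith(PAGES_ROOT + "/"):
        return None
    return normalised
```

The second check matters: a value like `/data/pages/3/../../etc/passwd` passes the prefix test but normalises out from under the base directory, so it must be rejected after normalisation.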

Key properties of the pre/post split:

| Dimension | 2014 (pre) | 2015 (post) |
| --- | --- | --- |
| Routing primitive | 30-min regenerated nginx map file | per-request MySQL read-replica lookup, 30 s cache |
| Storage ceiling | SSDs that fit in one machine | N × fileserver pairs |
| Publish latency | up to 30 min | instant |
| nginx restart cost | high (loads entire map) | normal |
| Availability dependency | local disk only | MySQL read replicas + Fastly front cache |
| Blast radius on router outage | total | Fastly-cached 200s survive |

Numbers disclosed

  • < 3 ms in Lua per request at the p98, including external network calls (MySQL).
  • Millions of HTTP requests per hour fleet-wide at time of post.
  • 30-second shared-memory cache TTL on routing lookups at pages-fe nodes.
  • 30-minute map regeneration cadence on the pre-2015 architecture (the main operational gripe).
  • Thousands of requests per second to over half a million sites served by the pre-2015 single-pair deployment.
  • 8 DRBD-backed partitions per fileserver pair (unchanged pre → post).
  • Hardware: Dell C5220 for frontend routers, Dell R720 for fileserver pairs (2015 hardware vintage).
  • In production since January 2015; post dated 2025-09-02 (the text is an adaptation of a 2015 post, republished on github.blog ten years later, hence the 2025 publication date).

Numbers NOT disclosed

  • Per-site origin-hit rate vs. Fastly cache-hit rate (no absolute offload ratio).
  • Number of fileserver pairs in production at the time of the post.
  • Routing-table cardinality (user count × site count — "over half a million sites" pre-2015, but not post-2015).
  • MySQL query rate at the pages-fe tier after the 30 s cache.
  • MySQL read-replica count + topology.
  • Failure rates during MySQL blips, and how often those blips occur.
  • Fastly cache hit-ratio under a router-outage scenario — the mitigation is structural, not measured.
  • Tail latency above p98.

Caveats

  • Post is an adaptation of GitHub's 2015 piece republished on github.blog in 2025; the architecture described is 10 years old. GitHub Pages has almost certainly evolved since — newer Fastly / routing / storage generations are not in scope. Ingest it as a historical-architecture datum, not current state.
  • No load-test methodology disclosed behind the < 3 ms p98 number.
  • No rollout-discipline details — how did GitHub migrate existing sites off the pre-2015 pair onto fileserver pairs? Blue-green / per-site cutover? Not covered.
  • DRBD's synchronous-replication trade-offs (latency ceiling on writes, split-brain handling, quorum semantics) not discussed.
  • Path-traversal validation on the fileserver ("a little bit of validation") is gestured at, not specified — no security posture statement.
  • Fastly-as-fallback property only holds for previously-cached pages + 200 responses; freshly-published or non-200 paths aren't covered by the outage tolerance.
  • "Availability dependency on MySQL" is framed as accepted; the actual observed Pages availability impact from MySQL incidents is not quantified.

Relationship to existing wiki

This post seeds the wiki's first canonical GitHub Pages entry + the first nginx / ngx_lua / DRBD / Fastly system pages. Complements existing MySQL coverage — previously framed as the start-small RDBMS that outgrew its workload shape (Canva case); now a second shape canonicalised: read-replica lookup inside a per-request routing decision, cache-softened and CDN-fronted. The post also connects to the wiki's broader routing corpus (Fly's fly-proxy, RIB/FIB ideas) at the DB-routed-proxy layer — GitHub Pages chose "read DB per request" where Fly chose "gossip state into every node's RIB then FIB-cache" — two different points on the routing-state-distribution axis.

No contradictions with existing wiki claims. Extends GitHub with a new pre-Engineering-blog-era deep-dive rooted in 2015 infrastructure.
