
GITHUB 2025-09-02 Tier 2


Rearchitecting GitHub Pages

Summary

Around early 2015, GitHub Pages outgrew its single-machine active/standby origin. The original design ran the entire service on a single pair of machines with user data spread across 8 DRBD-backed partitions and an nginx map file — hostname → on-disk path — that was regenerated by a cron every 30 minutes. Problems: (a) 30-minute publish latency for new sites, (b) long nginx cold-start while loading the map from disk, (c) storage capped at what fit in one machine.

The 2015 rearchitecture keeps the design philosophy — simple components we understand, don't prematurely solve problems that aren't yet problems — but splits the stack into two tiers: a stateless nginx-based routing tier on Dell C5220s that looks up the destination fileserver pair in MySQL per request via an ngx_lua script, then proxy_passes to a stateful fileserver tier of Dell R720 pairs (still active/standby, still DRBD-replicated). Publishes are now instant because the routing table is live. Cold-restart problem gone because nginx isn't loading a giant pre-generated map. Storage scales horizontally by adding fileserver pairs.

The MySQL dependency is a deliberate availability trade-off, mitigated four ways: (1) per-request retries against different read replicas on query error; (2) 30-second in-memory cache of routing lookups in nginx shared memory to tolerate MySQL blips; (3) reads go to replicas so master failovers don't affect Pages; (4) Fastly sits in front caching all 200 responses, so even a total router outage leaves cached sites online. Performance: < 3 ms spent in Lua per request at the 98th percentile across millions of requests/hour, including external network calls. In production since January 2015 as of the time of writing.
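Mitigations (1) and (2) — check a short-TTL cache first, then fall through to a replica lookup with retry against a different replica on error — can be sketched in Python. This is purely illustrative: the names, data shapes, and `Replica`/`Router` classes are hypothetical stand-ins; the real logic lives in ngx_lua with an nginx shared-memory zone, not a Python dict.

```python
import time

CACHE_TTL = 30.0  # seconds, per the post's shared-memory cache TTL


class Replica:
    """Hypothetical stand-in for one MySQL read replica holding the routing table."""

    def __init__(self, routes, healthy=True):
        self.routes = routes
        self.healthy = healthy

    def lookup(self, host):
        if not self.healthy:
            raise ConnectionError("replica unavailable")
        return self.routes[host]  # (fileserver_pair, on_disk_path)


class Router:
    """Cache-then-lookup routing decision, retrying across replicas on error."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.cache = {}  # host -> (expires_at, route)

    def route(self, host, now=None):
        now = time.monotonic() if now is None else now
        hit = self.cache.get(host)
        if hit and hit[0] > now:      # mitigation 2: cache absorbs MySQL blips
            return hit[1]
        last_err = None
        for replica in self.replicas:  # mitigation 1: retry a different replica
            try:
                route = replica.lookup(host)
            except ConnectionError as err:
                last_err = err
                continue
            self.cache[host] = (now + CACHE_TTL, route)
            return route
        raise last_err
```

The key property is visible in the control flow: a total replica outage within the 30-second window is invisible to already-routed hosts, because the cache answers before any replica is consulted.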

Key takeaways

  1. Pre-state — single pair, 8 DRBD partitions, 30-minute nginx map regen was the original architecture. "Even as Pages grew to serve thousands of requests per second to over half a million sites" the simple design held up.
  2. The storage ceiling, publish latency, and cold-restart cost together forced the rewrite.
  3. Routing decision moved from static map to per-request DB lookup. The new frontend queries one of GitHub's MySQL read replicas via an ngx_lua script to look up which backend fileserver pair hosts a given Pages site, then uses stock proxy_pass to forward. Canonical instance of patterns/db-routed-request-proxy.
  4. nginx config is strikingly small. The production config core is ~8 lines: set $gh_pages_host + $gh_pages_path, run the Lua router in access_by_lua_file, set X-GitHub-Pages-Root, proxy_pass http://$gh_pages_host$request_uri (verbatim from post). ngx_lua's integration with nginx means they "reuse nginx's rock-solid proxy functionality rather than reinventing that particular wheel on our own."
  5. Availability cost of the MySQL dependency is explicit and mitigated four ways. (a) query retries reconnect to a different read replica on error; (b) nginx shared memory zones cache routing lookups on the pages-fe node for 30 seconds to reduce MySQL load + absorb blips; (c) reads go to replicas so master failovers don't take Pages down — "existing Pages will remain online even during database maintenance windows where we have to take the rest of the site down"; (d) Fastly in front caches all 200 responses so cached sites survive a total router outage — canonical instance of patterns/cdn-in-front-for-availability-fallback.
  6. Fileserver tier is unchanged in shape, only sharded. Each pair of Dell R720s runs the same active/standby DRBD-synchronous-replication setup as the old single pair, and GitHub was "able to reuse large parts of our configuration and tooling" for it. The migration is structurally patterns/horizontally-scale-stateful-tier-via-pairs — horizontal scaling achieved by adding pairs, not by rearchitecting the pair.
  7. Fileserver nginx is trivially simple. The backend sets document root to $http_x_github_pages_root (after a little bit of validation to thwart any path traversal attempts). All the routing state lives at the frontend.
  8. Performance numbers disclosed. Less than 3 ms of each request in Lua at the 98th percentile including time spent in external network calls across millions of HTTP requests per hour. The measurement boundary is the Lua pipeline as a whole — MySQL lookup included — which makes the < 3 ms a load-bearing datum for DB-routed proxy viability at scale.
  9. Instant publishes + no cold-restart + horizontal storage scaling are the three operational wins. New Pages sites publish as soon as the MySQL row is written; nginx no longer loads a giant map on restart; adding fileserver pairs extends storage capacity without rebuilding the routing frontend.
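The ~8-line config core that takeaway 4 describes can be reconstructed roughly as the following nginx sketch. It is assembled from the elements the post names (the two variables, the Lua router hook, the header, the proxy_pass line); treat it as illustrative, not a verbatim copy of production config.

```nginx
location / {
    # the Lua router fills these two variables from the MySQL lookup
    set $gh_pages_host "";
    set $gh_pages_path "";

    access_by_lua_file /data/pages-lua/router.lua;

    # tell the fileserver which on-disk root to serve
    proxy_set_header X-GitHub-Pages-Root $gh_pages_path;

    # stock nginx proxying does the rest
    proxy_pass http://$gh_pages_host$request_uri;
}
```

Note that because `proxy_pass` uses a variable, nginx resolves the destination per request rather than at config-load time — which is exactly what removes the giant pre-generated map and its cold-start cost.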

Architectural shape

client → Fastly (CDN, caches all 200s)
       → LB → pages-fe (nginx + ngx_lua router on C5220)
                │  access_by_lua_file /data/pages-lua/router.lua
                │  (queries MySQL read replica, caches in shared
                │   memory zone for 30s, retries on different
                │   replica on error)
                │  sets $gh_pages_host + $gh_pages_path
                │  sets X-GitHub-Pages-Root = $gh_pages_path
                ↓ proxy_pass http://$gh_pages_host$request_uri
              pages-fs pair (Dell R720, active/standby,
                             DRBD sync replication across 8 partitions,
                             nginx document root = X-GitHub-Pages-Root,
                             path-traversal validation)
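The post only says the fileserver applies "a little bit of validation" to X-GitHub-Pages-Root before using it as the document root. The check below is one plausible shape for that guard — a containment check after path normalisation — not GitHub's actual code; `PAGES_ROOT` and the function name are hypothetical.

```python
import posixpath

PAGES_ROOT = "/data/pages"  # hypothetical base directory for all partitions


def safe_pages_root(header_value):
    """Accept the header only if it normalises to a path under PAGES_ROOT."""
    if not header_value.startswith(PAGES_ROOT + "/"):
        return None
    # collapse any ../ and ./ segments, then re-check containment
    normalised = posixpath.normpath(header_value)
    if not normalised.startswith(PAGES_ROOT + "/"):
        return None
    return normalised
```

The second check matters: a value like `/data/pages/3/../../etc/passwd` passes the prefix test but normalises out from under the base directory, so it must be rejected after normalisation.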

Key properties of the pre/post split:

| Dimension | 2014 (pre) | 2015 (post) |
| --- | --- | --- |
| Routing primitive | 30-min regenerated nginx map file | per-request MySQL read-replica lookup, 30 s cache |
| Storage ceiling | SSDs that fit in one machine | N × fileserver pairs |
| Publish latency | up to 30 min | instant |
| nginx restart cost | high (loads entire map) | normal |
| Availability dependency | local disk only | MySQL read replicas + Fastly front cache |
| Blast radius on router outage | total | Fastly-cached 200s survive |

Numbers disclosed

  • < 3 ms in Lua per request at the p98, including external network calls (MySQL).
  • Millions of HTTP requests per hour fleet-wide at time of post.
  • 30-second shared-memory cache TTL on routing lookups at pages-fe nodes.
  • 30-minute map regeneration cadence on the pre-2015 architecture (the main operational gripe).
  • Thousands of requests per second to over half a million sites served by the pre-2015 single-pair deployment.
  • 8 DRBD-backed partitions per fileserver pair (unchanged pre → post).
  • Hardware: Dell C5220 for frontend routers, Dell R720 for fileserver pairs (2015 hardware vintage).
  • In production since January 2015; post dated 2025-09-02 (the text is an adaptation of a 2015 post, republished on github.blog ten years later, hence the 2025 publication date).

Numbers NOT disclosed

  • Per-site origin-hit rate vs. Fastly cache-hit rate (no absolute offload ratio).
  • Number of fileserver pairs in production at the time of the post.
  • Routing-table cardinality (user count × site count — "over half a million sites" pre-2015, but not post-2015).
  • MySQL query rate at the pages-fe tier after the 30 s cache.
  • MySQL read-replica count + topology.
  • Failure rates during MySQL blips, and how often those blips occur.
  • Fastly cache hit-ratio under a router-outage scenario — the mitigation is structural, not measured.
  • Tail latency above p98.

Caveats

  • Post is an adaptation of GitHub's 2015 piece republished on github.blog in 2025; the architecture described is 10 years old. GitHub Pages has almost certainly evolved since — newer Fastly / routing / storage generations are not in scope. Ingest it as a historical-architecture datum, not current state.
  • No load-test methodology disclosed behind the < 3 ms p98 number.
  • No rollout-discipline details — how did GitHub migrate existing sites off the pre-2015 pair onto fileserver pairs? Blue-green / per-site cutover? Not covered.
  • DRBD's synchronous-replication trade-offs (latency ceiling on writes, split-brain handling, quorum semantics) not discussed.
  • Path-traversal validation on the fileserver ("a little bit of validation") is gestured at, not specified — no security posture statement.
  • Fastly-as-fallback property only holds for previously-cached pages + 200 responses; freshly-published or non-200 paths aren't covered by the outage tolerance.
  • "Availability dependency on MySQL" is framed as accepted; the actual observed Pages availability impact from MySQL incidents is not quantified.

Relationship to existing wiki

This post seeds the wiki's first canonical GitHub Pages entry + the first nginx / ngx_lua / DRBD / Fastly system pages. Complements existing MySQL coverage — previously framed as the start-small RDBMS that outgrew its workload shape (Canva case); now a second shape canonicalised: read-replica lookup inside a per-request routing decision, cache-softened and CDN-fronted. The post also connects to the wiki's broader routing corpus (Fly's fly-proxy, RIB/FIB ideas) at the DB-routed-proxy layer — GitHub Pages chose "read DB per request" where Fly chose "gossip state into every node's RIB then FIB-cache" — two different points on the routing-state-distribution axis.

No contradictions with existing wiki claims. Extends GitHub with a new pre-Engineering-blog-era deep-dive rooted in 2015 infrastructure.
