
PATTERN Cited by 1 source

CDN in Front for Availability Fallback

Shape

Put a CDN in front of an origin, configured to cache successful responses. Primary purpose isn't latency or bandwidth — it's that when the origin is totally down, the CDN keeps serving the cached set. The cache is the outage-survivability boundary, not a performance layer.

client → CDN (caches 200s)
        ↓ cache-hit → serve from CDN
        ↓ cache-miss → origin (may be down)

If the origin is down, cache-hits still succeed; only cache-misses fail. For a service with a high hit ratio across a long tail of URLs — static sites, published content, widely shared API responses — this means most of the user-visible surface stays up during origin outages.

When to use

  • Origin has a high-impact outage mode (single-region, centralised control plane, dependency-coupled service).
  • Response surface is cacheable — deterministic responses for a URL + status code, not user-personalised.
  • You're willing to accept "stale for TTL" as the freshness floor on outage paths.
  • Freshly published content being unavailable during an origin outage is acceptable — the pattern doesn't cover cache misses.

What the pattern does NOT do

  • Freshly published content during an outage — if a site is published during the outage, its response isn't in the cache. This is definitional.
  • Non-200 paths — a typical config caches only 200s. 404s, 5xxs, and redirects pass through to the origin and fail during an outage.
  • Mutating requests — POST / PUT / DELETE aren't cached and propagate to origin.

These are intrinsic limits; the pattern is a boundary condition, not a replacement for origin availability.
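The three exclusions reduce to one predicate. A minimal sketch, assuming the typical "cache only 200s on safe methods" policy described above (the function name and shape are hypothetical, for illustration only):

```python
def survives_origin_outage(method, cached_status):
    """True iff a request can be served from the edge with the origin down.

    cached_status: the status previously cached for this URL, or None if
    nothing was cached (freshly published content, non-200 paths, etc.).
    """
    # Mutating requests always propagate to the origin; only a stored 200
    # can be served without it.
    return method in ("GET", "HEAD") and cached_status == 200

survives_origin_outage("GET", 200)     # warm cache hit: survives
survives_origin_outage("GET", None)    # freshly published: fails
survives_origin_outage("GET", 404)     # non-200 path: fails
survives_origin_outage("POST", 200)    # mutation: fails
```

Everything outside that predicate is the residual outage surface the origin itself still owns.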

Required operational posture

  1. Cache policy is part of the availability story — the Cache-Control + stale-while-revalidate + stale-if-error directives become load-bearing for uptime, not just perf. Code review them with that lens.
  2. Purge mechanics can't share the outage blast radius — if origin outage = purge outage (same control plane), you can't cut stale responses during an incident.
  3. Hit-ratio monitoring is an availability signal — a drop in hit ratio below the expected floor is an early warning that outage survivability has degraded.
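The load-bearing directives in point 1 can be sketched as a single response header. The TTL values here are illustrative assumptions, not values from the source:

```
Cache-Control: public, max-age=300, stale-while-revalidate=60, stale-if-error=86400
```

`stale-if-error` (RFC 5861) is the directive that turns staleness into an availability mechanism: it permits a cache to keep serving a stored response when the origin errors or is unreachable, for up to the stated window — here a full day past freshness. Note that CDN support for these standard directives varies; some edges express equivalent serve-stale behavior in their own configuration language instead of honoring the header, so the review-with-an-availability-lens advice applies to that config too.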

Canonical wiki instance

GitHub Pages puts Fastly in front of the origin-side nginx routing tier. Per the source: "We also have Fastly sitting in front of GitHub Pages caching all 200 responses. This helps minimise the availability impact of a total Pages router outage. Even in this worst case scenario, cached Pages sites are still online and unaffected." Source: sources/2025-09-02-github-rearchitecting-github-pages.

The outage being mitigated is specifically the Pages router outage — which is also the point of MySQL coupling introduced by the DB-routed request proxy. Fastly-in-front is the compensating mechanism that lets the router take the MySQL dependency without it dominating the Pages availability number.

Trade-offs vs. alternatives

  • vs. just scaling the origin — origin scaling doesn't help against control-plane outages (DB down, routing bug deployed). Edge cache does.
  • vs. multi-region active-active origin — much more expensive, much more complex. Edge-cache fallback is the cheap step on the path; multi-region is the next step when cache-miss outage-survivability becomes the bottleneck.
  • vs. serve-stale-on-error from origin cache — same idea but co-located with the failure domain. Doesn't survive "origin totally unreachable" which is the case the pattern specifically targets.