Connection failure rate

Connection failure rate is a metric measuring how often a reverse proxy or CDN edge fails to establish a connection to an origin server when attempting to retrieve uncacheable content, or content not in (or expired from) cache. It is a third-party uptime signal for whichever system hosts the origin (a website, a cloud region, an on-prem data center), derived from real proxy traffic rather than from synthetic probes.

The canonical definition comes from Cloudflare's Q1 2026 disruption review, published 2026-04-28:

"Connection failures occur when Cloudflare fails to successfully connect to an origin server when attempting to retrieve uncacheable content, or content not in/expired from cache."

Why cache-miss paths are the right measurement surface

A reverse proxy serves most requests from cache. Cache hits never touch the origin and therefore don't carry origin-health signal. The connection failure rate is measured on the subset of traffic that must go to the origin:

  • Uncacheable content (Cache-Control: no-store, Cache-Control: private, POST/PUT/DELETE requests).
  • Content not in cache (cold / low-popularity / new).
  • Content whose cache entry has expired (max-age elapsed, revalidation needed).

For these requests, the edge must reach the origin, typically over a fresh TCP + TLS connection. Each connection attempt, successful or failed, contributes a data point to the rate.
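
A sketch of that denominator selection, with a simplified cache-state label and naive Cache-Control parsing standing in for real edge logic (both are assumptions, as is every name here):

  UNSAFE_METHODS = {"POST", "PUT", "DELETE"}

  def exercises_origin(method: str, cache_control: str, cache_state: str) -> bool:
      # True if the request must reach the origin and therefore
      # contributes a measurement to the connection failure rate.
      # cache_state is one of "hit", "miss", "expired".
      cc = cache_control.lower()
      uncacheable = "no-store" in cc or "private" in cc
      return (
          method in UNSAFE_METHODS               # writes always go to origin
          or uncacheable                         # edge may not serve from cache
          or cache_state in ("miss", "expired")  # nothing fresh to serve
      )

A cacheable response already in cache is the only case that stays off the origin path, which is exactly why cache hits carry no origin-health signal.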

The third-party uptime role

Cloud providers publish their own status dashboards, but those reflect the provider's internal view of whether services are operational. The connection-failure-rate signal reflects the external caller's view of whether the provider's origin is reachable:

  • The provider's own health checks may pass while callers time out: a network-path failure the provider doesn't see.
  • The provider's dashboard may lag the real failure: internal aggregation and editorial review delay publication, while the live traffic drop is immediate.
  • The provider's dashboard may omit failures it classifies as customer configuration issues, even when they are symptoms of a platform problem.

During the March 2026 drone strikes on AWS Middle East facilities (the canonical case in the sysdesign-wiki corpus), Cloudflare Cloud Observatory showed "elevated connection failure rates" for me-central-1 and me-south-1 "beginning March 1-2 and remaining higher for multiple days." This was independent, live, time-pinnable evidence of the regional degradation — a neutral evidence base that didn't depend on Amazon's own disclosure cadence.
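
An "elevated" call of this kind is, in effect, a comparison of live per-region rates against a trailing baseline. A sketch of that flagging logic, where the 3x threshold is an illustrative assumption rather than Cloudflare's published rule:

  def elevated_regions(
      rates: dict[str, float],      # current failure rate per origin region
      baselines: dict[str, float],  # trailing-window baseline per region
      ratio: float = 3.0,           # illustrative "elevated" threshold
  ) -> list[str]:
      # Flag regions whose live failure rate is well above their own
      # baseline, the shape of signal that surfaced me-central-1 and
      # me-south-1 in the March 2026 case.
      return [
          region for region, rate in rates.items()
          if rate > ratio * max(baselines.get(region, 0.0), 1e-6)
      ]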

What it can and cannot detect

It can detect:

  • Origin unreachability from the reverse proxy's network location (all proxy POPs, or a subset).
  • Region-wide capacity or network failures — not just service-level outages.
  • Path degradations visible from the proxy's vantage point even when the origin thinks it's fine.

It cannot detect:

  • Service-level failures that return fast HTTP 5xx (those arrive as completed connections carrying error responses, not as connection failures).
  • Origin issues that only affect cached content (cache hits don't re-exercise the path).
  • Silent corruption where connections succeed but the content is wrong.

For those failure modes, pair the metric with origin HTTP status-code distributions and end-to-end response-correctness signals.
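
A sketch of how the two signals separate, assuming a hypothetical per-fetch record with connected and status fields:

  from collections import Counter

  def classify(connected: bool, status: int | None) -> str:
      # Bucket one origin fetch into the signal it feeds.
      if not connected:
          return "connection_failure"  # feeds the connection failure rate
      if status is not None and status >= 500:
          return "http_5xx"            # feeds the status-code distribution
      return "ok"

  def signal_counts(fetches: list[tuple[bool, int | None]]) -> Counter:
      # e.g. signal_counts([(True, 200), (False, None), (True, 503)])
      return Counter(classify(c, s) for c, s in fetches)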

Vantage-point considerations

A connection-failure-rate metric is always observer-relative:

  • Cloudflare's edge sees failures from its POP network; its view can differ from the view of clients using other networks (e.g., clients on ISPs that don't peer with Cloudflare).
  • Per-origin-region granularity is meaningful because cloud regions are physically distinct environments; per-service or per-tenant granularity is often a better fit for service-level triage.
  • Baseline variance differs by region: regions with low-popularity workloads have more cache-miss traffic, and therefore more origin-exercising measurements per unit time, than regions saturated with cacheable CDN traffic (see the interval sketch below).
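
Because attempt counts differ so much across regions, raw rates are not directly comparable; an interval estimate makes the sampling variance explicit. A sketch using the standard Wilson score interval (general statistics, not anything Cloudflare documents):

  import math

  def wilson_interval(failures: int, attempts: int, z: float = 1.96):
      # 95% Wilson score interval for a failure rate. Regions with
      # little cache-miss traffic yield few attempts and thus wide
      # intervals; compare intervals across regions, not raw rates.
      if attempts == 0:
          return (0.0, 1.0)
      p = failures / attempts
      denom = 1 + z * z / attempts
      center = (p + z * z / (2 * attempts)) / denom
      margin = (z / denom) * math.sqrt(
          p * (1 - p) / attempts + z * z / (4 * attempts * attempts)
      )
      return (max(0.0, center - margin), min(1.0, center + margin))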
