Connection failure rate

Connection failure rate is a metric measuring how often a reverse proxy or CDN edge fails to establish a connection to an origin server when attempting to retrieve uncacheable content, or content not in (or expired from) cache. It is a third-party uptime signal for whichever system hosts the origin (a website, a cloud region, an on-prem data center), derived from real proxy traffic rather than from synthetic probes.

The canonical definition comes from Cloudflare's Q1 2026 disruption review, published 2026-04-28:

"Connection failures occur when Cloudflare fails to successfully connect to an origin server when attempting to retrieve uncacheable content, or content not in/expired from cache."

Why cache-miss paths are the right measurement surface

A reverse proxy serves most requests from cache. Cache hits never touch the origin and therefore don't carry origin-health signal. The connection failure rate is measured on the subset of traffic that must go to the origin:

  • Uncacheable content (Cache-Control: no-store, Cache-Control: private, POST/PUT/DELETE requests).
  • Content not in cache (cold / low-popularity / new).
  • Content whose cache entry has expired (max-age elapsed, revalidation needed).

For these requests, the edge must reach the origin, typically over a fresh TCP + TLS connection. Each connection attempt, successful or failed, contributes a data point to the rate.
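
A sketch of that denominator selection, with a simplified cache-state label and naive Cache-Control parsing standing in for real edge logic (both are assumptions, as is every name here):

  UNSAFE_METHODS = {"POST", "PUT", "DELETE"}

  def exercises_origin(method: str, cache_control: str, cache_state: str) -> bool:
      # True if the request must reach the origin and therefore
      # contributes a measurement to the connection failure rate.
      # cache_state is one of "hit", "miss", "expired".
      cc = cache_control.lower()
      uncacheable = "no-store" in cc or "private" in cc
      return (
          method in UNSAFE_METHODS               # writes always go to origin
          or uncacheable                         # edge may not serve from cache
          or cache_state in ("miss", "expired")  # nothing fresh to serve
      )

A cacheable response already in cache is the only case that stays off the origin path, which is exactly why cache hits carry no origin-health signal.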

The third-party uptime role

Cloud providers publish their own status dashboards, but those reflect the provider's internal view of whether services are operational. The connection-failure-rate signal reflects the external caller's view of whether the provider's origin is reachable:

  • The provider's own health checks may pass while callers time out: a network-path failure the provider doesn't see.
  • The provider's dashboard may lag the real failure: internal aggregation and editorial review delay publication, while the live traffic drop is immediate.
  • The provider's dashboard may omit failures it classifies as customer configuration issues, even when they are symptoms of a platform problem.

During the March 2026 drone strikes on AWS Middle East facilities (the canonical case in the sysdesign-wiki corpus), Cloudflare Cloud Observatory showed "elevated connection failure rates" for me-central-1 and me-south-1 "beginning March 1-2 and remaining higher for multiple days." This was independent, live, time-pinnable evidence of the regional degradation — a neutral evidence base that didn't depend on Amazon's own disclosure cadence.
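
An "elevated" call of this kind is, in effect, a comparison of live per-region rates against a trailing baseline. A sketch of that flagging logic, where the 3x threshold is an illustrative assumption rather than Cloudflare's published rule:

  def elevated_regions(
      rates: dict[str, float],      # current failure rate per origin region
      baselines: dict[str, float],  # trailing-window baseline per region
      ratio: float = 3.0,           # illustrative "elevated" threshold
  ) -> list[str]:
      # Flag regions whose live failure rate is well above their own
      # baseline, the shape of signal that surfaced me-central-1 and
      # me-south-1 in the March 2026 case.
      return [
          region for region, rate in rates.items()
          if rate > ratio * max(baselines.get(region, 0.0), 1e-6)
      ]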

What it can and cannot detect

It can detect:

  • Origin unreachability from the reverse proxy's network location (all proxy POPs, or a subset).
  • Region-wide capacity or network failures — not just service-level outages.
  • Path degradations visible from the proxy's vantage point even when the origin thinks it's fine.

It cannot detect:

  • Service-level failures that return fast HTTP 5xx (those arrive as completed connections carrying error responses, not as connection failures).
  • Origin issues that only affect cached content (cache hits don't re-exercise the path).
  • Silent corruption where connections succeed but the content is wrong.

For those failure modes, pair the metric with origin HTTP status-code distributions and end-to-end response-correctness signals.
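
A sketch of how the two signals separate, assuming a hypothetical per-fetch record with connected and status fields:

  from collections import Counter

  def classify(connected: bool, status: int | None) -> str:
      # Bucket one origin fetch into the signal it feeds.
      if not connected:
          return "connection_failure"  # feeds the connection failure rate
      if status is not None and status >= 500:
          return "http_5xx"            # feeds the status-code distribution
      return "ok"

  def signal_counts(fetches: list[tuple[bool, int | None]]) -> Counter:
      # e.g. signal_counts([(True, 200), (False, None), (True, 503)])
      return Counter(classify(c, s) for c, s in fetches)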

Vantage-point considerations

A connection-failure-rate metric is always observer-relative:

  • Cloudflare's edge sees failures from its POP network; its view can differ from the view of clients using other networks (e.g., clients on ISPs that don't peer with Cloudflare).
  • Per-origin-region granularity is meaningful because cloud regions are physically distinct environments; per-service or per-tenant granularity is often a better fit for service-level triage.
  • Baseline variance differs by region: regions with low-popularity workloads have more cache-miss traffic, and therefore more origin-exercising measurements per unit time, than regions saturated with cacheable CDN traffic (see the interval sketch below).
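
Because attempt counts differ so much across regions, raw rates are not directly comparable; an interval estimate makes the sampling variance explicit. A sketch using the standard Wilson score interval (general statistics, not anything Cloudflare documents):

  import math

  def wilson_interval(failures: int, attempts: int, z: float = 1.96):
      # 95% Wilson score interval for a failure rate. Regions with
      # little cache-miss traffic yield few attempts and thus wide
      # intervals; compare intervals across regions, not raw rates.
      if attempts == 0:
          return (0.0, 1.0)
      p = failures / attempts
      denom = 1 + z * z / attempts
      center = (p + z * z / (2 * attempts)) / denom
      margin = (z / denom) * math.sqrt(
          p * (1 - p) / attempts + z * z / (4 * attempts * attempts)
      )
      return (max(0.0, center - margin), min(1.0, center + margin))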
