PATTERN Cited by 1 source

Retry on 5xx, not 4xx

Pattern

Retry client-side on 5xx responses and on timeouts. Do not retry on 4xx responses. The HTTP status-code class encodes whether the failure is likely transient (5xx — server reported it can't serve right now) or persistent (4xx — the request is wrong and the server has already told you so).

Zalando's timeouts post states the rule verbatim:

"Retry on timeout errors and 5xx errors. Do not retry on 4xx errors." (Source: sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts)

Why the rule works

5xx ≈ server-side, likely transient.

  • 500 Internal Server Error: unhandled exception; may succeed on retry if it was transient (DB hiccup, GC pause, instance recycle).
  • 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout: upstream is temporarily unavailable; the load balancer may pick a different backend on the next retry.
  • 501 Not Implemented, 505 HTTP Version Not Supported: these are persistent and should generally not be retried, which is why the rule is a strong default rather than an absolute law.

4xx ≈ client-side, persistent.

  • 400 Bad Request: malformed input; will fail again.
  • 401 Unauthorized / 403 Forbidden: credentials / policy; will fail again.
  • 404 Not Found: resource doesn't exist; will fail again.
  • 422 Unprocessable Entity: semantic error; will fail again.

Exceptions worth naming:

  • 408 Request Timeout: retryable by design (same failure mode as a timeout).
  • 425 Too Early: retryable once the TLS handshake has completed; the server declined to process a request sent in TLS early data because it might be replayed (RFC 8470).
  • 429 Too Many Requests: retryable with backoff, respecting the Retry-After header.

Client libraries that implement the rule typically default to "retry 5xx and timeouts; respect 429; never retry other 4xx."
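The classification above can be sketched as a small predicate. This is a minimal illustration, not any particular library's behaviour; the function name `is_retryable_status` and the two sets are hypothetical and simply encode the status codes listed in this section.

```python
# Hypothetical helper: encodes the status codes discussed in this section.
# 4xx codes that are retryable exceptions to "do not retry on 4xx".
RETRYABLE_4XX = {408, 425, 429}
# 5xx codes that are persistent exceptions to "retry on 5xx".
NON_RETRYABLE_5XX = {501, 505}


def is_retryable_status(status: int) -> bool:
    """Return True if a response with this status code is worth retrying."""
    if 500 <= status <= 599:
        return status not in NON_RETRYABLE_5XX
    if 400 <= status <= 499:
        return status in RETRYABLE_4XX
    # 1xx, 2xx and 3xx are not failures in the sense of this rule.
    return False
```

A caller would still need to apply backoff for 429 and honour Retry-After; the predicate only answers whether a retry is worth considering at all.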

Also retry: connection errors and timeouts

Zalando's rule names timeouts explicitly alongside 5xx. Expanding on the rationale: a timeout is the ambiguous failure case — the client doesn't know whether the request reached the server or not. The retry decision is governed by idempotency rather than by status code. For idempotent (or idempotency-keyed) operations, retrying timeouts is safe and captures transient network drops. For non-idempotent operations, see the Zalando post's companion discipline — use an Idempotency-Key header or do not retry.
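As a rough sketch of the idempotency check, the decision for a timed-out request might look like the following. The function name `may_retry_after_timeout` is hypothetical; the set of idempotent methods follows the HTTP semantics specification (RFC 9110), and the code assumes the server actually honours an Idempotency-Key header, which has to be verified per API.

```python
from typing import Mapping

# Methods defined as idempotent by RFC 9110.
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"}


def may_retry_after_timeout(method: str, headers: Mapping[str, str]) -> bool:
    """Decide whether a request that timed out is safe to send again.

    Assumption: a request carrying an Idempotency-Key header is deduplicated
    by the server. If the server ignores the header, this check is unsafe.
    """
    if method.upper() in IDEMPOTENT_METHODS:
        return True
    # Header names are case-insensitive, so compare in lower case.
    return any(name.lower() == "idempotency-key" for name in headers)
```

For example, `may_retry_after_timeout("POST", {})` is false, while the same POST with an `Idempotency-Key` header is treated as retryable.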

Guardrails

The rule is a safe default, not a guarantee of safety. It must be paired with:

  • Circuit breaker — retries during a mass-5xx outage will amplify load on the failing downstream and delay its recovery. The Zalando post is explicit: "always consider implementing circuit breakers when enabling retry. Retries that increase load can make matters significantly worse."
  • Backoff + jitter — a tight retry loop on 5xx is itself a denial-of-service attack on the failing backend.
  • Bounded retry count: 2–3 retries is typical; beyond that, each extra attempt extends wall-clock time with little gain in success rate.
  • Retry budget at the caller level — never retry so much that retries exceed some fraction (commonly 10%) of the original request volume; beyond that, the retries are the load problem.
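Three of these guardrails (bounded count, backoff with jitter, and a retry budget) can be combined in one loop, sketched below. Everything here is illustrative: `RetryBudget`, `backoff_delay` and `call_with_retries` are hypothetical names, the budget uses lifetime counters where a production version would likely use a sliding window, the base and cap delays are arbitrary, and for brevity the loop retries on any status of 500 or above. A circuit breaker is not shown.

```python
import random
import time
from typing import Callable


class RetryBudget:
    """Permit retries only while they stay within `ratio` of original requests."""

    def __init__(self, ratio: float = 0.1) -> None:
        self.ratio = ratio
        self.requests = 0  # original (first-attempt) requests seen
        self.retries = 0   # retries granted so far

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        """Grant one retry if doing so keeps retries <= ratio * requests."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 2.0) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(
    send: Callable[[], int],
    budget: RetryBudget,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` (which returns an HTTP status) with bounded, budgeted retries."""
    budget.record_request()
    status = send()
    for attempt in range(max_retries):
        # Stop on success or client error, or when the budget is exhausted.
        if status < 500 or not budget.try_spend():
            break
        sleep(backoff_delay(attempt))
        status = send()
    return status
```

With the default ratio of 0.1, a caller that has made only one request gets no retries at all; the budget only opens up once there is enough original traffic for retries to remain a small fraction of it.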

When p99 ≈ p50, don't retry

Zalando's post names a distributional test for retry applicability: if the downstream's latency distribution has p99 close to p50 (timeouts are periodic rather than occasional), retries don't help, because slow responses are the typical case rather than outliers. Retries in that regime multiply load without improving success rate.

See concepts/tail-latency-at-scale for the distributional framing.
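One way to turn the p99 versus p50 test into a check over recorded latencies is sketched below. The function name `retries_likely_help` and the ratio threshold of 1.5 are assumptions made for illustration; the source does not give a numeric cutoff, so the threshold should be tuned against real data.

```python
import statistics
from typing import Sequence


def retries_likely_help(latencies_ms: Sequence[float], min_ratio: float = 1.5) -> bool:
    """Return True if p99 is well above p50, i.e. slow requests are outliers.

    `min_ratio` is an arbitrary illustrative threshold, not a value from the source.
    """
    # quantiles(..., n=100) returns 99 cut points; index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100)
    p50, p99 = cuts[49], cuts[98]
    if p50 <= 0:
        raise ValueError("latencies must be positive")
    return p99 / p50 >= min_ratio
```

A sample of mostly 10 ms responses with an occasional 1000 ms outlier passes the check, while a sample where every request takes about the same time does not.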

Seen in
