PATTERN

Retry on 5xx, not 4xx

Pattern

Retry client-side on 5xx responses and on timeouts. Do not retry on 4xx responses. The HTTP status-code class encodes whether the failure is likely transient (5xx — server reported it can't serve right now) or persistent (4xx — the request is wrong and the server has already told you so).

Zalando's timeouts post states the rule verbatim:

"Retry on timeout errors and 5xx errors. Do not retry on 4xx errors." (Source: )

Why the rule works

5xx ≈ server-side, likely transient.

- 500 Internal Server Error: unhandled exception; may succeed on retry if it was transient (DB hiccup, GC pause, instance recycle).
- 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout: upstream is temporarily unavailable; the load balancer may pick a different backend on the next retry.
- 501 Not Implemented, 505 HTTP Version Not Supported: these are persistent and should generally not be retried, which is why the rule is a strong default rather than an absolute law.

4xx ≈ client-side, persistent.

- 400 Bad Request: malformed input; will fail again.
- 401 Unauthorized / 403 Forbidden: credentials / policy; will fail again.
- 404 Not Found: resource doesn't exist; will fail again.
- 422 Unprocessable Entity: semantic error; will fail again.

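Exceptions worth naming:

- 408 Request Timeout: retryable by design (same failure mode as a timeout).
- 425 Too Early: the server declined to process a request sent as TLS early data because it could be replayed; retryable once the TLS handshake has completed.
- 429 Too Many Requests: retryable with backoff, honoring the Retry-After header when the server sends one.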

Most client libraries that implement the rule default to "retry 5xx and timeouts; respect 429; never retry other 4xx."
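The classification above can be sketched as a small predicate. This is a minimal illustration, not any particular library's API: the names `is_retryable_status`, `retry_after_seconds`, `RETRYABLE_4XX`, and `NON_RETRYABLE_5XX` are hypothetical, and the two sets simply mirror the exceptions listed in this section.

```python
from typing import Optional

# Hypothetical policy tables mirroring the exceptions named in this section.
RETRYABLE_4XX = frozenset({408, 425, 429})
NON_RETRYABLE_5XX = frozenset({501, 505})


def is_retryable_status(status: int) -> bool:
    """Return True if a response with this status code is worth retrying."""
    if 500 <= status <= 599:
        # 5xx is retryable by default, minus the persistent cases.
        return status not in NON_RETRYABLE_5XX
    if 400 <= status <= 499:
        # 4xx is persistent by default, minus the named exceptions.
        return status in RETRYABLE_4XX
    return False


def retry_after_seconds(header_value: Optional[str]) -> Optional[float]:
    """Parse the delay-seconds form of Retry-After; None if absent or unparsed.

    Assumption: the HTTP-date form of the header is not handled in this sketch.
    """
    if header_value is None:
        return None
    value = header_value.strip()
    return float(value) if value.isdigit() else None
```

A caller would typically check `is_retryable_status(response_status)` first and, for a 429, wait at least `retry_after_seconds(...)` when it returns a value before falling back to its own backoff schedule.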

Also retry: connection errors and timeouts

Zalando's rule names timeouts explicitly alongside 5xx. Expanding on the rationale: a timeout is the ambiguous failure case — the client doesn't know whether the request reached the server or not. The retry decision is governed by idempotency rather than by status code. For idempotent (or idempotency-keyed) operations, retrying timeouts is safe and captures transient network drops. For non-idempotent operations, see the Zalando post's companion discipline — use an Idempotency-Key header or do not retry.
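One way to express the idempotency gate described above, as a sketch under stated assumptions: the function name `may_retry_after_timeout` is hypothetical, the method set follows the idempotent methods defined by HTTP semantics, and the code assumes the server actually honors both that contract and the Idempotency-Key header.

```python
from typing import Mapping, Optional

# Methods that HTTP semantics define as idempotent (safe methods plus PUT and DELETE).
# Assumption: the downstream service implements them idempotently in practice.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"})


def may_retry_after_timeout(
    method: str, headers: Optional[Mapping[str, str]] = None
) -> bool:
    """Decide whether a timed-out request may be retried.

    A timeout says nothing about whether the server processed the request,
    so the decision rests on idempotency rather than on a status code.
    """
    if method.upper() in IDEMPOTENT_METHODS:
        return True
    # Non-idempotent method: retry only if the request carries a non-empty
    # Idempotency-Key header (header names are case-insensitive).
    return any(
        name.lower() == "idempotency-key" and bool(value)
        for name, value in (headers or {}).items()
    )
```

With this gate, a timed-out `GET` is retried, a bare `POST` is not, and a `POST` carrying an Idempotency-Key is retried with the same key so the server can deduplicate.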

Guardrails

The rule is a safe default, not a guarantee of safety. It must be paired with:

- Circuit breaker: retries during a mass-5xx outage will amplify load on the failing downstream and delay its recovery. The Zalando post is explicit: "always consider implementing circuit breakers when enabling retry. Retries that increase load can make matters significantly worse."
- Backoff + jitter: a tight retry loop on 5xx is itself a denial-of-service attack on the failing backend.
- Bounded retry count: 2–3 retries is typical; more than that extends wall-clock time without improving the success rate.
- Retry budget at the caller level: never let retries exceed some fraction (commonly 10%) of the original request volume; beyond that, the retries themselves are the load problem.
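Three of the guardrails above (backoff with jitter, a bounded retry count, and a caller-level retry budget) can be sketched together. Everything named here is hypothetical: `RetryBudget`, `backoff_delay`, and `call_with_retries` are illustrative names, the delay constants are arbitrary, the jitter variant is "full jitter" (one common choice, not necessarily the one the Zalando post uses), and `send` is assumed to be a callable that performs the request and returns the status code. The circuit breaker and timeout handling are deliberately left out to keep the sketch short.

```python
import random
import time
from typing import Callable


class RetryBudget:
    """Permit retries only while they stay within a fraction of original requests."""

    def __init__(self, ratio: float = 0.1) -> None:
        self.ratio = ratio      # e.g. 0.1 means retries may be at most 10% of requests
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        """Reserve one retry if the budget allows it."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


def backoff_delay(
    attempt: int,
    base: float = 0.1,
    cap: float = 2.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full-jitter delay: a random fraction of min(cap, base * 2**attempt) seconds."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(
    send: Callable[[], int],
    budget: RetryBudget,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send(), retrying 5xx results within the retry count and the budget."""
    budget.record_request()
    status = send()
    for attempt in range(max_retries):
        # Simplification: any status below 500 ends the loop.
        if status < 500 or not budget.try_spend():
            break
        sleep(backoff_delay(attempt))
        status = send()
    return status
```

The budget counters here never decay; a production version would more likely track them over a sliding window so that an old burst of traffic does not fund retries indefinitely.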

When `p99 ≈ p50`, don't retry

Zalando's post names a distributional test for retry applicability: if the downstream's p99 latency is close to its p50 (timeouts are periodic rather than occasional), retries don't help, because the slow case is the typical case and a retried request is about as likely to be slow as the original. Retries in that regime multiply load without improving the success rate.
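The test above can be approximated from a sample of observed latencies. This is a rough sketch: `tail_ratio` and `retries_likely_to_help` are hypothetical names, and the `min_ratio` threshold of 2.0 is an arbitrary assumption, since the text does not say how close "close" is.

```python
import statistics
from typing import Sequence


def tail_ratio(latencies_ms: Sequence[float]) -> float:
    """Return p99 / p50 for a sample of request latencies."""
    # quantiles(n=100) yields 99 cut points; index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[98] / cuts[49]


def retries_likely_to_help(latencies_ms: Sequence[float], min_ratio: float = 2.0) -> bool:
    """Heuristic: retries only pay off when the tail is well above the median.

    Assumption: min_ratio is an illustrative threshold, not a value from the source.
    """
    return tail_ratio(latencies_ms) >= min_ratio
```

A sample where nearly every request takes the same time gives a ratio near 1 and fails the check, while a sample with a fast median and a few slow outliers passes it.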

See concepts/tail-latency-at-scale for the distributional framing.

- concepts/idempotent-operations / concepts/non-idempotent-operations: the property that gates whether retrying 5xx is safe by default or requires an idempotency-key contract.
- concepts/exponential-backoff-jitter: the retry-scheduling strategy this pattern pairs with.
- concepts/tail-latency-at-scale: the distributional test for retry applicability.
- patterns/idempotency-key-header: the contract that makes retrying non-idempotent writes safe.
- patterns/circuit-breaker: the mandatory companion to avoid retry-amplified outages.
- patterns/explicit-timeout-on-remote-calls: the broader discipline this retry rule sits inside.

Seen in

— canonical wiki home. Zalando house-style client-side retry policy.