PATTERN Cited by 1 source
Retry on 5xx, not 4xx¶
Pattern¶
Retry client-side on 5xx responses and on timeouts. Do not retry on 4xx responses. The HTTP status-code class encodes whether the failure is likely transient (5xx — server reported it can't serve right now) or persistent (4xx — the request is wrong and the server has already told you so).
Zalando's timeouts post states the rule verbatim:
"Retry on timeout errors and 5xx errors. Do not retry on 4xx errors." (Source: sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts)
Why the rule works¶
5xx ≈ server-side, likely transient.
- 500 Internal Server Error — unhandled exception; may succeed
on retry if it was transient (DB hiccup, GC pause, instance
recycle).
- 502 Bad Gateway, 503 Service Unavailable,
504 Gateway Timeout — upstream is temporarily unavailable;
load balancer may pick a different backend next retry.
- 501 Not Implemented, 505 HTTP Version Not Supported —
these are persistent and should generally not be retried,
which is why the rule is a strong default rather than an
absolute law.
4xx ≈ client-side, persistent.
- 400 Bad Request — malformed input; will fail again.
- 401 Unauthorized / 403 Forbidden — credentials / policy;
will fail again.
- 404 Not Found — resource doesn't exist; will fail again.
- 422 Unprocessable Entity — semantic error; will fail again.
Exceptions worth naming:
- 408 Request Timeout — retryable by design (same failure
mode as a timeout).
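- 425 Too Early — the server declined to process a request
sent in TLS early data because it might be replayed;
retryable once the TLS handshake has completed.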
- 429 Too Many Requests — retryable with backoff, honoring
the Retry-After header when present.
Most client libraries that implement the rule default to "retry 5xx and timeouts; respect 429; never retry other 4xx."
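The classification above can be sketched as a small predicate plus a Retry-After parser. This is a minimal illustration, not any particular library's behavior: the names `is_retryable_status`, `retry_after_seconds`, `RETRYABLE_4XX`, and `NON_RETRYABLE_5XX` are hypothetical, and the two sets simply mirror the exceptions listed in this section.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

# Hypothetical sets mirroring the exceptions named in the text.
RETRYABLE_4XX = {408, 425, 429}
NON_RETRYABLE_5XX = {501, 505}


def is_retryable_status(status: int) -> bool:
    """Return True if the status is retryable under the 5xx-not-4xx rule."""
    if 500 <= status <= 599:
        return status not in NON_RETRYABLE_5XX
    return status in RETRYABLE_4XX


def retry_after_seconds(
    header_value: Optional[str], now: Optional[datetime] = None
) -> Optional[float]:
    """Convert a Retry-After value (delta-seconds or HTTP-date) to seconds.

    Returns None when the header is absent or unparseable.
    """
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a naive date as UTC; HTTP-dates are expressed in GMT.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A caller would typically check `is_retryable_status` first and, for a 429, wait at least `retry_after_seconds(...)` before the next attempt if it returns a value.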
Also retry: connection errors and timeouts¶
Zalando's rule names timeouts explicitly alongside 5xx. Expanding on the rationale: a timeout is the ambiguous failure case — the client doesn't know whether the request reached the server or not. The retry decision is governed by idempotency rather than by status code. For idempotent (or idempotency-keyed) operations, retrying timeouts is safe and captures transient network drops. For non-idempotent operations, see the Zalando post's companion discipline — use an Idempotency-Key header or do not retry.
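One way to express the idempotency gate for timeouts is shown below. It is a sketch under stated assumptions: `may_retry_after_timeout` is a hypothetical helper, the method set follows the HTTP definition of idempotent methods, and the presence of a non-empty Idempotency-Key header is taken as proof that the server honors it, which a real client would need to confirm per API.

```python
from typing import Mapping

# Methods that HTTP semantics define as idempotent.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}


def may_retry_after_timeout(method: str, headers: Mapping[str, str]) -> bool:
    """Decide whether a timed-out request is safe to send again.

    Idempotent methods are retried; other methods are retried only when
    the request carries a non-empty Idempotency-Key header.
    """
    if method.upper() in IDEMPOTENT_METHODS:
        return True
    return any(
        name.lower() == "idempotency-key" and bool(value)
        for name, value in headers.items()
    )
```

Note that a method being idempotent by specification does not guarantee the server implements it that way, so this check is a default rather than a proof of safety.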
Guardrails¶
The rule is a safe default, not a guarantee of safety. It must be paired with:
- Circuit breaker — retries during a mass-5xx outage will amplify load on the failing downstream and delay its recovery. The Zalando post is explicit: "always consider implementing circuit breakers when enabling retry. Retries that increase load can make matters significantly worse."
- Backoff + jitter — a tight retry loop on 5xx is itself a denial-of-service attack on the failing backend.
- Bounded retry count — 2–3 retries is typical; more than that extends wall-clock time without meaningfully improving the success rate.
- Retry budget at the caller level — never retry so much that retries exceed some fraction (commonly 10%) of the original request volume; beyond that, the retries are the load problem.
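Three of these guardrails (backoff with jitter, a bounded retry count, and a caller-level retry budget) can be sketched together as follows. Everything here is illustrative: `RetryBudget`, `backoff_delay`, and `call_with_retries` are hypothetical names, the base delay, cap, and 10% ratio are assumed defaults, `send` is assumed to be a callable that returns an HTTP status code, and the loop retries on any 5xx for brevity. The circuit breaker is deliberately left out.

```python
import random
import time
from typing import Callable


class RetryBudget:
    """Permit retries only while they stay under `ratio` of original requests."""

    def __init__(self, ratio: float = 0.1) -> None:
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        """Reserve one retry; return False if that would exceed the budget."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 2.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(
    send: Callable[[], int],
    budget: RetryBudget,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send`, retrying 5xx results at most `max_retries` times within budget."""
    budget.record_request()
    status = send()
    for attempt in range(max_retries):
        # Stop on a non-5xx result or when the budget refuses another retry.
        if not (500 <= status <= 599) or not budget.try_spend():
            break
        sleep(backoff_delay(attempt))
        status = send()
    return status
```

The budget counters here never decay; a production version would more likely track them over a sliding time window so that old traffic does not fund new retries.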
When p99 ≈ p50, don't retry¶
Zalando's post names a distributional test for retry applicability: if the downstream's p99 latency is close to its p50 (timeouts recur regularly rather than occasionally), retries don't help, because a retried request is about as likely to be slow as the original. Retries in that regime multiply load without improving the success rate.
See concepts/tail-latency-at-scale for the distributional framing.
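The p99-versus-p50 test can be approximated from a sample of observed latencies. The sketch below is an assumption-laden illustration: `tail_ratio` and `retries_likely_help` are hypothetical names, and the `min_ratio` threshold of 2.0 is an arbitrary placeholder rather than a figure from the Zalando post. It assumes strictly positive latencies and at least two samples.

```python
import statistics
from typing import Sequence


def tail_ratio(latencies_ms: Sequence[float]) -> float:
    """Return p99 / p50 for a sample of positive latencies."""
    # quantiles(n=100) yields 99 cut points; index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return cuts[98] / cuts[49]


def retries_likely_help(latencies_ms: Sequence[float], min_ratio: float = 2.0) -> bool:
    """Heuristic: retries are worthwhile only when the tail is far from the median."""
    return tail_ratio(latencies_ms) >= min_ratio
```

For example, a sample where every request takes about the same time yields a ratio near 1 and the heuristic advises against retrying, while a sample with a small fraction of slow outliers yields a large ratio.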
Related¶
- concepts/idempotent-operations / concepts/non-idempotent-operations — the property that gates whether retrying 5xx is safe by default or requires an idempotency-key contract.
- concepts/exponential-backoff-jitter — the retry-scheduling strategy this pattern pairs with.
- concepts/tail-latency-at-scale — the distributional test for retry applicability.
- patterns/idempotency-key-header — the contract that makes retrying non-idempotent writes safe.
- patterns/circuit-breaker — the mandatory companion to avoid retry-amplified outages.
- patterns/explicit-timeout-on-remote-calls — the broader discipline this retry rule sits inside.
Seen in¶
- sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts — canonical wiki home. Zalando house-style client-side retry policy.