
CONCEPT Cited by 1 source

Exponential backoff with jitter

Definition

Exponential backoff with jitter is the retry-scheduling strategy that pairs two disciplines:

  1. Exponential backoff — delay between successive retry attempts grows geometrically (e.g. 100 ms → 200 ms → 400 ms → 800 ms), reducing load on a failing downstream instead of hammering it on a fixed interval.
  2. Jitter — the per-retry delay is randomised within each backoff window, so a fleet of retrying clients does not synchronise into a retry storm.

The Zalando timeouts post recommends the two together:

"Implementing exponential backoff can be an effective retry strategy. It involves increasing the delay between each retry attempt exponentially, reducing the load on the failing service and preventing overwhelming it with repeated requests. Here is a fantastic blog on how AWS SDKs support exponential backoff and jitter as a part of their retry behaviour." (Source: sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts)

Named reference: AWS Architecture Blog — Exponential Backoff And Jitter. The strategy is built into AWS SDKs by default.

Why exponential backoff alone is not enough

Pure exponential backoff without jitter still synchronises a fleet. If 10,000 clients all fail at the same instant and each waits exactly 100 ms, 200 ms, 400 ms, … they all retry at the same moments, so every retry wave preserves the thundering-herd shape that caused the initial overload. The failing service sees successive coordinated waves of 10,000 requests instead of a gentle spread.

Jitter de-correlates the fleet: each client's delay is sampled uniformly from [0, backoff_window] (full jitter) or from the upper half of the window, [backoff_window / 2, backoff_window] (equal jitter). The AWS blog argues for full jitter as the variant with the best fleet-level behaviour under worst-case contention.
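The synchronisation effect can be illustrated with a small simulation. This is a minimal sketch, not taken from the Zalando post or the AWS blog: the function names, the 1,000-client fleet size, and the 10 ms bucket width are all assumptions chosen for illustration. It counts how many retries land in the same time slot with and without full jitter.

```python
import random
from collections import Counter

def retry_times_ms(clients, attempts, base_ms, jitter, rng):
    """Return the time (ms after the initial failure) of every retry in the fleet.

    Assumes all clients fail together at t = 0 and that every retry fails too.
    """
    times = []
    for _ in range(clients):
        t = 0.0
        for n in range(attempts):
            window = base_ms * 2 ** n
            # Full jitter samples uniformly from [0, window]; no jitter waits the whole window.
            delay = rng.uniform(0, window) if jitter else window
            t += delay
            times.append(t)
    return times

def peak_per_bucket(times, bucket_ms=10):
    """Largest number of retries that fall into any single bucket_ms-wide slot."""
    buckets = Counter(int(t // bucket_ms) for t in times)
    return max(buckets.values())

rng = random.Random(42)  # fixed seed so the run is repeatable
no_jitter = retry_times_ms(1000, 4, 100, jitter=False, rng=rng)
full_jitter = retry_times_ms(1000, 4, 100, jitter=True, rng=rng)

print("peak retries per 10 ms, no jitter:  ", peak_per_bucket(no_jitter))   # 1000
print("peak retries per 10 ms, full jitter:", peak_per_bucket(full_jitter))
```

Without jitter every client retries at exactly 100 ms, 300 ms, 700 ms and 1,500 ms, so the peak equals the fleet size. With full jitter the same number of retries is spread across the windows and the peak is a small fraction of that.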

Jitter variants

Variant              Delay for attempt n
No jitter            base × 2^n (synchronises fleet)
Equal jitter         base × 2^n / 2 + random(0, base × 2^n / 2)
Full jitter          random(0, base × 2^n)
Decorrelated jitter  random(base, min(cap, prev × 3))

Full jitter and decorrelated jitter are the two most cited at production scale.
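The formulas in the table translate almost directly into code. The sketch below is illustrative only: the function names are hypothetical, delays are in milliseconds, n is assumed to start at 0, and prev is assumed to be the previous delay, seeded with base for the first retry.

```python
import random

def no_jitter(base, n):
    """Deterministic delay: base × 2^n."""
    return base * 2 ** n

def equal_jitter(base, n, rng=random):
    """Half of the window is fixed, the other half is random."""
    half = base * 2 ** n / 2
    return half + rng.uniform(0, half)

def full_jitter(base, n, rng=random):
    """Uniform over the whole window [0, base × 2^n]."""
    return rng.uniform(0, base * 2 ** n)

def decorrelated_jitter(base, cap, prev, rng=random):
    """Depends on the previous delay rather than the attempt number."""
    return rng.uniform(base, min(cap, prev * 3))

if __name__ == "__main__":
    base, cap = 100, 20_000  # milliseconds; example values only
    prev = base
    for n in range(5):
        prev = decorrelated_jitter(base, cap, prev)
        print(n, no_jitter(base, n), round(equal_jitter(base, n)),
              round(full_jitter(base, n)), round(prev))
```

Only the decorrelated row carries a cap in the table; in practice the other variants are usually clamped to a cap as well, which is left out here to stay close to the formulas as written.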

Parameters that matter

  • Base delay — starting point; typically 50–200 ms for API retries.
  • Cap — upper bound on the backoff window, preventing runaway delays on long outages (common: 20–60 s).
  • Max retries — bounds total retry wall time; must fit within the caller's SLA if the caller has one.
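  • Per-call vs per-chain budget — retries compound across chained calls: when each hop in a call chain retries independently, the worst-case delays add up and the attempts multiply. The total retry time of the whole chain must fit within the outermost caller's SLA or request timeout.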
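The parameters above can be combined in a single retry helper. The following is a minimal sketch using full jitter; call_with_backoff, RetryBudgetExceeded, and the budget parameter are hypothetical names, the default values are examples picked from the ranges listed above, and a real client would retry only errors it knows to be transient.

```python
import random
import time

class RetryBudgetExceeded(Exception):
    """Raised when the next delay would overrun the overall time budget."""

def call_with_backoff(operation, *, base=0.1, cap=30.0, max_retries=5,
                      budget=10.0, sleep=time.sleep, clock=time.monotonic,
                      rng=random):
    """Call operation(), retrying failures with capped full-jitter backoff.

    base, cap and budget are in seconds. budget bounds the total wall time
    spent on this call, so it can be set from the caller's own timeout.
    """
    deadline = clock() + budget
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception as exc:  # assumption: every exception is retryable
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            window = min(cap, base * 2 ** attempt)
            delay = rng.uniform(0, window)
            if clock() + delay > deadline:
                raise RetryBudgetExceeded("retry budget exhausted") from exc
            sleep(delay)
```

Passing sleep and clock as parameters is a design choice for testability: a test can substitute a recording function for sleep and run without real waiting. In a chain of services, each hop would pass down a budget smaller than the time it has left itself.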

Pairing with circuit breakers

The Zalando post is emphatic that backoff-with-jitter is not a replacement for a circuit breaker:

"Circuit breaker: always consider implementing circuit breakers when enabling retry. When failures are rare, that's not a problem. Retries that increase load can make matters significantly worse."

Backoff slows and spreads out retries, reducing the load they add; a circuit breaker stops sending requests altogether to a downstream that is persistently failing. Both are needed.
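To make the division of labour concrete, here is a minimal single-threaded circuit breaker sketch. It is not the Zalando implementation or any particular library's API: the class name, the failure_threshold and reset_timeout parameters, and their defaults are assumptions for illustration.

```python
import time

class CircuitOpenError(Exception):
    """Raised instead of calling the downstream while the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the downstream.
                raise CircuitOpenError("circuit open; call skipped")
            # reset_timeout has elapsed: fall through and allow a trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call re-opens the circuit; otherwise open at the threshold.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A retry loop would wrap breaker.call(operation) and treat CircuitOpenError as non-retryable, so that once the breaker opens the client stops retrying instead of backing off against a downstream that is known to be down.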

Seen in
