
Thread pool exhaustion

Definition

Thread pool exhaustion is the failure mode where every worker thread in an application's request-handling pool is blocked waiting on a slow or unresponsive downstream. New incoming requests queue, time out, or are rejected: the caller can still accept work, but no thread is free to complete it.

It is the canonical cascading-failure mechanism behind "one slow downstream took out the whole service."

The Zalando framing

The Zalando timeouts post names this explicitly as the load-bearing reason for aggressive explicit timeouts:

"While a client is waiting for a response, various resources are being utilised: threads, https connections, database connections, etc. Even if the client has closed the connection, without a proper timeout configuration the request is still being processed on your side, which means that resources are busy."

"Remember, when you increase timeouts you potentially decrease the throughput of your application!"

"Using infinite timeout or very high timeout is a bad strategy. For a while, you won't see the problem until one of your downstream services gets stuck and your thread pool gets exhausted." (Source: sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts)

The load-bearing insight: the default timeout sets the maximum time a caller can be blocked by a single dead downstream. Infinite or high defaults turn a single downstream failure into a caller-side outage.

The mechanism

  1. Downstream D becomes slow or unresponsive.
  2. Caller C starts a request; the handler thread blocks on D.
  3. New requests arrive at C; each takes a thread and blocks.
  4. Eventually all threads in C's pool are blocked on D.
  5. C stops serving unrelated traffic — requests for other dependencies fail because no handler thread is free.
  6. C's callers see C as dead and the cascade propagates upward.

Without an explicit timeout the only release event is D responding. If D is truly dead, C's pool never recovers on its own.
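The mechanism above can be reproduced in a few lines. This is a minimal sketch, not production code: the pool size, the never-set event standing in for a dead downstream D, and the 0.5 s cap (there only so the demo terminates) are all assumptions for illustration.

```python
import concurrent.futures
import threading
import time

POOL_SIZE = 4
downstream_dead = threading.Event()  # never set: stands in for a stuck downstream D

def call_dead_downstream():
    # Steps 2-4: the worker thread blocks here. Without a timeout, D responding
    # would be the only release event; the 0.5 s cap just lets the demo end.
    downstream_dead.wait(timeout=0.5)
    return "no response"

def unrelated_fast_work():
    # A request for a healthy, unrelated dependency.
    return "ok"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=POOL_SIZE)

# Steps 3-4: enough stuck requests arrive to occupy every worker thread.
stuck = [pool.submit(call_dead_downstream) for _ in range(POOL_SIZE)]

# Step 5: the unrelated request must now queue behind the stuck ones.
t0 = time.monotonic()
result = pool.submit(unrelated_fast_work).result()
queued_for = time.monotonic() - t0  # ~0.5 s spent waiting for a free thread

pool.shutdown(wait=True)
print(result, round(queued_for, 1))
```

The fast task itself takes microseconds; nearly all of `queued_for` is time spent waiting for any worker to come free, which is exactly what C's callers observe as C being slow or dead.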

Why it's a throughput problem, not just an error problem

Zalando's phrasing — "when you increase timeouts you potentially decrease the throughput of your application" — is worth unpacking. Throughput (req/s) is bounded by Little's law:

Throughput = ConcurrentInFlight / AvgLatency

If ConcurrentInFlight is capped by pool size, doubling AvgLatency halves throughput. A 30 s timeout on a typically 100 ms operation means every stalled request consumes 300× the normal thread-seconds. Even a small fraction of stalled requests dominates the pool's capacity.
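The arithmetic is worth doing explicitly. A back-of-envelope sketch with assumed numbers (a 200-thread pool, 100 ms typical latency, a 30 s timeout, 1% of requests stalling):

```python
pool_size = 200        # ConcurrentInFlight cap (assumed)
normal_latency = 0.1   # 100 ms typical operation
stall_latency = 30.0   # a stalled request held to the full 30 s timeout

healthy_throughput = pool_size / normal_latency  # Little's law: 2000 req/s

# Suppose just 1% of requests stall at the full timeout.
stalled_fraction = 0.01
avg_latency = (1 - stalled_fraction) * normal_latency + stalled_fraction * stall_latency
degraded_throughput = pool_size / avg_latency

print(healthy_throughput)          # 2000.0 req/s
print(round(avg_latency, 3))       # 0.399 s: 1% of stalls quadruples average latency
print(round(degraded_throughput))  # 501 req/s, a ~75% throughput loss
```

One stalled request in a hundred consumes 300 thread-seconds for every 100 ms of useful work, which is why the tail dominates the pool's capacity long before the pool is visibly "full".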

Thread pool exhaustion usually co-occurs with concepts/connection-pool-exhaustion, HTTPS / TCP connection-slot exhaustion, and database-connection exhaustion — all of which are downstream of the same "unbounded wait holds a finite resource" pattern. Fixing one without the others merely shifts the bottleneck.

Mitigations

  • Explicit timeouts on every remote call — the Zalando patterns/explicit-timeout-on-remote-calls rule. Bounds the worst-case pool-thread holding time.
  • Bulkheads — isolate thread pools per dependency so a slow dependency can't starve unrelated traffic.
  • Circuit breakers — stop issuing requests to a dependency after an error threshold, freeing pool capacity to handle other work.
  • Backpressure / load-shedding — reject new requests at the edge once the in-flight count approaches pool capacity, rather than queueing them.
