CONCEPT Cited by 1 source
Thread pool exhaustion¶
Definition¶
Thread pool exhaustion is the failure mode in which every worker thread in an application's request-handling pool is blocked waiting on a slow or unresponsive downstream. New incoming requests queue, time out, or are rejected: the caller can still accept work, but it has no free thread on which to complete it.
It is the canonical cascading-failure mechanism behind "one slow downstream took out the whole service."
The Zalando framing¶
The Zalando timeouts post names this explicitly as the load-bearing reason for aggressive explicit timeouts:
"While a client is waiting for a response, various resources are being utilised: threads, https connections, database connections, etc. Even if the client has closed the connection, without a proper timeout configuration the request is still being processed on your side, which means that resources are busy."
"Remember, when you increase timeouts you potentially decrease the throughput of your application!"
"Using infinite timeout or very high timeout is a bad strategy. For a while, you won't see the problem until one of your downstream services gets stuck and your thread pool gets exhausted." (Source: sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts)
The key insight: the configured timeout is the upper bound on how long a single dead downstream can block a caller thread. Infinite or very high values, which some client libraries ship as defaults, turn a single downstream failure into a caller-side outage.
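A minimal sketch of how a timeout bounds blocking time, assuming a plain TCP client in Python. The silent listener stands in for a stuck downstream, and `call_with_timeout` is a hypothetical helper rather than anything from the Zalando post.

```python
import socket
import time

def call_with_timeout(host: str, port: int, timeout_s: float) -> bytes:
    """Hypothetical helper: each blocking socket operation waits at most timeout_s."""
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        conn.sendall(b"ping\n")
        return conn.recv(1024)  # raises socket.timeout if the peer stays silent

# Stand-in for a stuck downstream: it listens but never accepts or replies.
silent = socket.socket()
silent.bind(("127.0.0.1", 0))
silent.listen(1)
host, port = silent.getsockname()

start = time.monotonic()
try:
    call_with_timeout(host, port, timeout_s=0.2)
    outcome = "response"
except socket.timeout:
    outcome = "timed out"
elapsed = time.monotonic() - start
silent.close()

print(outcome, f"after {elapsed:.2f}s")
```

The socket timeout here applies to each blocking operation separately, not to the call as a whole, so a total per-request deadline would need extra bookkeeping. Without the `timeout` argument, `recv` would block for as long as the downstream stays silent.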
The mechanism¶
- Downstream D becomes slow or unresponsive.
- Caller C starts a request; the handler thread blocks on D.
- New requests arrive at C; each takes a thread and blocks.
- Eventually all threads in C's pool are blocked on D.
- C stops serving unrelated traffic: requests that never touch D also fail, because no handler thread is free to run them.
- C's callers see C as dead and the cascade propagates upward.
Without an explicit timeout, the only release event is D responding or the connection failing at the transport layer. If D hangs without closing its connections, C's pool never recovers on its own.
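The steps above can be reproduced in miniature. This is a sketch under stated assumptions: a `threading.Event` stands in for downstream D, the pool size of 4 is arbitrary, and the handler names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait

POOL_SIZE = 4                       # arbitrary size, for illustration only
pool = ThreadPoolExecutor(max_workers=POOL_SIZE)
d_responds = threading.Event()      # set only when "D" finally answers

def handler_calling_d() -> str:
    d_responds.wait()               # no timeout: blocks until D responds
    return "served via D"

def handler_unrelated() -> str:
    return "served without D"       # never touches D

# Enough D-bound requests arrive to occupy every worker thread.
blocked = [pool.submit(handler_calling_d) for _ in range(POOL_SIZE)]

# A request that does not depend on D still needs a thread to run on.
unrelated = pool.submit(handler_unrelated)
done, _ = wait([unrelated], timeout=0.2)
starved = len(done) == 0            # the request is still queued

d_responds.set()                    # D answers, releasing every thread
recovered = unrelated.result(timeout=2)
pool.shutdown()

print("unrelated request starved while D was silent:", starved)
print("after D responded:", recovered)
```

Passing a timeout to `d_responds.wait()` would give each thread a second release event, which is the role an explicit timeout plays on a real remote call.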
Why it's a throughput problem, not just an error problem¶
Zalando's phrasing — "when you increase timeouts you potentially decrease the throughput of your application" — is worth unpacking. Throughput (req/s) is bounded by Little's law:
Throughput = ConcurrentInFlight / AvgLatency
If ConcurrentInFlight is capped by the pool size, doubling AvgLatency halves the maximum throughput. With a 30 s timeout on an operation that normally takes 100 ms, each stalled request holds its thread for up to 300× the normal thread-seconds, so even a small fraction of stalled requests dominates the pool's capacity.
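A worked example of that arithmetic, as a sketch. The pool size of 200 and the 1% stall rate are assumed values chosen for illustration; only the 100 ms and 30 s figures come from the text above.

```python
def avg_latency_s(normal_s: float, timeout_s: float, stalled_fraction: float) -> float:
    """Mean thread-holding time when a fraction of requests stall until the timeout."""
    return (1 - stalled_fraction) * normal_s + stalled_fraction * timeout_s

def max_throughput(pool_size: int, latency_s: float) -> float:
    """Little's law rearranged: throughput = concurrent in flight / average latency."""
    return pool_size / latency_s

POOL_SIZE = 200     # assumed pool size
NORMAL_S = 0.1      # typical 100 ms operation
TIMEOUT_S = 30.0    # 30 s timeout
STALLED = 0.01      # assumed: 1% of requests stall until the timeout fires

healthy = max_throughput(POOL_SIZE, avg_latency_s(NORMAL_S, TIMEOUT_S, 0.0))
degraded = max_throughput(POOL_SIZE, avg_latency_s(NORMAL_S, TIMEOUT_S, STALLED))
# Share of all thread-seconds that the stalled 1% of requests consume.
stalled_share = STALLED * TIMEOUT_S / avg_latency_s(NORMAL_S, TIMEOUT_S, STALLED)

print(f"healthy:  {healthy:.0f} req/s")
print(f"degraded: {degraded:.0f} req/s")
print(f"thread-seconds held by stalled requests: {stalled_share:.0%}")
```

Under these assumptions the ceiling drops from about 2000 req/s to about 501 req/s, and the stalled 1% of requests account for roughly 75% of all thread-seconds.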
Related exhaustion modes¶
Thread pool exhaustion usually co-occurs with concepts/connection-pool-exhaustion, HTTPS / TCP connection-slot exhaustion, and database-connection exhaustion. All of them are instances of the same "unbounded wait holds a finite resource" pattern, so fixing one without the others merely shifts the bottleneck.
Mitigations¶
- Explicit timeouts on every remote call — the Zalando patterns/explicit-timeout-on-remote-calls rule. Bounds the worst-case pool-thread holding time.
- Bulkheads — isolate thread pools per dependency so a slow dependency can't starve unrelated traffic.
- Circuit breakers — stop issuing requests to a dependency after an error threshold, freeing pool capacity to handle other work.
- Backpressure / load-shedding — reject new requests at the edge once the in-flight count approaches pool capacity, rather than queueing them.
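One way to combine the bulkhead and load-shedding ideas is a per-dependency concurrency cap that rejects instead of queueing. The sketch below uses a semaphore rather than the separate thread pools described in the bulkhead bullet. The `Bulkhead` class, the `BulkheadFull` exception, and the dependency names are all hypothetical.

```python
import threading

class BulkheadFull(Exception):
    """Raised when a dependency's concurrency budget is already in use."""

class Bulkhead:
    """Caps concurrent calls to one dependency; sheds the excess immediately."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args):
        if not self._slots.acquire(blocking=False):   # reject, do not queue
            raise BulkheadFull(self.name)
        try:
            return fn(*args)
        finally:
            self._slots.release()

release = threading.Event()          # stands in for the slow dependency answering
entered = threading.Semaphore(0)     # signals that a slow call holds a slot

def slow_call() -> str:
    entered.release()
    release.wait(timeout=5)
    return "slow done"

payments = Bulkhead("payments", max_concurrent=2)   # hypothetical dependencies
catalog = Bulkhead("catalog", max_concurrent=2)

workers = [threading.Thread(target=payments.run, args=(slow_call,)) for _ in range(2)]
for w in workers:
    w.start()
for _ in workers:
    entered.acquire()                # wait until both payments slots are held

try:
    payments.run(lambda: "ok")
    third = "accepted"
except BulkheadFull:
    third = "rejected"

other = catalog.run(lambda: "catalog ok")   # unaffected by the slow dependency

release.set()
for w in workers:
    w.join()

print("third payments call:", third)
print("catalog call:", other)
```

Rejecting immediately keeps the slow dependency from holding more than its budget of caller threads. The calls that were admitted still need their own timeouts to be released.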
Seen in¶
- sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts — canonical Zalando framing naming thread-pool exhaustion as the load-bearing reason to set explicit timeouts; worked through with the "infinite Java HttpClient default" example.
Related¶
- concepts/connection-pool-exhaustion — sibling failure mode; typically co-occurs.
- concepts/connection-timeout / concepts/request-timeout — the bounds that prevent a single dead downstream from consuming the pool.
- concepts/fail-fast-principle — the design principle that motivates aggressive timeout sizing.
- concepts/cascading-failure — thread-pool exhaustion is the dominant cascading-failure mechanism.
- patterns/circuit-breaker — stops exhausting the pool against a known-bad dependency.
- patterns/explicit-timeout-on-remote-calls — the house-style rule that bounds per-request pool holding time.