Skip to content

PATTERN Cited by 1 source

Explicit timeout on remote calls

Pattern

Set both the connection timeout and the request timeout explicitly on every remote call. Never rely on library defaults; never leave a timeout unset on the theory that "it probably won't matter." Every outbound HTTP, gRPC, database, and RPC client gets both numbers written down, checked into source, and reviewed alongside the downstream it targets.

Zalando's timeouts post names this as the house-style rule:

"The default timeout is your enemy, always set timeouts explicitly!" (Source: sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts)

Motivation

Library defaults are chosen for maximum compatibility, not for production use. The Zalando post cites the canonical anti-example:

"For example for native java HttpClient the default connection/request timeout is infinite, which is unlikely within your SLA :)"

Other common offenders: - libcurl default connect-timeout of 300 s. - Java URLConnection default of infinite on most JVMs. - Many database drivers default to no query timeout. - requests.get(url) in Python has no default timeout.

A service inheriting any of these will, eventually, be taken out by a single stuck downstream: see concepts/thread-pool-exhaustion and concepts/connection-pool-exhaustion.

Shape

Two timeouts, sized independently:

Sizing the two together is a named anti-pattern: connection establishment and downstream work are bounded by different physical processes and should be configured separately.

Enforcement

  • Code review: every new outbound client gets a timeout pair explicitly in the reviewer's checklist.
  • Static analysis: lint rules flag client constructions that omit timeout configuration.
  • Integration tests: test that a dead downstream causes a fast failure (bounded wall-clock) rather than hanging.
  • Template / scaffolding: service templates ship with opinionated timeouts already set so new services inherit the discipline.

Trade-offs

Explicit timeouts surface latency decisions into code-review conversations rather than leaving them as accidental consequences of library defaults. The cost is that timeout numbers must be maintained — when a downstream's real p99.9 shifts, the caller's timeout must be re-visited.

Seen in

Last updated · 550 distilled / 1,221 read