PATTERN Cited by 1 source

Connection timeout = RTT × 3

Pattern

Size the connection timeout as roughly three times the expected round-trip time between caller and server. It is a conservative default that covers the TCP three-way handshake (~1 RTT) with enough margin to absorb transient jitter and service-startup race conditions, without pinning the caller on a dead peer for an arbitrary library-default duration.
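As a minimal sketch of the arithmetic, the heuristic is a single multiplication. The function name `connection_timeout_ms` and its parameters are hypothetical, chosen only for illustration; the RTT figures are the ones listed under "Reference RTTs" further down this page.

```python
def connection_timeout_ms(rtt_ms: float, multiplier: float = 3.0) -> float:
    """Return a connection timeout in milliseconds as multiplier x expected RTT."""
    if rtt_ms <= 0 or multiplier <= 0:
        raise ValueError("rtt_ms and multiplier must be positive")
    return rtt_ms * multiplier


# Approximate RTTs (ms) taken from the reference table on this page.
reference_rtts_ms = {
    "NYC <-> SF on fibre": 42.0,
    "local machine -> recommended AWS region": 28.0,
    "NYC <-> Sydney": 160.0,
}

for route, rtt in reference_rtts_ms.items():
    print(f"{route}: RTT {rtt:.0f} ms, connect timeout {connection_timeout_ms(rtt):.0f} ms")
```

The multiplier is left as a parameter because the source presents × 3 as a tunable default rather than a fixed rule.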

Zalando's timeouts post canonicalises the heuristic:

"You can setup a connection timeout which is some multiple of your expected RTT. Connection timeout = RTT × 3 is commonly used as a conservative approach, but you can adjust it based on your specific needs." (Source: sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts)

Why RTT, not operation latency

A connection timeout bounds only the TCP three-way handshake, a protocol event dominated by network propagation. The work the server does after the handshake is bounded by the separate request timeout. Sizing the connection timeout from operation latency (e.g. "set the connect timeout to match the query timeout") blurs the two and loses the diagnostic signal that distinguishes "couldn't reach the server" from "server is slow."
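One way to keep the two phases apart, sketched here with Python's standard `socket` module. The function name `fetch_with_split_timeouts` and the status strings are hypothetical; the sketch assumes a plain TCP request/response exchange and is not taken from the Zalando post.

```python
import socket


def fetch_with_split_timeouts(host, port, payload,
                              connect_timeout_s, request_timeout_s):
    """Return (status, data); status is 'ok', 'unreachable', or 'slow'."""
    try:
        # This timeout bounds each TCP connection attempt (the handshake).
        sock = socket.create_connection((host, port), timeout=connect_timeout_s)
    except OSError:
        # Covers connect timeouts and refusals: the server could not be reached.
        return "unreachable", b""

    with sock:
        # From here on the timeout bounds each blocking send/recv call,
        # i.e. how long we wait on the server's work, not on the handshake.
        sock.settimeout(request_timeout_s)
        try:
            sock.sendall(payload)
            data = sock.recv(4096)
        except socket.timeout:
            # Connected fine, but the server did not answer in time.
            return "slow", b""
    return "ok", data
```

Note that `settimeout` applies per socket operation, so a response read in several `recv` calls would need an overall deadline on top of it. Name resolution inside `create_connection` is also not bounded by the connect timeout.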

Reference RTTs

From the Zalando post:

| Route | RTT (approx) | Connection timeout (RTT × 3) |
|---|---|---|
| Same data center / same AWS region | < 1 ms | 3–10 ms |
| NYC ↔ SF on fibre | 42 ms | 126 ms |
| Author's local machine → recommended AWS region | 28 ms | 84 ms |
| NYC ↔ Sydney | 160 ms | 480 ms |

Mobile clients, VPN tunnels, and congested paths produce higher and more variable RTTs; the × 3 multiplier should be adjusted upward when the underlying RTT distribution is itself noisy.
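When the RTT distribution is noisy, one option is to feed the heuristic a high percentile of measured RTTs instead of the mean. The sketch below is an assumption about how that could be done, not something the source prescribes; `rtt_for_sizing` and the choice of the 95th percentile are hypothetical.

```python
import statistics


def rtt_for_sizing(samples_ms, percentile=95):
    """Pick a high-percentile RTT (ms) from measured samples."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two RTT samples")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points p1..p99; "inclusive"
    # keeps every result within the observed min and max.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[percentile - 1]


# Illustrative, made-up samples: mostly fast, with a slow tail.
samples = [20.0] * 90 + [200.0] * 10

mean_based = 3 * statistics.mean(samples)   # sized from the mean RTT
tail_based = 3 * rtt_for_sizing(samples)    # sized from the p95 RTT
print(f"mean-based: {mean_based:.0f} ms, p95-based: {tail_based:.0f} ms")
```

With a bimodal sample like this, the mean-based timeout falls below the slow mode's RTT, so every connection on the slow path would time out; the percentile-based value does not.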

Why × 3 specifically

The Zalando post does not derive the multiplier; it is presented as "commonly used as a conservative approach" and tunable. Intuitive decomposition:

  • × 1: covers a perfectly clean handshake.
  • × 2: nominally covers one SYN retransmit, but the initial SYN retransmission timeout is about 1 s on most OSes regardless of RTT, so for short RTTs a sub-second timeout expires before the retransmitted SYN is even sent.
  • × 3: adds margin for transient packet loss, service startup (accept queue drain), and service restart windows.

For same-DC traffic the multiplier can be raised further — 10 × 1 ms is still 10 ms, well under any user-noticeable delay. For cross-continent traffic, × 3 may be the right ceiling because going higher eats too far into an SLA.
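The interaction between the multiplier and SYN retransmission can be checked with a little arithmetic. This sketch assumes an initial retransmission timeout of 1 s that doubles on each retry (the RFC 6298 recommendation, and an assumption about the client OS rather than a figure from the source); both function names are hypothetical.

```python
def handshake_time_s(rtt_s, lost_syns, initial_rto_s=1.0):
    """Estimate time to complete the handshake when the first `lost_syns` SYNs are lost.

    Assumes the retransmission timeout starts at `initial_rto_s` and doubles
    after every retry, then one clean round trip completes the connect.
    """
    waiting = sum(initial_rto_s * 2 ** i for i in range(lost_syns))
    return waiting + rtt_s


def survives(timeout_s, rtt_s, lost_syns, initial_rto_s=1.0):
    """True if the connect would finish before the timeout fires."""
    return handshake_time_s(rtt_s, lost_syns, initial_rto_s) <= timeout_s


rtt = 0.042  # NYC <-> SF figure from the reference table, in seconds
for timeout in (3 * rtt, 1.5):
    for lost in (0, 1):
        print(f"timeout={timeout:.3f}s lost_syns={lost} ok={survives(timeout, rtt, lost)}")
```

Under these assumptions a 126 ms timeout covers a clean handshake but not a single lost SYN, which is why the multiplier is better read as jitter margin than as retransmission cover on short paths.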

When to deviate

  • Cold-starting lambdas / serverless: the handshake may stall while the instance cold-starts; this may need × 5 or an explicit fallback.
  • Routes with known SYN-cookie behaviour: may need extra margin for server-side validation.
  • Services behind a VIP / load balancer with active health checks: the typical handshake is faster than in the bare-server case, so × 2 may suffice.
  • Mobile clients: RTT is itself bimodal (Wi-Fi vs. cellular); the multiplier should be applied to the worse case, not the mean.

Anti-pattern: one "socket timeout" for both phases

A common mistake, explicitly called out in the Zalando post, is to set a single "socket timeout" that covers both connection and request, in the belief that it is simpler. Whichever phase the value is sized for, the other one suffers:

  • Connection timeout is too high → unreachable peers consume pool threads for the whole operation timeout.
  • Request timeout is too low → real server work gets cut off because the socket timeout is sized for handshake.

Sizing them separately costs no code and buys back the diagnostic signal.

