PATTERN Cited by 1 source
Connection timeout = RTT × 3¶
Pattern¶
Size the connection timeout as roughly three times the expected round-trip time between caller and server. It is a conservative default that covers the TCP three-way handshake (~1 RTT) with enough margin to absorb transient jitter and service-startup race conditions, without pinning the caller on a dead peer for an arbitrary library-default duration.
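The rule is plain arithmetic and can be sketched in a few lines. The function name `connection_timeout_ms` and its parameters are illustrative, not taken from any library or from the Zalando post.

```python
def connection_timeout_ms(rtt_ms: float, multiplier: float = 3.0) -> float:
    """Return a connection timeout sized as a multiple of the expected RTT.

    Hypothetical helper: the default multiplier of 3 mirrors the pattern,
    and callers are expected to tune it for their own routes.
    """
    if rtt_ms <= 0 or multiplier <= 0:
        raise ValueError("rtt_ms and multiplier must be positive")
    return rtt_ms * multiplier


if __name__ == "__main__":
    # NYC <-> SF on fibre, using the 42 ms RTT quoted on this page.
    print(connection_timeout_ms(42))   # 126.0
    # NYC <-> Sydney, using the 160 ms RTT quoted on this page.
    print(connection_timeout_ms(160))  # 480.0
```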
Zalando's timeouts post canonicalises the heuristic:
"You can setup a connection timeout which is some multiple of your expected RTT. Connection timeout = RTT × 3 is commonly used as a conservative approach, but you can adjust it based on your specific needs." (Source: sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts)
Why RTT, not operation latency¶
A connection timeout bounds only the TCP three-way handshake, a protocol event dominated by network propagation. The work the server does after the handshake is bounded by the separate request timeout. Sizing the connection timeout from operation latency (e.g. "set the connect timeout to match the query timeout") blurs the two and loses the diagnostic signal that tells "couldn't reach the server" apart from "server is slow."
Reference RTTs¶
From the Zalando post:
| Route | RTT (approx) | Connection timeout (RTT × 3) |
|---|---|---|
| Same data center / same AWS region | < 1 ms | 3–10 ms |
| NYC ↔ SF on fibre | 42 ms | 126 ms |
| Author's local machine → recommended AWS region | 28 ms | 84 ms |
| NYC ↔ Sydney | 160 ms | 480 ms |
Mobile clients, VPN tunnels, and congested paths produce
higher and more variable RTTs; the × 3 multiplier should be
adjusted upward when the underlying RTT distribution is itself
noisy.
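One way to account for a noisy RTT distribution is to size the timeout from a high percentile of measured RTTs rather than from a typical value, which amounts to a larger effective multiplier on the median. This is an assumption of this sketch, not something the Zalando post prescribes; the helper names and the 95th-percentile default are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sequence of numbers."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def timeout_from_samples(rtt_samples_ms, multiplier=3.0, pct=95):
    """Apply the multiplier to a high percentile of observed RTTs (assumed policy)."""
    return multiplier * percentile(rtt_samples_ms, pct)


if __name__ == "__main__":
    # Made-up samples: a mostly stable path with one slow outlier.
    samples = [40, 41, 42, 43, 120]
    print(timeout_from_samples(samples, pct=50))  # sized from the median
    print(timeout_from_samples(samples, pct=95))  # sized from the tail
```

On a stable path the two results are close; on a noisy path the gap between them shows how much margin the plain × 3 on a typical RTT would leave out.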
Why × 3 specifically¶
The Zalando post does not derive the multiplier; it is presented as "commonly used as a conservative approach" and tunable. Intuitive decomposition:
- × 1: covers a perfectly clean handshake.
- × 2: covers one SYN retransmit (TCP SYN retransmit interval is ~1 s on most OSes, so for short RTTs this is optimistic).
- × 3: adds margin for transient packet loss, service startup (accept queue drain), and service restart windows.
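The caveat about SYN retransmits can be made concrete with a small model. The sketch below assumes an initial retransmission timeout of 1 s that doubles after each lost SYN, and that only SYNs are lost; real stacks differ in the details, and the function names are hypothetical.

```python
def handshake_completion_s(rtt_s, lost_syns, initial_rto_s=1.0):
    """Time at which the caller sees the handshake complete.

    Assumed model: the first SYN goes out at t=0, and each retransmit waits
    twice as long as the previous one (1 s, 2 s, 4 s, ...). The SYN that
    finally gets through is therefore sent at initial_rto_s * (2**lost_syns - 1),
    and the SYN-ACK arrives one RTT later.
    """
    if rtt_s <= 0 or lost_syns < 0:
        raise ValueError("rtt_s must be positive and lost_syns non-negative")
    return initial_rto_s * (2 ** lost_syns - 1) + rtt_s


def survives(rtt_s, lost_syns, multiplier=3.0, initial_rto_s=1.0):
    """True if a timeout of multiplier * RTT outlasts the given number of lost SYNs."""
    timeout_s = multiplier * rtt_s
    return handshake_completion_s(rtt_s, lost_syns, initial_rto_s) <= timeout_s


if __name__ == "__main__":
    for rtt_s in (0.001, 0.042, 0.160, 0.600):
        print(rtt_s, survives(rtt_s, lost_syns=0), survives(rtt_s, lost_syns=1))
```

Under these assumptions, a × 3 timeout outlasts a single lost SYN only once the RTT reaches about half a second. For the routes in the table above, the multiplier buys jitter margin rather than retransmit coverage.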
For same-DC traffic the multiplier can be raised further —
10 × 1 ms is still 10 ms, well under any user-noticeable delay.
For cross-continent traffic, × 3 may be the right ceiling
because going higher eats too far into an SLA.
When to deviate¶
- Cold-starting lambdas / serverless: SYN accept may wait on cold-start; may need × 5 or explicit fallback.
- Routes with known SYN-cookie behaviour: may need extra margin for server-side validation.
- Services behind a VIP / load balancer with active health checks: typical handshake is faster than bare-server case; × 2 may suffice.
- Mobile clients: RTT is itself bimodal (Wi-Fi vs. cellular); the multiplier should be applied to the worse case, not the mean.
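These deviations can be captured as a small lookup, sketched below. The profile names are hypothetical labels invented for this example; the multipliers are the ones suggested in the list above, and the mobile helper applies the multiplier to the slower of the two network modes.

```python
# Hypothetical profile names; multipliers follow the suggestions in the list above.
MULTIPLIERS = {
    "default": 3.0,
    "serverless_cold_start": 5.0,
    "behind_load_balancer": 2.0,
}


def timeout_for(profile: str, rtt_ms: float) -> float:
    """Connection timeout for a route profile, falling back to the x3 default."""
    return MULTIPLIERS.get(profile, MULTIPLIERS["default"]) * rtt_ms


def mobile_timeout_ms(wifi_rtt_ms: float, cellular_rtt_ms: float,
                      multiplier: float = 3.0) -> float:
    """Size from the worse of the two modes rather than from their mean."""
    return multiplier * max(wifi_rtt_ms, cellular_rtt_ms)


if __name__ == "__main__":
    print(timeout_for("serverless_cold_start", 28))
    print(timeout_for("behind_load_balancer", 28))
    # Made-up RTTs for illustration: 20 ms on Wi-Fi, 90 ms on cellular.
    print(mobile_timeout_ms(20, 90))
```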
Anti-pattern: one "socket timeout" for both phases¶
A common mistake — explicitly called out in the Zalando post — is to set a single "socket timeout" that covers both connection and request, on the belief it's simpler. This produces two failure modes at once:
- Connection timeout is too high → unreachable peers consume pool threads for the whole operation timeout.
- Request timeout is too low → real server work gets cut off because the socket timeout is sized for handshake.
Sizing them separately costs no code and buys back the diagnostic signal.
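A minimal sketch of separate sizing, using only Python's standard `socket` module, might look like the following. The `fetch` function and the two exception classes are hypothetical names for this example, and the single `recv` call is a simplification, not a full protocol client.

```python
import socket


class ConnectFailed(Exception):
    """The peer could not be reached within the connection timeout."""


class ServerSlow(Exception):
    """The peer was reached but did not answer within the request timeout."""


def fetch(host, port, payload, connect_timeout_s, request_timeout_s):
    """Send payload and read one reply, with a separate timeout per phase."""
    try:
        # Phase 1: bounded by the connection timeout (sized from RTT).
        sock = socket.create_connection((host, port), timeout=connect_timeout_s)
    except OSError as exc:  # includes refused connections and connect timeouts
        raise ConnectFailed(f"could not reach {host}:{port}: {exc}") from exc

    with sock:
        # Phase 2: bounded by the request timeout (sized from operation latency).
        sock.settimeout(request_timeout_s)
        try:
            sock.sendall(payload)
            return sock.recv(4096)
        except socket.timeout as exc:
            raise ServerSlow(f"no reply from {host}:{port} in time") from exc


# Example call (values are illustrative): 126 ms to connect, 2 s for the reply.
# reply = fetch("example.internal", 8080, b"ping\n", 0.126, 2.0)
```

Because the two phases raise different exceptions, the caller keeps the distinction between an unreachable server and a slow one.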
Related¶
- concepts/connection-timeout — the timeout this pattern sizes.
- concepts/round-trip-time-rtt — the network property that drives the number.
- concepts/tcp-three-way-handshake — the protocol event the timeout bounds.
- patterns/explicit-timeout-on-remote-calls — the broader rule requiring both timeouts to be set.
Seen in¶
- sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts — canonical wiki home. Zalando house-style heuristic.