Zalando — All you need to know about timeouts¶
Summary¶
A 2023-07-25 Zalando Engineering post (author unattributed) that distills Zalando's house style for timeout configuration in microservice clients into a single reference. Its core argument: default timeouts are the enemy — many libraries ship with infinite or multi-minute defaults that are incompatible with microservice SLAs, so every remote call must set timeouts explicitly. The piece draws the sharp distinction between connection timeout (bounded by network RTT and the TCP three-way handshake) and request timeout (bounded by downstream server work); gives the RTT × 3 heuristic for connection timeouts; walks the chain-of-calls SLA budgeting problem with two named resolutions (time limiter wrapping chained calls vs. per-call budget split); and canonicalises Zalando's retry policy (retry on 5xx + timeouts, not 4xx) together with exponential backoff + jitter and the Stripe/AWS Idempotency-Key header as the enabling contract for safe retries on non-idempotent writes. The post's load-bearing mental model is the thread-pool-exhaustion cascade: increasing a timeout silently decreases application throughput, and a single stuck downstream with no timeout drains the caller's thread / connection / DB-connection pools — which is why "the default timeout is your enemy."
Key takeaways¶
- Default timeouts are the enemy; always set timeouts explicitly on every remote call. The article names the specific failure mode: "for native java HttpClient the default connection/request timeout is infinite, which is unlikely within your SLA." Libraries ship with high or infinite defaults to maximise out-of-the-box compatibility; production services must override every single one (Source: sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts).
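A minimal sketch of this rule on the native Java HttpClient, folding in the post's RTT × 3 heuristic for the connection timeout. The service URL is hypothetical; 42 ms is the post's NYC→SF RTT figure, and 700 ms stands in for a measured downstream p99.9 — both are illustrative, not Zalando production values:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class ExplicitTimeouts {

    // Connection timeout is bounded by the TCP handshake, so derive it
    // from measured RTT using the post's heuristic: RTT × 3.
    static Duration connectionTimeout(long measuredRttMillis) {
        return Duration.ofMillis(measuredRttMillis * 3);
    }

    // Override the client's infinite default connection timeout.
    static HttpClient newClient(long measuredRttMillis) {
        return HttpClient.newBuilder()
                .connectTimeout(connectionTimeout(measuredRttMillis))
                .build();
    }

    // Override the infinite default request timeout, sized from the
    // downstream's collected latency percentiles (here: its p99.9).
    static HttpRequest newRequest(String url, Duration requestTimeout) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(requestTimeout)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // NYC -> SF RTT ~42 ms  =>  connection timeout = 126 ms.
        HttpClient client = newClient(42);
        HttpRequest request = newRequest(
                "https://order-service.example/orders/42", // hypothetical URL
                Duration.ofMillis(700));
        System.out.println(client.connectTimeout().orElseThrow().toMillis()); // 126
        System.out.println(request.timeout().orElseThrow().toMillis());      // 700
    }
}
```

If either builder call is omitted, the corresponding timeout stays infinite — exactly the default the post warns about.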
- Increasing a timeout potentially decreases throughput. The load-bearing quote: "when you increase timeouts you potentially decrease the throughput of your application." The mechanism is thread-pool / connection-pool exhaustion: while a client waits for a response, threads, HTTPS connections and database connections remain held. Even if the upstream client has closed the TCP connection, without a server-side timeout the server is still processing. The result: one stuck downstream drains the caller's pools and the caller stops serving unrelated traffic.
- Connection timeout and request timeout are bounded by different things and must be sized independently. The article explicitly calls out that the common practice of "set connection timeout equal to or slightly lower than the operation timeout" is wrong, because the two processes are different: establishing a TCP connection is fast and bounded by RTT, while an operation can take hundreds-to-thousands of ms of server work. Sizing them the same buries legitimate diagnostic signal. The connection timeout should be derived from network quality (same DC ~sub-ms RTT; NYC→SF ~42 ms; NYC→Sydney ~160 ms), and the request timeout from collected p99 / p99.9 latency of the downstream operation.
- Connection timeout = RTT × 3 is the canonical heuristic. "Connection timeout = RTT × 3 is commonly used as a conservative approach, but you can adjust it based on your specific needs." Captures the TCP three-way handshake plus a small margin. Low enough to quickly detect an unreachable downstream; high enough to tolerate a transient network blip or a slow service startup. Canonicalised on the wiki as patterns/connection-timeout-rtt-times-three.
- Request-timeout sizing is driven by measured latency metrics plus an acceptable false-timeout rate. The prescribed sequence: (a) integrate with the new API in shadow mode (parallel to prod, separate thread-pool, mirrored traffic), (b) collect p50/p99/p99.9 from the downstream, (c) pick an acceptable false-timeout rate (e.g. 0.1%), (d) set request timeout = the corresponding latency percentile (0.1% false-timeout rate ⇒ p99.9). Then choose between "use the max timeout" and "lower the timeout + enable retries" as a deliberate trade-off. SLA documents from the upstream are only a starting point for test design, not a trustworthy timeout value.
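Steps (b)–(d) reduce to a percentile lookup over the shadow-mode samples. A sketch using the nearest-rank percentile method — the sizing rule is the post's; the synthetic latencies and helper names are illustrative:

```java
import java.util.Arrays;

public class RequestTimeoutSizing {

    // Nearest-rank percentile over collected latency samples (ms).
    static long percentile(long[] latenciesMillis, double p) {
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Acceptable false-timeout rate f%  =>  timeout = (100 - f)th percentile,
    // e.g. 0.1% false timeouts => p99.9.
    static long requestTimeoutFor(long[] latenciesMillis, double falseTimeoutRatePercent) {
        return percentile(latenciesMillis, 100.0 - falseTimeoutRatePercent);
    }

    public static void main(String[] args) {
        // Synthetic shadow-mode samples: 1000 calls ranging 100..599 ms.
        long[] samples = new long[1000];
        for (int i = 0; i < 1000; i++) samples[i] = 100 + i / 2;
        System.out.println(requestTimeoutFor(samples, 0.1)); // 599 (the p99.9)
    }
}
```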
- Chained calls have two canonical resolutions: time-budget split vs. outer TimeLimiter wrap. For a service with SLA 1000 ms calling Order (p99.9 = 700 ms) and Payment (p99.9 = 700 ms) sequentially: Option 1 — split 500 + 500 (guaranteed SLA, but both downstreams will produce false-positive timeouts because 500 < p99.9). Option 2 — set each per-call timeout to the full 700 ms and wrap the whole chain in a 1000 ms time-limiter (no false per-call timeouts, outer limiter enforces the SLA when one downstream is slow but not both). Option 2 exploits the fact that both downstreams rarely tail simultaneously. Java idiom: CompletableFuture.supplyAsync(…).thenApply(…).orTimeout(1, TimeUnit.SECONDS) or Resilience4j's TimeLimiter module.
- Retry on 5xx + timeouts; do not retry on 4xx. Canonical policy stated verbatim. The framing: idempotent operations are generally safe to retry; [[concepts/non-idempotent-operations|non-idempotent writes]] (financial txns, object-creation mutations) can create duplicates. Even when you believe an operation is idempotent, "if possible, ask the service owner whether it is a good idea to enable retries." Retries also amplify load on a struggling downstream, so always pair retries with a circuit breaker. This is the Zalando house-style retry contract (Source: this article).
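The Option 2 idiom for chained calls can be sketched with plain CompletableFuture — the two downstream calls are stubbed here; a real client would also carry the 700 ms per-call request timeouts on the HTTP requests themselves:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class ChainedCallBudget {

    // Stand-ins for the Order and Payment downstream calls (each would
    // have its own 700 ms request timeout = its measured p99.9).
    static String callOrderService() { return "order-ok"; }
    static String callPaymentService(String order) { return order + "/payment-ok"; }

    // Option 2: full per-call timeouts inside, one outer time limiter
    // enforcing the service's 1000 ms SLA over the whole chain.
    static CompletableFuture<String> placeOrder() {
        return CompletableFuture
                .supplyAsync(ChainedCallBudget::callOrderService)
                .thenApply(ChainedCallBudget::callPaymentService)
                .orTimeout(1, TimeUnit.SECONDS); // outer limiter = SLA
    }

    public static void main(String[] args) {
        System.out.println(placeOrder().join());
    }
}
```

If the chain exceeds the outer budget, the future completes exceptionally with a TimeoutException; Resilience4j's TimeLimiter is the library form of the same wrap.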
- Idempotency-Key header is the enabling contract for safe retry on non-idempotent writes. "When creating or updating an object, use an idempotency key. Then, if a connection error occurs, you can safely repeat the request without the risk of creating a second object or performing the update twice." Named references: Stripe's Idempotent Requests and Amazon Builders' Library — Making retries safe with idempotent APIs. The load-bearing idea: server-side deduplication by caller-supplied key converts any write into an idempotent one from the caller's perspective.
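A sketch of the caller side of this contract: the key is generated once per logical operation and reused verbatim on every retry, so the server can deduplicate. The URL, body, and 700 ms timeout are illustrative; the header name follows the Stripe convention the post links to:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.UUID;

public class IdempotentRetry {

    // Build the write request with a caller-supplied Idempotency-Key.
    // Rebuilding with the SAME key on retry is what makes the retry safe:
    // the server deduplicates the second attempt instead of creating twice.
    static HttpRequest createOrderRequest(String url, String idempotencyKey, String body) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofMillis(700))
                .header("Idempotency-Key", idempotencyKey)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        String key = UUID.randomUUID().toString(); // once per logical operation
        HttpRequest first = createOrderRequest("https://order-service.example/orders", key, "{}");
        HttpRequest retry = createOrderRequest("https://order-service.example/orders", key, "{}");
        // Both attempts carry the identical key => at most one order is created.
        System.out.println(first.headers().firstValue("Idempotency-Key").orElseThrow()
                .equals(retry.headers().firstValue("Idempotency-Key").orElseThrow()));
    }
}
```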
- Exponential backoff + jitter is the default retry-scheduling strategy. Exponential backoff spreads retry load across time; jitter randomises the per-client schedule so a fleet of retrying clients doesn't synchronise into retry storms. Named reference: AWS Architecture Blog — Exponential Backoff and Jitter. Built into AWS SDKs by default.
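A sketch of the "full jitter" variant from the named AWS post, with the retry-on-5xx-not-4xx predicate folded in. The base/cap values and the use of status -1 as a stand-in for a timeout are illustrative assumptions, not from the article:

```java
import java.util.concurrent.ThreadLocalRandom;

public class BackoffWithJitter {

    // House-style predicate: retry on 5xx and timeouts, never on 4xx.
    // (Timeouts surface as exceptions in a real client; -1 stands in here.)
    static boolean shouldRetry(int statusCode) {
        return statusCode >= 500 || statusCode == -1;
    }

    // "Full jitter": delay drawn uniformly from [0, base * 2^attempt],
    // with the exponential term capped so delays don't grow unboundedly.
    static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        long exp = Math.min(capMillis, baseMillis * (1L << attempt));
        return ThreadLocalRandom.current().nextLong(exp + 1);
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(503)); // true: server error, worth retrying
        System.out.println(shouldRetry(404)); // false: client error, retry won't help
        for (int attempt = 0; attempt < 4; attempt++) {
            // Uniform draws from [0,100], [0,200], [0,400], [0,800] ms.
            System.out.println(backoffMillis(attempt, 100, 2000));
        }
    }
}
```

The jitter is the load-bearing part: without it, a fleet of clients that failed together retries together.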
- Retries are not universally helpful — p99 ≈ p50 is the diagnostic signal to avoid retries. The article contrasts two latency-distribution shapes: shape A (big gap between p50 and p99, occasional tail) — good case for retries; shape B (p99 close to p50, periodic timeouts) — do not retry. When the tail is the mean, retries don't beat the original attempt and just multiply load. Named as a distribution-shape diagnostic test for retry applicability (Source: this article).
Systems named¶
- systems/java-httpclient — native Java HTTP client; explicitly called out as having infinite default connection and request timeouts, the canonical example of "the default timeout is your enemy."
- systems/java-completablefuture — the Java standard-library async primitive the article recommends for wrapping a chain of calls with an outer time limit via orTimeout(…) or completeOnTimeout(…).
- systems/resilience4j — named JVM resilience library; the author specifically cites its TimeLimiter module as the canonical wrapper pattern. (The same library is the community standard for circuit breakers, bulkheads, and retry.)
- systems/aws-sdk-retry — canonical reference for exponential backoff + jitter as a built-in retry strategy in production SDKs.
- systems/stripe-api — canonical reference for the Idempotency-Key header pattern; the post links directly to Stripe's public docs.
- systems/amazon-workspaces-health-check — cited as a live RTT health-check UI showing the author's 28 ms RTT to the recommended AWS region; used to anchor the "derive connection timeout from measured RTT" discipline.
Concepts named¶
- concepts/connection-timeout — the TCP-handshake-bounded timeout; first canonical wiki instance.
- concepts/request-timeout — the post-connection server-response timeout; first canonical wiki instance.
- concepts/tcp-three-way-handshake — SYN / SYN-ACK / ACK exchange whose completion bounds the connection timeout.
- concepts/round-trip-time-rtt — the foundational network property from which connection timeouts are derived; post gives concrete numbers (NYC↔SF ~42 ms, NYC↔Sydney ~160 ms).
- concepts/fail-fast-principle — the general design principle behind setting tight timeouts.
- concepts/thread-pool-exhaustion — the failure mode unbounded or too-high timeouts eventually trigger.
- concepts/time-budget-sharing — the per-call SLA-division resolution for chained calls (Option 1 in the post).
- concepts/false-timeout-rate — the explicit tunable parameter (e.g. 0.1%) that determines how high to set the request timeout based on collected latency percentiles.
- concepts/shadow-mode-metric-collection — running a new integration in parallel with production on a separate thread-pool / mirrored traffic to collect latency metrics before production cut-over.
- concepts/exponential-backoff-jitter — retry-scheduling strategy pairing geometric delay growth with randomisation.
- concepts/idempotent-operations vs. concepts/non-idempotent-operations — the determining property for whether retries are safe without an Idempotency-Key.
Patterns named¶
- patterns/explicit-timeout-on-remote-calls — the "set timeouts explicitly on any remote call" rule; Zalando house-style default.
- patterns/connection-timeout-rtt-times-three — the RTT × 3 heuristic for connection-timeout sizing.
- patterns/time-limiter-wrapping-chained-calls — Option 2 for chained-call budgets: full per-call timeouts + outer time-limiter.
- patterns/retry-on-5xx-not-4xx — canonical client-side retry policy (retry on 5xx + timeouts; never on 4xx).
- patterns/idempotency-key-header — caller-supplied key for server-side deduplication; enables safe retry of non-idempotent writes.
- patterns/circuit-breaker — paired with retry to avoid retry-induced amplification of downstream failure.
Operational numbers¶
- Round-trip times cited verbatim: NYC ↔ SF on fibre ~42 ms; NYC ↔ Sydney ~160 ms; author's local machine to recommended AWS region 28 ms (from the AWS WorkSpaces Connection Health Check page).
- Example chained-call SLA: service SLA 1000 ms; Order Service p99.9 = 700 ms; Payment Service p99.9 = 700 ms.
- Example false-timeout rate: 0.1% → set request timeout to downstream p99.9.
- Split-budget example numbers: 500 ms per downstream (under Option 1).
- Time-limiter example numbers: 700 ms per downstream + 1000 ms outer limit (under Option 2).
Caveats¶
- No production Zalando numbers. The post is a reference article, not a case study. It doesn't report Zalando-specific fleet-wide timeout distributions, incident counts, or before/after metrics — all examples are generic.
- No explicit defence of RTT × 3 specifically. The multiplier is stated as "commonly used" without a derivation or Zalando-internal calibration; treat it as a starting point, not a measured Zalando production value.
- No discussion of server-side timeouts. The article is strictly client-side. Matching server-side request-processing timeouts, graceful-shutdown deadlines, and timeout propagation via headers (e.g. gRPC grpc-timeout) are out of scope.
- The p99 ≈ p50 diagnostic is presented visually without thresholds. The two-graph contrast is qualitative; the post doesn't quantify how close p99 and p50 must be to disable retries, nor does it address intermediate cases (periodic bi-modal distributions).
- Circuit breakers are recommended but not designed. The article names circuit-breaker-paired-with-retry as a rule without any design guidance (thresholds, half-open behaviour, per-instance vs. per-dependency granularity).
- No language-specific tuning beyond Java. All code snippets and library references (CompletableFuture, Resilience4j, HttpClient) are JVM-centric; Go / Python / Node production guidance is absent.
Source¶
- Original: https://engineering.zalando.com/posts/2023/07/all-you-need-to-know-about-timeouts.html
- Raw markdown:
raw/zalando/2023-07-25-all-you-need-to-know-about-timeouts-778f9a81.md
Related¶
- companies/zalando — Zalando Engineering, Tier-2 source; this is the 13th ingest and opens a new timeouts + retry + resilience axis (axis 10) distinct from the existing nine.
- concepts/tail-latency-at-scale — the latency-distribution framing underlying the retry-applicability diagnostic.
- patterns/circuit-breaker — must-pair companion to retry.
- concepts/connection-pool-exhaustion — specific instance of the thread-pool-exhaustion failure mode the post names.