
ZALANDO 2023-07-25


Zalando — All you need to know about timeouts

Summary

A 2023-07-25 Zalando Engineering post (author unattributed) that distills Zalando's house style for timeout configuration in microservice clients into a single reference. Its core argument: default timeouts are the enemy — many libraries ship with infinite or multi-minute defaults that are incompatible with microservice SLAs, so every remote call must set timeouts explicitly. The piece draws the sharp distinction between connection timeout (bounded by network RTT and the TCP three-way handshake) and request timeout (bounded by downstream server work); gives the RTT × 3 heuristic for connection timeouts; walks the chain-of-calls SLA budgeting problem with two named resolutions (time limiter wrapping chained calls vs. per-call budget split); and canonicalises Zalando's retry policy (retry on 5xx + timeouts, not 4xx) together with exponential backoff + jitter and the Stripe/AWS Idempotency-Key header as the enabling contract for safe retries on non-idempotent writes. The post's load-bearing mental model is the thread-pool-exhaustion cascade: increasing a timeout silently decreases application throughput, and a single stuck downstream with no timeout drains the caller's thread / connection / DB-connection pools — which is why "the default timeout is your enemy."

Key takeaways

  1. Default timeouts are the enemy; always set timeouts explicitly on every remote call. The article names the specific failure mode: "for native java HttpClient the default connection/request timeout is infinite, which is unlikely within your SLA." Libraries ship with high or infinite defaults to maximise out-of-the-box compatibility; production services must override every single one (Source: sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts).
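
A minimal illustration of overriding those defaults on the JDK's native HttpClient; the URL and the 100 ms / 700 ms values are placeholders for this sketch, not numbers from the article:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class ExplicitTimeouts {
    // Without these two calls, both timeouts are infinite on the JDK client.
    public static HttpClient client() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(100)) // bounds the TCP handshake (~ RTT x 3)
                .build();
    }

    public static HttpRequest request(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofMillis(700))        // bounds downstream work (~ its p99.9)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest r = request("https://payment.example.internal/health");
        System.out.println(r.timeout().orElseThrow().toMillis()); // 700
    }
}
```

Both knobs must be set independently: `connectTimeout` is a client-level property bounded by network RTT, while `timeout` is a per-request property bounded by server work.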

  2. Increasing a timeout potentially decreases throughput. The load-bearing quote: "when you increase timeouts you potentially decrease the throughput of your application." The mechanism is thread-pool / connection-pool exhaustion: while a client waits for a response, threads, HTTPS connections and database connections remain held. Even if the upstream client has closed the TCP connection, without a server-side timeout the server is still processing. The result: one stuck downstream drains the caller's pools and the caller stops serving unrelated traffic.

  3. Connection timeout and request timeout are bounded by different things and must be sized independently. The article explicitly calls out that the common practice of "set connection timeout equal to or slightly lower than the operation timeout" is wrong, because the two processes are different: establishing a TCP connection is fast and bounded by RTT, while an operation can take hundreds-to-thousands of ms of server work. Sizing them the same buries legitimate diagnostic signal. The connection timeout should be derived from network quality (same DC ~sub-ms RTT; NYC→SF ~42 ms; NYC→Sydney ~160 ms), and the request timeout from collected p99 / p99.9 latency of the downstream operation.

  4. Connection timeout = RTT × 3 is the canonical heuristic. "Connection timeout = RTT × 3 is commonly used as a conservative approach, but you can adjust it based on your specific needs." Captures the TCP three-way handshake plus a small margin. Low enough to quickly detect an unreachable downstream; high enough to tolerate a transient network blip or a slow service startup. Canonicalised on the wiki as patterns/connection-timeout-rtt-times-three.
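
The heuristic is a one-liner; the example RTTs below are the ones the article quotes:

```java
import java.time.Duration;

public class ConnectionTimeout {
    // RTT x 3: the TCP three-way handshake costs ~1.5 RTT, so x3 leaves
    // margin for a transient network blip or a slow service startup.
    public static Duration fromRtt(Duration measuredRtt) {
        return measuredRtt.multipliedBy(3);
    }

    public static void main(String[] args) {
        System.out.println(fromRtt(Duration.ofMillis(42)).toMillis());  // NYC -> SF: 126
        System.out.println(fromRtt(Duration.ofMillis(160)).toMillis()); // NYC -> Sydney: 480
    }
}
```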

  5. Request-timeout sizing is driven by measured latency metrics plus an acceptable false-timeout rate. The prescribed sequence: (a) integrate with the new API in shadow mode (parallel to prod, separate thread-pool, mirrored traffic), (b) collect p50/p99/p99.9 from the downstream, (c) pick an acceptable false-timeout rate (e.g. 0.1%), (d) set request timeout = the corresponding latency percentile (0.1% false-timeout rate ⇒ p99.9). Then choose between "use the max timeout" and "lower the timeout + enable retries" as a deliberate trade-off. SLA documents from the upstream are only a starting point for test design, not a trustworthy timeout value.
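
Step (d) reduces to a percentile lookup over the latencies collected in shadow mode. A sketch, assuming a simple nearest-rank percentile (the article prescribes the mapping, not the implementation):

```java
import java.util.Arrays;

public class RequestTimeoutSizing {
    // Pick the latency percentile matching the acceptable false-timeout
    // rate: a rate of 0.001 (0.1%) means the timeout equals p99.9, so
    // only ~1 in 1000 legitimate responses is cut off.
    public static long timeoutMillis(long[] shadowLatenciesMillis, double falseTimeoutRate) {
        long[] sorted = shadowLatenciesMillis.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil((1.0 - falseTimeoutRate) * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }
}
```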

  6. Chained calls have two canonical resolutions: time-budget split vs. outer TimeLimiter wrap. For a service with SLA 1000 ms calling Order (p99.9 = 700 ms) and Payment (p99.9 = 700 ms) sequentially: Option 1 — split 500 + 500 (guaranteed SLA, but both downstreams will produce false-positive timeouts because 500 < p99.9). Option 2 — set each per-call timeout to the full 700 ms and wrap the whole chain in a 1000 ms time-limiter (no false per-call timeouts, outer limiter enforces the SLA when one downstream is slow but not both). Option 2 exploits the fact that both downstreams rarely tail simultaneously. Java idiom: CompletableFuture.supplyAsync(…).thenApply(…).orTimeout(1, TimeUnit.SECONDS) or Resilience4j's TimeLimiter module.
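
A minimal sketch of Option 2 in the article's own CompletableFuture idiom; callOrder/callPayment are stand-ins for real clients whose per-request timeout would be the full 700 ms:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class ChainWithTimeLimit {
    // Stand-ins for Order and Payment clients (each with its own 700 ms
    // request timeout); here they return instantly for illustration.
    static String callOrder() { return "order-123"; }
    static String callPayment(String orderId) { return "paid:" + orderId; }

    public static String process() throws Exception {
        return CompletableFuture
                .supplyAsync(ChainWithTimeLimit::callOrder)
                .thenApply(ChainWithTimeLimit::callPayment)
                .orTimeout(1, TimeUnit.SECONDS)   // outer limiter enforces the 1000 ms SLA
                .get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(process()); // paid:order-123
    }
}
```

The per-call timeouts live inside the clients; the single `orTimeout` guards the sum, which is exactly why the chain survives one slow downstream but not two.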

  7. Retry on 5xx + timeouts; do not retry on 4xx. Canonical policy stated verbatim. The framing: idempotent operations are generally safe to retry; [[concepts/non-idempotent-operations|non-idempotent writes]] (financial txns, object-creation mutations) can create duplicates. Even when you believe an operation is idempotent, "if possible, ask the service owner whether it is a good idea to enable retries." Retries also amplify load on a struggling downstream, so always pair retries with a circuit breaker. This is the Zalando house-style retry contract (Source: this article).
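
The policy fits in a single predicate; a sketch of the stated rule only (a production client would additionally route this through a circuit breaker):

```java
public class RetryPolicy {
    // Zalando's rule as stated: retry timeouts and 5xx (transient server
    // trouble); never retry 4xx, because the request itself is at fault
    // and will fail identically on every attempt.
    public static boolean shouldRetry(int statusCode, boolean timedOut) {
        if (timedOut) return true;
        return statusCode >= 500 && statusCode < 600;
    }
}
```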

  8. Idempotency-Key header is the enabling contract for safe retry on non-idempotent writes. "When creating or updating an object, use an idempotency key. Then, if a connection error occurs, you can safely repeat the request without the risk of creating a second object or performing the update twice." Named references: Stripe's Idempotent Requests and Amazon Builders' Library — Making retries safe with idempotent APIs. The load-bearing idea: server-side deduplication by caller-supplied key converts any write into an idempotent one from the caller's perspective.
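
What the contract looks like from the caller's side; the endpoint is hypothetical, and the header name follows Stripe's convention:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.UUID;

public class IdempotentWrite {
    // The caller supplies the key; the server deduplicates retried POSTs
    // by it, so repeating this request cannot create a second payment.
    public static HttpRequest createPayment(String jsonBody, String idempotencyKey) {
        return HttpRequest.newBuilder(URI.create("https://payment.example.internal/payments"))
                .header("Idempotency-Key", idempotencyKey)
                .timeout(Duration.ofMillis(700))
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest r = createPayment("{\"amount\":100}", UUID.randomUUID().toString());
        System.out.println(r.headers().firstValue("Idempotency-Key").isPresent()); // true
    }
}
```

The key must be generated once per logical operation and reused across retries; a fresh key per attempt defeats the deduplication.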

  9. Exponential backoff + jitter is the default retry-scheduling strategy. Exponential backoff spreads retry load across time; jitter randomises the per-client schedule so a fleet of retrying clients doesn't synchronise into retry storms. Named reference: AWS Architecture Blog — Exponential Backoff and Jitter. Built into AWS SDKs by default.
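
A sketch of the "full jitter" variant from the cited AWS post: the exponential term caps the window, and the uniform draw desynchronises the fleet:

```java
import java.util.concurrent.ThreadLocalRandom;

public class BackoffWithJitter {
    // Full jitter: sleep = random(0, min(cap, base * 2^attempt)).
    // The randomness is the point — identical clients retrying a shared
    // downstream would otherwise synchronise into retry storms.
    public static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long window = Math.min(capMillis, baseMillis * (1L << attempt));
        return ThreadLocalRandom.current().nextLong(window + 1);
    }
}
```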

  10. Retries are not universally helpful — p99 ≈ p50 is the diagnostic signal to avoid retries. The article contrasts two latency-distribution shapes: shape A (big gap between p50 and p99, occasional tail) — good case for retries; shape B (p99 close to p50, periodic timeouts) — do not retry. When the tail is the mean, retries don't beat the original attempt and just multiply load. Named as a distribution-shape diagnostic test for retry applicability (Source: this article).
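
The article's diagnostic is visual only, so any numeric threshold is an assumption; a hypothetical helper just to make the shape test concrete (minRatio is invented, not from the post):

```java
public class RetryApplicability {
    // Shape A (p99 >> p50): occasional tail, retries likely beat waiting.
    // Shape B (p99 ~= p50): the tail is the norm, retries only add load.
    // minRatio is a made-up knob; the article gives no number.
    public static boolean retriesLikelyHelp(double p50, double p99, double minRatio) {
        return p99 / p50 >= minRatio;
    }
}
```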

Systems named

  • systems/java-httpclient — native Java HTTP client; explicitly called out as having infinite default connection and request timeouts, the canonical example of "the default timeout is your enemy."
  • systems/java-completablefuture — the Java standard-library async primitive the article recommends for wrapping a chain of calls with an outer time limit via orTimeout(…) or completeOnTimeout(…).
  • systems/resilience4j — named JVM resilience library, author specifically cites its TimeLimiter module as the canonical wrapper pattern. (Same library community-standard for circuit breakers, bulkheads, and retry.)
  • systems/aws-sdk-retry — canonical reference for exponential backoff + jitter as a built-in retry strategy in production SDKs.
  • systems/stripe-api — canonical reference for the Idempotency-Key header pattern; the post links directly to Stripe's public docs.
  • systems/amazon-workspaces-health-check — cited as a live RTT health-check UI showing the author's 28 ms RTT to the recommended AWS region; used to anchor the "derive connection timeout from measured RTT" discipline.

Concepts named

  • concepts/non-idempotent-operations — writes (financial txns, object-creation mutations) where a blind retry can create duplicates; the motivation for the Idempotency-Key contract in takeaway 8.

Patterns named

  • patterns/connection-timeout-rtt-times-three — the RTT × 3 connection-timeout heuristic canonicalised in takeaway 4.

Operational numbers

  • Round-trip times cited verbatim: NYC ↔ SF on fibre ~42 ms; NYC ↔ Sydney ~160 ms; author's local machine to recommended AWS region 28 ms (from the AWS WorkSpaces Connection Health Check page).
  • Example chained-call SLA: service SLA 1000 ms; Order Service p99.9 = 700 ms; Payment Service p99.9 = 700 ms.
  • Example false-timeout rate: 0.1% → set request timeout to downstream p99.9.
  • Split-budget example numbers: 500 ms per downstream (under Option 1).
  • Time-limiter example numbers: 700 ms per downstream + 1000 ms outer limit (under Option 2).

Caveats

  • No production Zalando numbers. The post is a reference article, not a case study. It doesn't report Zalando-specific fleet-wide timeout distributions, incident counts, or before/after metrics — all examples are generic.
  • No explicit defence of RTT × 3 specifically. The multiplier is stated as "commonly used" without a derivation or Zalando-internal calibration; treat it as a starting point, not a measured Zalando production value.
  • No discussion of server-side timeouts. The article is strictly client-side. Matching server-side request-processing timeouts, graceful-shutdown deadlines, and timeout propagation via headers (e.g. gRPC grpc-timeout) are out of scope.
  • p99 ≈ p50 diagnostic is presented visually without thresholds. The two-graph contrast is qualitative; the post doesn't quantify how close p99 and p50 must be to disable retries, nor does it address intermediate cases (periodic bi-modal distributions).
  • Circuit breakers are recommended but not designed. The article names circuit-breaker-paired-with-retry as a rule without any design guidance (thresholds, half-open behaviour, per-instance vs. per-dependency granularity).
  • No language-specific tuning beyond Java. All code snippets and library references (CompletableFuture, Resilience4j, HttpClient) are JVM-centric; Go / Python / Node production guidance is absent.

Source

sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts