CONCEPT Cited by 1 source

Time budget sharing

Definition

Time budget sharing is the chained-call timeout strategy where a caller's SLA is divided among its sequential downstream calls, so that per-call timeouts sum to at most the caller's own deadline. It guarantees SLA compliance by construction at the cost of under-sizing each per-call budget relative to its downstream's observed latency tail.

Zalando's timeouts post uses the canonical example:

"Imagine your service has SLA 1000 ms and it calls sequentially Order Service with p99.9 = 700 ms and then Payment Service with p99.9 = 700 ms. How to configure timeout and not breach the SLA?

Option 1: Share your time budget. One option would be to share your time budget (your SLA) between services and set timeouts accordingly 500 ms for Order Service and 500 ms for Payment Service. In this case, you have a guarantee that you will not breach your SLA but you might have some false positive timeouts." (Source: sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts)
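The split in the quoted example can be written down as a few lines of arithmetic. The following is a minimal sketch, not code from the Zalando post: the function name `share_time_budget` and the optional `weights` parameter are hypothetical, and the equal split is the only variant the source describes.

```python
from typing import List, Optional, Sequence


def share_time_budget(
    sla_ms: float,
    n_calls: int,
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Split a caller's SLA into per-call timeouts that sum to the SLA.

    With no weights the split is equal, as in the quoted example.
    Weights are a hypothetical extension for unequal shares.
    """
    if n_calls <= 0:
        raise ValueError("n_calls must be positive")
    if weights is None:
        weights = [1.0] * n_calls
    if len(weights) != n_calls or any(w <= 0 for w in weights):
        raise ValueError("need one positive weight per call")
    total = sum(weights)
    return [sla_ms * w / total for w in weights]


# SLA of 1000 ms shared by Order Service and Payment Service.
order_timeout_ms, payment_timeout_ms = share_time_budget(1000, 2)
```

Both timeouts come out at 500 ms, below each service's 700 ms p99.9, which is where the false positive timeouts mentioned in the quote come from.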

The trade-off

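Time budget sharing guarantees SLA compliance by construction: the per-call timeouts sum to at most the SLA, so the sequential downstream calls cannot by themselves push the caller past its deadline. This assumes the caller's own processing time is negligible or is carved out of the budget before the split.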

The cost is a false-timeout rate above the baseline: each per-call budget is lower than the downstream's p99.9, so the caller times out on responses that are still within the downstream's normal latency distribution. Concretely, in the Zalando example, a 500 ms cap against a 700 ms p99.9 means at least 0.1% of calls exceed the cap, and typically noticeably more; how much more depends on how much of the latency distribution lies between 500 ms and 700 ms.
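To get a feel for the size of that increase, one can fit a latency model and read off the tail above the cap. The sketch below assumes a log-normal latency distribution and a 200 ms median; both are illustrative assumptions, not figures from the Zalando post, and the function name `exceed_probability` is hypothetical. Only the 700 ms p99.9 and the 500 ms cap come from the example.

```python
import math
from statistics import NormalDist


def exceed_probability(cap_ms: float, median_ms: float, p999_ms: float) -> float:
    """P(latency > cap_ms) for a log-normal fitted to a median and a p99.9.

    The log-normal shape and the median are assumptions for illustration.
    """
    std_normal = NormalDist()
    z999 = std_normal.inv_cdf(0.999)  # standard-normal quantile for p99.9
    sigma = math.log(p999_ms / median_ms) / z999
    z_cap = math.log(cap_ms / median_ms) / sigma
    return 1.0 - std_normal.cdf(z_cap)


# Hypothetical median of 200 ms; p99.9 = 700 ms and cap = 500 ms from the example.
per_call_rate = exceed_probability(cap_ms=500, median_ms=200, p999_ms=700)

# Two sequential calls, assumed independent: the chain fails if either times out.
chain_rate = 1 - (1 - per_call_rate) ** 2
```

Under these assumptions the per-call false-timeout rate lands on the order of 1%, roughly ten times the 0.1% baseline. A different median or distribution shape would give a different number, so the result is an order-of-magnitude illustration only.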

Contrast: time-limiter wrap

The alternative resolution in the Zalando post is patterns/time-limiter-wrapping-chained-calls: leave each per-call timeout at p99.9 (or above) and wrap the whole chain in an outer time limiter equal to the SLA. This exploits the observation that both downstreams rarely tail simultaneously. The result is fewer per-call false timeouts while remaining SLA-safe, but enforcement relies on the outer wrapper firing when the aggregate budget is blown.

| Strategy | Per-call budget | SLA guarantee | False-timeout rate |
|---|---|---|---|
| Time budget sharing | < downstream p99.9 | Hard | Elevated |
| Time-limiter wrap | ≥ downstream p99.9 | Hard (via outer limit) | Near p99.9 baseline |
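The two strategies can be compared side by side with a small asyncio sketch. Everything here is illustrative: `fake_downstream` stands in for a real client call, `run_chain` is a hypothetical helper, and the durations are the example's values scaled down (1000 ms SLA becomes 0.5 s, 700 ms becomes 0.35 s, 500 ms becomes 0.25 s) so the demo runs quickly.

```python
import asyncio
from typing import Optional, Sequence


async def fake_downstream(latency_s: float) -> None:
    """Stand-in for a downstream call that takes latency_s seconds."""
    await asyncio.sleep(latency_s)


async def run_chain(
    latencies_s: Sequence[float],
    per_call_timeouts_s: Sequence[float],
    outer_limit_s: Optional[float] = None,
) -> str:
    """Run calls sequentially with per-call timeouts and an optional outer limit."""

    async def chain() -> None:
        for latency, timeout in zip(latencies_s, per_call_timeouts_s):
            await asyncio.wait_for(fake_downstream(latency), timeout=timeout)

    try:
        # timeout=None means no outer limiter (pure budget sharing).
        await asyncio.wait_for(chain(), timeout=outer_limit_s)
        return "ok"
    except asyncio.TimeoutError:
        return "timeout"


async def demo() -> None:
    slow_order_fast_payment = [0.30, 0.05]
    # Budget sharing: 0.25 s per call, so the slow Order call times out.
    print(await run_chain(slow_order_fast_payment, [0.25, 0.25]))
    # Time-limiter wrap: 0.35 s per call inside a 0.5 s outer limit.
    print(await run_chain(slow_order_fast_payment, [0.35, 0.35], outer_limit_s=0.5))
    # Both calls slow: each passes its own timeout, but the outer limit fires.
    print(await run_chain([0.30, 0.30], [0.35, 0.35], outer_limit_s=0.5))
```

Calling `asyncio.run(demo())` exercises the three cases. The first request fails under budget sharing even though the chain as a whole would have finished inside the SLA, the second succeeds under the wrap, and the third shows the outer limiter enforcing the SLA when both downstreams are slow at once.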

When budget sharing is the right choice

  • Downstream latencies sum to more than the caller's SLA even at p50 — there is no slack; the only correct choice is to declare budgets explicitly.
  • Downstream tails are correlated (shared infrastructure, global resource pressure) so the time-limiter bet doesn't pay off.
  • Downstream tails are fat enough that, under the time-limiter approach, the chain often exceeds the aggregate budget and the outer limiter fires.
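The correlation point in the list above can be illustrated with a small seeded simulation. This is a sketch under loud assumptions: latencies are modelled as log-normal with a hypothetical 200 ms median fitted to the example's 700 ms p99.9, "correlated" is modelled as the extreme case where both downstreams see the same latency on a given request, and `failure_rates` is a hypothetical helper.

```python
import math
import random
from statistics import NormalDist
from typing import Tuple

SLA_MS = 1000.0
P999_MS = 700.0
MEDIAN_MS = 200.0  # hypothetical; not given in the source
SIGMA = math.log(P999_MS / MEDIAN_MS) / NormalDist().inv_cdf(0.999)
MU = math.log(MEDIAN_MS)


def failure_rates(correlated: bool, n: int = 100_000, seed: int = 0) -> Tuple[float, float]:
    """Return (budget-sharing failure rate, time-limiter-wrap failure rate)."""
    rng = random.Random(seed)
    share_fail = 0
    wrap_fail = 0
    for _ in range(n):
        order = rng.lognormvariate(MU, SIGMA)
        # Fully correlated: Payment is exactly as slow as Order on this request.
        payment = order if correlated else rng.lognormvariate(MU, SIGMA)
        # Budget sharing: each call is capped at half the SLA.
        if order > SLA_MS / 2 or payment > SLA_MS / 2:
            share_fail += 1
        # Wrap: each call is capped at p99.9, and the chain at the SLA.
        if order > P999_MS or payment > P999_MS or order + payment > SLA_MS:
            wrap_fail += 1
    return share_fail / n, wrap_fail / n


share_indep, wrap_indep = failure_rates(correlated=False)
share_corr, wrap_corr = failure_rates(correlated=True)
```

With independent tails the wrap fails noticeably less often than budget sharing in this model. With fully correlated tails the two rates coincide, because a chain of two equal latencies exceeds 1000 ms exactly when each call exceeds 500 ms, so the wrap's bet on non-simultaneous tails buys nothing.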

Seen in
