
Availability multiplication of dependencies

Definition

A service that synchronously depends on N downstream components has its request availability upper-bounded by the product of the components' individual availabilities. Two 99.9% dependencies on the critical path cap the caller at 99.9% × 99.9% ≈ 99.8%. Every downstream hop that must succeed for the caller to return a 2xx adds one more factor to the product and lowers the ceiling further.

The arithmetic is obvious; the architectural consequence is not: teams often compose services freely on the synchronous path, unaware that each additional "must-succeed" downstream call erodes their SLO.

Arithmetic

For independent failures (approximation — in practice failures cluster on shared infra, but the directional point stands):

N deps @ 99.9%    Availability ceiling
1                 99.9%
2                 99.8%
3                 99.7%
5                 99.5%
10                99.0%

For mixed levels, multiply: A = A1 × A2 × ... × An.
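A minimal sketch of the product rule in Python (the function name is ours); it reproduces the table above:

```python
def availability_ceiling(*availabilities: float) -> float:
    """Upper bound on a caller's availability when every listed
    dependency must succeed (assumes independent failures)."""
    ceiling = 1.0
    for a in availabilities:
        ceiling *= a
    return ceiling

# Reproduce the table above: N dependencies at 99.9% each.
for n in (1, 2, 3, 5, 10):
    print(n, f"{availability_ceiling(*[0.999] * n):.3%}")
# 1 99.900%
# 2 99.800%   (99.8001%, the two-dependency Zalando example)
# 3 99.700%
# 5 99.501%
# 10 99.004%
```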

The Zalando Payments framing

Zalando Payments' Order Store service originally performed two synchronous operations per REST call: write to DynamoDB and publish a change event to Nakadi (Zalando's Kafka-backed event bus). With both downstreams individually at 99.9%, the service was capped at 99.8%.

"As the availability of a service is the product of the availabilities of its dependencies, the more dependencies a service has, the lesser is its own availability. Let's assume DynamoDB and the message bus have availabilities of 99.9% each. Thus, the maximum availability for the service is 99.9% × 99.9% = 99.8%." (Source: sources/2022-02-02-zalando-utilizing-amazon-dynamodb-and-aws-lambda-for-asynchronous-event-publication)

The fix: push Nakadi publication off the synchronous path using the transactional outbox pattern — the service's critical path depends only on DynamoDB, so its ceiling returns to 99.9%. Nakadi unavailability can still delay event delivery, but it no longer fails the client's write.
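A sketch of the hot path before and after the change; the table name and handler are hypothetical (the actual Order Store code is not in the source):

```python
# Illustrative sketch only; resource names are placeholders.
import boto3

table = boto3.resource("dynamodb").Table("order-store")

def handle_write(order: dict) -> None:
    # Before: two must-succeed calls on the hot path (ceiling 99.8%).
    #   table.put_item(Item=order)              # DynamoDB @ 99.9%
    #   nakadi.publish("order-changed", order)  # Nakadi @ 99.9%; an outage fails the write
    #
    # After: the only synchronous dependency is DynamoDB (ceiling 99.9%).
    # The change event is derived later from the table's stream.
    table.put_item(Item=order)
```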

The four architectural responses

Once the product-ceiling problem is recognised, four moves shrink the dependency count on the critical path:

  1. Decouple via durable log / outbox. Write once to a local durable store; a separate relay reads and fans out to secondary sinks asynchronously. See patterns/transactional-outbox, patterns/dynamodb-streams-plus-lambda-outbox-relay.
  2. Cache the answer. If the downstream is a read, keep a recent answer and serve it during an outage; a sketch follows this list. See concepts/cache-for-availability.
  3. Fail open where safe. Degraded-but-up beats down for non-critical paths (typically analytics, suggestions, personalisation).
  4. Shed the dependency entirely. Sometimes the right move is to ask "does this call belong on the hot path?" — the answer is often no.
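A minimal sketch combining responses 2 and 3, assuming a plain dict as the cache and a non-critical read; all names are illustrative:

```python
from typing import Any, Callable, Optional

def cached_fail_open(fetch: Callable[[], Any],
                     cache: dict, key: str) -> Optional[Any]:
    """Prefer a fresh answer; serve the last known one during an outage;
    fail open (None) for a non-critical feature if nothing is cached."""
    try:
        cache[key] = fetch()   # happy path: refresh the cache
    except Exception:
        pass                   # downstream outage: fall through to the cache
    return cache.get(key)      # stale answer, or None (degraded-but-up)
```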

Why the arithmetic understates reality

In practice the ceiling is worse than the product:

  • Retry amplification. A downstream latency spike causes upstream retries, which increase downstream load, which causes more retries; a load-multiplier sketch follows this list.
  • Shared failure domains. A single AZ outage takes out multiple "independent" dependencies at once.
  • Tail-latency propagation. Even a healthy downstream with p99.9 = 1 s blows the caller's p99 budget if it sits on the hot path.
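A back-of-the-envelope illustration of the first bullet (the function name is ours): with naive retry-on-failure and independent attempts, the expected downstream calls per upstream request grow exactly when the downstream is already failing.

```python
def load_multiplier(p_fail: float, max_attempts: int) -> float:
    """Expected downstream calls per upstream request when each failed
    attempt is retried, up to max_attempts (independent attempts assumed)."""
    return sum(p_fail ** i for i in range(max_attempts))

print(load_multiplier(0.001, 3))  # ~1.001: healthy downstream, retries are cheap
print(load_multiplier(0.9, 3))    # ~2.71: a struggling downstream gets ~3x the load
```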

So 99.9% × 99.9% = 99.8% is a best case, not a typical case.

The inverse: every removed dependency buys reliability

Moving Nakadi off Order Store's critical path is a 0.1-percentage-point availability gain on paper (99.8% → 99.9%). In practice the win is larger, because Nakadi outages, even brief ones, no longer produce 5xx at the REST API. The write still succeeds; the event just arrives later via DynamoDB Streams + a Lambda relay.
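A sketch of what such a relay could look like as a Lambda handler on the table's stream; the Nakadi URL, event type, and auth handling are placeholders, not Zalando's actual configuration:

```python
import json
import os
import urllib.request

# Placeholder, e.g. https://nakadi.example/event-types/order-changed/events
NAKADI_URL = os.environ["NAKADI_URL"]

def handler(event, context):
    # A DynamoDB Streams trigger delivers a batch of committed writes.
    events = [
        {"new_image": record["dynamodb"]["NewImage"]}
        for record in event["Records"]
        if record["eventName"] in ("INSERT", "MODIFY")
    ]
    if not events:
        return
    req = urllib.request.Request(
        NAKADI_URL,
        data=json.dumps(events).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # A raised exception makes Lambda retry the batch, so delivery is
    # at-least-once and possibly delayed; never a 5xx on the original write.
    urllib.request.urlopen(req)
```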
