CONCEPT Cited by 1 source
Fail-fast principle¶
Definition¶
Fail-fast is the design principle that when a system cannot satisfy a request within its quality bounds, it should surface the failure as quickly as possible rather than holding the caller in an indefinite wait. The client retries, invokes a fallback, or degrades gracefully — on a deterministic timeline set by the system designer, not by an unpredictable downstream.
Zalando's timeouts post frames it this way:
"A successful response, even if it takes time, is better than a timeout error. Hmm… not always, it depends! First of all, if your server does not respond or takes too long to respond, nobody will wait for it. Instead of challenging the patience of your users, follow the fail-fast principle. Let your clients retry or handle an error on their side. When possible return a fallback value." (Source: sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts)
Why it matters structurally¶
The opposite of fail-fast is fail-slow: the caller waits and holds pool resources (threads, HTTPS connections, DB connections) while a dead or overloaded peer times out on its own schedule. This drains the caller's pools and cascades failure upstream — a pattern captured in concepts/thread-pool-exhaustion and concepts/connection-pool-exhaustion.
Explicit, aggressive connection timeouts and request timeouts are the practical mechanism. They convert "how long does it take to learn this downstream is broken?" from a downstream-controlled variable into a client-controlled one.
Trade-offs¶
- Too aggressive: false failures during transient network blips or service startup. A request-timeout set below p99.9 latency guarantees a visible false-timeout rate.
- Too lenient: resource pinning, eventual pool exhaustion, cascading failure.
The Zalando post's synthesis: "The default timeout is your enemy, always set timeouts explicitly!" Fail-fast is not about panicking early; it is about making the failure boundary explicit and tunable.
Pairing with fallbacks¶
Fail-fast is most useful when paired with a pre-designed fallback: a cached value, a default response, a degraded feature, or a retry via a different path. Failing fast to no fallback is just failing quickly — still preferable to failing slowly, but lower value than failing fast to graceful degradation.
Seen in¶
- sources/2023-07-25-zalando-all-you-need-to-know-about-timeouts — canonical Zalando house-style framing: "follow the fail-fast principle. Let your clients retry or handle an error on their side."
Related¶
- concepts/connection-timeout / concepts/request-timeout — the mechanisms by which fail-fast is implemented on remote calls.
- concepts/thread-pool-exhaustion / concepts/connection-pool-exhaustion — the failure modes fail-slow produces.
- concepts/graceful-degradation — the companion principle on the response side.
- patterns/explicit-timeout-on-remote-calls — the house-style rule that operationalises fail-fast.
- patterns/circuit-breaker — escalates fail-fast from per-request to per-dependency.