PATTERN
Infinite retry by failure class¶
Intent¶
Apply a different retry policy per failure class instead of one-size-fits-all retry-with-backoff — so that genuinely transient failures retry forever, genuinely deterministic failures retry only a bounded number of times, and failures that admit preprocessing retry with that preprocessing applied (but only from the second attempt onward, to avoid the overhead on the common case).
The pattern requires a well-defined provider failure taxonomy — enumerated failure classes with per-class cause-of-failure analysis — as the input on which to branch.
When to use¶
- API returns typed failures with distinguishable classes (timeout vs rate-limit vs bad-input vs unreachable-dependency).
- Some classes are unboundedly transient (rate-limit, timeout on slow upstream); some are deterministic (content-filtered prompt, malformed input); some admit preprocessing (image-URL check).
- Budget to burn on retries is significant (money-per-call or wall-clock SLA).
- Durable substrate available (concepts/durable-execution) so infinite-retry loops actually survive caller restarts.
Mechanics¶
Canonical realisation: Instacart Maple for LLM batch API task-level failures (sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple|2025-08-27):
| Class | Cause | Policy |
|---|---|---|
| Expired | Provider fails to return within 24h SLA | Infinite retry (new batch) |
| Rate-limited | Caller hit provider's token-per-minute limit | Infinite retry |
| Refused | Provider refuses — bad params, filtered content | Bounded: max 2× |
| Invalid image | Image URL dead or unreachable | Optional retry; on retry, check image exists before resubmitting — deferred because "checking each image in a large batch can add significant overhead" |
Two design choices encoded:
- Per-class retry budget — some classes infinite, some bounded.
- Per-class preprocessing — image-check deferred to retry #2 only, because applying it preemptively burns CPU + network on the common case where it would have succeeded anyway.
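Both choices fit in a plain policy table keyed by failure class. A minimal sketch, assuming hypothetical names throughout (`ClassPolicy`, `check_image_exists`, and the class keys are illustrative, not Maple's actual code):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class ClassPolicy:
    max_attempts: Optional[int]              # None = retry forever
    preprocess: Optional[Callable] = None    # runs before a retry, never on attempt #1

def check_image_exists(task):
    """Hypothetical preprocessing: HEAD the image URL before resubmitting."""
    ...

# One entry per failure class, mirroring the table above.
POLICIES = {
    "expired":       ClassPolicy(max_attempts=None),
    "rate_limited":  ClassPolicy(max_attempts=None),
    "refused":       ClassPolicy(max_attempts=2),
    "invalid_image": ClassPolicy(max_attempts=2, preprocess=check_image_exists),
}
```

Encoding the policy as data keeps the retry mechanism generic; adding a new failure class is a table entry, not a new code path.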
Why not exponential-backoff-with-retry-cap?¶
The naive retry-with-backoff loop:
```python
for attempt in range(1, MAX_RETRIES + 1):
    result = call(input)
    if result.ok:
        return result
    if result.retriable:
        time.sleep(2 ** attempt)   # exponential backoff
    else:
        return result.err
```
…is a shape that works when failure classes are indistinguishable. When classes differ materially — rate-limits take minutes to clear, refusals are deterministic, invalid images might need preprocessing — the one-shape policy fails at each end:
- Caps too low for capacity-issue retries — 5 attempts with exponential backoff tops out at ~30 seconds. Provider rate limits can last minutes to hours. The caller gives up while the remedy is simply more time.
- Caps too high for deterministic refusals — 5 attempts burn 5× the cost on a refusal that will always refuse.
- No preprocessing hook — no place to say "before this retry, validate the URL."
The per-class pattern solves all three by separating policy per class from retry mechanism.
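That separation can be sketched as a generic driver that consumes a policy table; all names here are hypothetical, and the dict-shaped `result`/`policy` records stand in for whatever the real API returns:

```python
import time

def retry_by_class(call, task, policies, base_delay=1.0):
    """Retry call(task), branching on the failure class the API reports."""
    attempt = 0
    while True:
        attempt += 1
        result = call(task)
        if result["ok"]:
            return result
        policy = policies[result["failure_class"]]
        max_attempts = policy.get("max_attempts")      # None = retry forever
        if max_attempts is not None and attempt >= max_attempts:
            return result                              # bounded budget spent
        preprocess = policy.get("preprocess")
        if preprocess is not None:
            preprocess(task)                           # only ever runs before a retry
        time.sleep(min(base_delay * 2 ** attempt, 60))
```

The mechanism (loop, sleep, counting) is written once; the per-class decisions — budget and preprocessing — live entirely in the `policies` table.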
Precondition: the taxonomy has to exist¶
This pattern rests on the API being honest about why a request failed — a concepts/provider-failure-taxonomy with enough classes to distinguish transient from deterministic from preprocessable failures. LLM batch APIs happen to have good taxonomies (four distinct classes named). If an API returns only `{success: bool}`, this pattern degrades back to exponential-backoff-with-retry-cap.
Precondition: durable execution¶
Infinite retry loops only make sense on a substrate that won't lose state to process crashes. Maple uses Temporal; Temporal's activity retry + workflow replay lets the loop literally run forever without the service getting stuck "mid-retry on an old batch" after a restart.
Generalisation¶
The pattern is reusable anywhere an API has a per-class failure taxonomy:
- Email delivery — permanent-bounce (don't retry) vs temporary-bounce (retry) vs quota-exceeded (infinite retry with backoff).
- Payment APIs — insufficient-funds (don't retry), issuer-timeout (retry), card-declined (don't).
- Object storage uploads — 5xx (retry), 403 (don't), `SlowDown` (infinite retry with aggressive backoff).
- Image pipelines — `UrlNotFound` could trigger a `HEAD` check before retry; `Corrupt` never retries.
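In each of these, the first step is the same: map the provider's error onto a retry class. For the object-storage case, a classifier might branch on status and error code; a sketch with illustrative class names (the `SlowDown` error code is real S3 vocabulary, the rest is assumed):

```python
def classify_storage_failure(status, error_code=None):
    """Map an object-storage error to a retry class (names are illustrative)."""
    if error_code == "SlowDown":
        return "infinite_retry"      # back off aggressively, never give up
    if status == 403:
        return "no_retry"            # deterministic: credentials or policy
    if 500 <= status < 600:
        return "retry"               # transient server-side failure
    return "no_retry"
```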
Caveats¶
- Infinite retry needs a human escape hatch — wedged batches (genuinely dead provider, dead URL, dead account) consume loop iterations forever. Production deployments usually add some ceiling (max 30 days, max 1000 attempts) plus alerting on retry depth.
- Budget accounting — batch LLM APIs bill on submission, so the infinite-retry decision is a money decision, not just a reliability decision. Maple's "avoids wasting money on partially completed jobs" framing pairs with this: if Temporal durably tracks which tasks succeeded, infinite retry only spends money on actually-failed tasks.
- Class misclassification is expensive — if the provider returns `refused` when it should have returned `rate-limited`, the caller gives up after 2× when it should retry forever. Taxonomy quality is load-bearing.
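The escape hatch from the first caveat is cheap to express as a guard alongside the "infinite" policy. A sketch with made-up thresholds (`MAX_WALL_CLOCK`, `ALERT_DEPTH`, and `should_keep_retrying` are all hypothetical):

```python
import time

MAX_WALL_CLOCK = 30 * 24 * 3600    # 30-day ceiling on "infinite" (made-up threshold)
ALERT_DEPTH = 1000                 # page a human past this many attempts

def should_keep_retrying(first_attempt_at, attempt, alert=print):
    """Caps an 'infinite' retry loop and surfaces wedged work to humans."""
    if time.time() - first_attempt_at > MAX_WALL_CLOCK:
        return False                # likely wedged: stop looping, escalate
    if attempt == ALERT_DEPTH:
        alert(f"task retried {attempt} times; investigate")
    return True
```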
Seen in¶
- sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple — canonical wiki instance. Maple enumerates four failure classes with distinct policies (two infinite, one bounded-at-2, one conditional-with-preprocessing).
Related¶
- patterns/llm-batch-processing-service — the parent service pattern.
- concepts/provider-failure-taxonomy — the input taxonomy.
- concepts/durable-execution — the substrate assumption.
- concepts/llm-batch-api — the API shape where this first canonicalises.
- systems/maple-instacart — canonical system.
- companies/instacart — operator.