PATTERN

Infinite retry by failure class

Intent

Apply a different retry policy per failure class instead of one-size-fits-all retry-with-backoff, so that genuinely transient failures retry forever, genuinely deterministic failures retry only a bounded number of times, and failures that admit preprocessing retry with that preprocessing applied (but only from the second attempt onward, to avoid overhead on the common case).

The pattern requires a well-defined provider failure taxonomy — enumerated failure classes with per-class cause-of-failure analysis — as the input on which to branch.

When to use

  • API returns typed failures with distinguishable classes (timeout vs rate-limit vs bad-input vs unreachable-dependency).
  • Some classes are unboundedly transient (rate-limit, timeout on slow upstream); some are deterministic (content-filtered prompt, malformed input); some admit preprocessing (image-URL check).
  • Budget to burn on retries is significant (money-per-call or wall-clock SLA).
  • Durable substrate available (concepts/durable-execution) so infinite-retry loops actually survive caller restarts.

Mechanics

Canonical realisation: Instacart's Maple, for LLM batch-API task-level failures (sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple|2025-08-27):

| Class | Cause | Policy |
| --- | --- | --- |
| Expired | Provider fails to return within 24h SLA | Infinite retry (new batch) |
| Rate-limited | Caller hit provider's token-per-minute limit | Infinite retry |
| Refused | Provider refuses — bad params, filtered content | Bounded: max 2× |
| Invalid image | Image URL dead or unreachable | Optional retry; on retry, check the image exists before resubmitting — deferred because "checking each image in a large batch can add significant overhead" |

Two design choices encoded:

  • Per-class retry budget — some classes infinite, some bounded.
  • Per-class preprocessing — image-check deferred to retry #2 only, because applying it preemptively burns CPU + network on the common case where it would have succeeded anyway.
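
Encoded as data, the table reads as a per-class policy map. A minimal sketch with invented names (`RetryPolicy`, `should_retry`, the class strings), not Maple's actual code:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RetryPolicy:
    max_attempts: Optional[int] = None     # None means retry forever
    preprocess: Optional[Callable] = None  # runs before retries, never the first attempt

def check_image_exists(task):
    """Placeholder for the deferred image-URL check."""

POLICIES = {
    "expired":       RetryPolicy(),                # infinite retry (new batch)
    "rate_limited":  RetryPolicy(),                # infinite retry
    "refused":       RetryPolicy(max_attempts=2),  # bounded: max 2x
    "invalid_image": RetryPolicy(max_attempts=2, preprocess=check_image_exists),
}

def should_retry(failure_class: str, attempts_so_far: int) -> bool:
    cap = POLICIES[failure_class].max_attempts
    return cap is None or attempts_so_far < cap
```

The budget and the preprocessing hook live in the same row, so adding a fifth failure class is a one-line change.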

Why not exponential-backoff-with-retry-cap?

The naive retry-with-backoff loop:

import time

def retry_with_backoff(call, task, max_retries=5):
    for attempt in range(1, max_retries + 1):
        result = call(task)
        if result.ok:
            return result
        if not result.retriable:
            raise result.err
        time.sleep(2 ** attempt)  # exponential backoff
    raise result.err

…is a shape that works when failure classes are indistinguishable. When classes differ materially — rate-limits take minutes to clear, refusals are deterministic, invalid images might need preprocessing — the one-shape policy fails at each end:

  • Caps too low for capacity-issue retries — 5 attempts with exponential backoff tops out at ~30 seconds, while provider rate limits can last minutes to hours. The caller gives up while the remedy is simply more time.
  • Caps too high for deterministic refusals — 5 attempts burn 5× the cost on a refusal that will always refuse.
  • No preprocessing hook — no place to say "before this retry, validate the URL."

The per-class pattern solves all three by separating per-class policy from the retry mechanism.
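
That separation can be sketched as one mechanism that consults a policy table; the names here (`retry_by_class`, the policy dict keys) are illustrative, not from the source:

```python
import time

def retry_by_class(call, task, policies, sleep=time.sleep):
    """One retry mechanism; per-class policy decides budget, backoff, preprocessing."""
    attempt = 0
    while True:
        attempt += 1
        result = call(task)
        if result["ok"]:
            return result
        policy = policies[result["class"]]
        cap = policy.get("max_attempts")   # None = infinite
        if cap is not None and attempt >= cap:
            raise RuntimeError(f"gave up on {result['class']} after {attempt} attempts")
        if policy.get("preprocess"):
            policy["preprocess"](task)     # hook: e.g. validate the URL before retrying
        sleep(policy.get("backoff_s", 0))
```

The loop itself never changes; only the table does, which is what makes the budget and preprocessing decisions reviewable as data.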

Precondition: the taxonomy has to exist

This pattern rests on the API being honest about why a request failed — a concepts/provider-failure-taxonomy with enough classes to distinguish transient from deterministic from preprocessable. LLM batch APIs happen to have good taxonomies (four distinct classes named). If an API returns only {success: bool}, this pattern degrades back to exponential-backoff-with-retry-cap.

Precondition: durable execution

Infinite retry loops only make sense on a substrate that won't lose state to process crashes. Maple uses Temporal; Temporal's activity retry + workflow replay lets the loop literally run forever without the service getting stuck "mid-retry-on-an-old-batch" on restart.

Generalisation

The pattern is reusable anywhere an API has a per-class failure taxonomy:

  • Email delivery — permanent-bounce (don't retry) vs temporary-bounce (retry) vs quota-exceeded (infinite retry with backoff).
  • Payment APIs — insufficient-funds (don't retry), issuer-timeout (retry), card-declined (don't).
  • Object storage uploads — 5xx (retry), 403 (don't), SlowDown (infinite retry with aggressive backoff).
  • Image pipelines — UrlNotFound could trigger a HEAD check before retry; Corrupt never retries.
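
Each of these drops into the same policy-table shape. For instance, the email case might look like the following, with class names and caps invented for illustration:

```python
# Hypothetical email-delivery policies in the same shape:
# max_attempts None = infinite, 0 = never retry.
EMAIL_POLICIES = {
    "permanent_bounce": {"max_attempts": 0},                      # address gone: never retry
    "temporary_bounce": {"max_attempts": 5},                      # mailbox full, greylisting
    "quota_exceeded":   {"max_attempts": None, "backoff_s": 60},  # infinite, with backoff
}

def email_should_retry(failure_class: str, attempts_so_far: int) -> bool:
    cap = EMAIL_POLICIES[failure_class]["max_attempts"]
    return cap is None or attempts_so_far < cap
```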

Caveats

  • Infinite retry needs a human escape hatch — wedged batches (genuinely dead provider, dead URL, dead account) consume loop iterations forever. Production deployments usually add some ceiling (max 30 days, max 1000 attempts) plus alerting on retry depth.
  • Budget accounting — batch LLM APIs bill on submission; the infinite-retry decision is a money decision, not just a reliability decision. Maple's "avoids wasting money on partially completed jobs" framing from Temporal durable execution pairs with this — if Temporal durably tracks which tasks succeeded, infinite retry only spends money on actually-failed tasks.
  • Class misclassification is expensive — if the provider returns a refused when it should have returned rate-limited, the caller gives up after 2× when it should retry forever. Taxonomy quality is load-bearing.
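
The escape-hatch caveat can be made concrete: even an "infinite" class gets a hard ceiling plus an alert threshold. Both numbers below are invented, not from the source:

```python
import logging

log = logging.getLogger("retry")

ATTEMPT_CEILING = 1000   # hypothetical hard stop, even for "infinite" classes
ALERT_DEPTH = 50         # surface to a human well before the ceiling

def guarded_should_retry(attempt: int) -> bool:
    """Infinite-in-spirit retry with a safety ceiling and retry-depth alerting."""
    if attempt >= ALERT_DEPTH:
        log.warning("retry depth %d: batch may be wedged", attempt)
    return attempt < ATTEMPT_CEILING
```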
