PATTERN

Infinite retry by failure class

Intent

Apply a different retry policy per failure class instead of one-size-fits-all retry-with-backoff, so that genuinely transient failures retry forever, genuinely deterministic failures retry only a bounded number of times, and failures that admit preprocessing retry with that preprocessing applied (but only from the second attempt onward, to avoid overhead on the common case).

The pattern requires a well-defined provider failure taxonomy — enumerated failure classes with per-class cause-of-failure analysis — as the input on which to branch.

When to use

  • API returns typed failures with distinguishable classes (timeout vs rate-limit vs bad-input vs unreachable-dependency).
  • Some classes are unboundedly transient (rate-limit, timeout on slow upstream); some are deterministic (content-filtered prompt, malformed input); some admit preprocessing (image-URL check).
  • Budget to burn on retries is significant (money-per-call or wall-clock SLA).
  • Durable substrate available (concepts/durable-execution) so infinite-retry loops actually survive caller restarts.

Mechanics

Canonical realisation: Instacart's Maple, for LLM batch-API task-level failures (sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple|2025-08-27):

| Class | Cause | Policy |
| --- | --- | --- |
| Expired | Provider fails to return within 24h SLA | Infinite retry (new batch) |
| Rate-limited | Caller hit provider's token-per-minute limit | Infinite retry |
| Refused | Provider refuses — bad params, filtered content | Bounded: max 2× |
| Invalid image | Image URL dead or unreachable | Optional retry; on retry, check the image exists before resubmitting — deferred because "checking each image in a large batch can add significant overhead" |

Two design choices encoded:

  • Per-class retry budget — some classes infinite, some bounded.
  • Per-class preprocessing — image-check deferred to retry #2 only, because applying it preemptively burns CPU + network on the common case where it would have succeeded anyway.
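
Encoded as data, the table reads as a per-class policy map. A minimal sketch with invented names (`RetryPolicy`, `should_retry`, the class strings), not Maple's actual code:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RetryPolicy:
    max_attempts: Optional[int] = None     # None means retry forever
    preprocess: Optional[Callable] = None  # runs before retries, never the first attempt

def check_image_exists(task):
    """Placeholder for the deferred image-URL check."""

POLICIES = {
    "expired":       RetryPolicy(),                # infinite retry (new batch)
    "rate_limited":  RetryPolicy(),                # infinite retry
    "refused":       RetryPolicy(max_attempts=2),  # bounded: max 2x
    "invalid_image": RetryPolicy(max_attempts=2, preprocess=check_image_exists),
}

def should_retry(failure_class: str, attempts_so_far: int) -> bool:
    cap = POLICIES[failure_class].max_attempts
    return cap is None or attempts_so_far < cap
```

The budget and the preprocessing hook live in the same row, so adding a fifth failure class is a one-line change.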

Why not exponential-backoff-with-retry-cap?

The naive retry-with-backoff loop:

import time

def retry_with_backoff(call, task, max_retries=5):
    for attempt in range(1, max_retries + 1):
        result = call(task)
        if result.ok:
            return result
        if not result.retriable:
            raise result.err
        time.sleep(2 ** attempt)  # exponential backoff
    raise result.err

…is a shape that works when failure classes are indistinguishable. When classes differ materially — rate-limits take minutes to clear, refusals are deterministic, invalid images might need preprocessing — the one-shape policy fails at each end:

  • Caps too low for capacity-issue retries — 5 attempts with exponential backoff tops out at ~30 seconds, while provider rate limits can last minutes to hours. The caller gives up while the remedy is simply more time.
  • Caps too high for deterministic refusals — 5 attempts burn 5× the cost on a refusal that will always refuse.
  • No preprocessing hook — no place to say "before this retry, validate the URL."

The per-class pattern solves all three by separating per-class policy from the retry mechanism.
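
That separation can be sketched as one mechanism that consults a policy table; the names here (`retry_by_class`, the policy dict keys) are illustrative, not from the source:

```python
import time

def retry_by_class(call, task, policies, sleep=time.sleep):
    """One retry mechanism; per-class policy decides budget, backoff, preprocessing."""
    attempt = 0
    while True:
        attempt += 1
        result = call(task)
        if result["ok"]:
            return result
        policy = policies[result["class"]]
        cap = policy.get("max_attempts")   # None = infinite
        if cap is not None and attempt >= cap:
            raise RuntimeError(f"gave up on {result['class']} after {attempt} attempts")
        if policy.get("preprocess"):
            policy["preprocess"](task)     # hook: e.g. validate the URL before retrying
        sleep(policy.get("backoff_s", 0))
```

The loop itself never changes; only the table does, which is what makes the budget and preprocessing decisions reviewable as data.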

Precondition: the taxonomy has to exist

This pattern rests on the API being honest about why a request failed — a concepts/provider-failure-taxonomy with enough classes to distinguish transient from deterministic from preprocessable. LLM batch APIs happen to have good taxonomies (four distinct classes named). If an API returns only {success: bool}, this pattern degrades back to exponential-backoff-with-retry-cap.

Precondition: durable execution

Infinite retry loops only make sense on a substrate that won't lose state to process crashes. Maple uses Temporal; Temporal's activity retry + workflow replay lets the loop literally run forever without the service getting stuck "mid-retry-on-an-old-batch" on restart.

Generalisation

The pattern is reusable anywhere an API has a per-class failure taxonomy:

  • Email delivery — permanent-bounce (don't retry) vs temporary-bounce (retry) vs quota-exceeded (infinite retry with backoff).
  • Payment APIs — insufficient-funds (don't retry), issuer-timeout (retry), card-declined (don't).
  • Object storage uploads — 5xx (retry), 403 (don't), SlowDown (infinite retry with aggressive backoff).
  • Image pipelines — UrlNotFound could trigger a HEAD check before retry; Corrupt never retries.
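
Each of these drops into the same policy-table shape. For instance, the email case might look like the following, with class names and caps invented for illustration:

```python
# Hypothetical email-delivery policies in the same shape:
# max_attempts None = infinite, 0 = never retry.
EMAIL_POLICIES = {
    "permanent_bounce": {"max_attempts": 0},                      # address gone: never retry
    "temporary_bounce": {"max_attempts": 5},                      # mailbox full, greylisting
    "quota_exceeded":   {"max_attempts": None, "backoff_s": 60},  # infinite, with backoff
}

def email_should_retry(failure_class: str, attempts_so_far: int) -> bool:
    cap = EMAIL_POLICIES[failure_class]["max_attempts"]
    return cap is None or attempts_so_far < cap
```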

Caveats

  • Infinite retry needs a human escape hatch — wedged batches (genuinely dead provider, dead URL, dead account) consume loop iterations forever. Production deployments usually add some ceiling (max 30 days, max 1000 attempts) plus alerting on retry depth.
  • Budget accounting — batch LLM APIs bill on submission; the infinite-retry decision is a money decision, not just a reliability decision. Maple's "avoids wasting money on partially completed jobs" framing from Temporal durable execution pairs with this — if Temporal durably tracks which tasks succeeded, infinite retry only spends money on actually-failed tasks.
  • Class misclassification is expensive — if the provider returns a refused when it should have returned rate-limited, the caller gives up after 2× when it should retry forever. Taxonomy quality is load-bearing.
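
The escape-hatch caveat can be made concrete: even an "infinite" class gets a hard ceiling plus an alert threshold. Both numbers below are invented, not from the source:

```python
import logging

log = logging.getLogger("retry")

ATTEMPT_CEILING = 1000   # hypothetical hard stop, even for "infinite" classes
ALERT_DEPTH = 50         # surface to a human well before the ceiling

def guarded_should_retry(attempt: int) -> bool:
    """Infinite-in-spirit retry with a safety ceiling and retry-depth alerting."""
    if attempt >= ALERT_DEPTH:
        log.warning("retry depth %d: batch may be wedged", attempt)
    return attempt < ATTEMPT_CEILING
```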
