CONCEPT Cited by 1 source

Provider failure taxonomy

Definition

A provider failure taxonomy is a classification of the distinct failure modes an external API can return. It lets the caller choose a class-specific retry policy instead of a one-size-fits-all retry loop. The caller gains two things: (a) the ability to cap the retry budget per class (some classes are worth retrying indefinitely; others will probably return the same result on every attempt); (b) the ability to run class-specific preprocessing between retries (e.g. checking whether an image URL still exists before resubmitting).

Canonical LLM-batch-API taxonomy (Instacart Maple)

The 2025-08-27 Maple post enumerates four task-level failure classes an LLM batch API can return, each with its own retry policy (Source):

| Class | Cause | Default policy | Rationale |
| --- | --- | --- | --- |
| Expired | Provider fails to return within the 24 h SLA; returns an error / expired message with no result | Retry infinitely in a new batch | Transient provider-side issue; resubmission is the entire remedy |
| Rate-limited | Caller hit the provider's token-per-minute ceiling | Retry infinitely | Purely a capacity-scheduling issue; backoff resolves it |
| Refused | Provider refuses to execute: bad params, content-filtered prompt/image | Retry max 2× | "probably return the same result": no point burning budget on deterministic-looking refusals |
| Invalid image | Image URL dead or unreachable at provider | Retry optional; on retry, Maple checks image existence before resubmitting | Per-URL validation has a cost, so defer it to retry #2 only: "checking each image in a large batch can add significant overhead" |
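The table above can be read as a lookup from failure class to retry policy. The following is a minimal sketch of that idea, not Maple's implementation: the names `FailureClass`, `RetryPolicy`, `POLICIES` and `should_retry` are hypothetical, and the cap of 1 for invalid images is an assumption, since the source only says that retry is "optional".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    EXPIRED = "expired"
    RATE_LIMITED = "rate_limited"
    REFUSED = "refused"
    INVALID_IMAGE = "invalid_image"


@dataclass(frozen=True)
class RetryPolicy:
    max_retries: Optional[int]    # None means no cap (retry indefinitely)
    precheck_image: bool = False  # validate the image URL before resubmitting


POLICIES = {
    FailureClass.EXPIRED: RetryPolicy(max_retries=None),
    FailureClass.RATE_LIMITED: RetryPolicy(max_retries=None),
    FailureClass.REFUSED: RetryPolicy(max_retries=2),
    # Assumption: the source calls this retry "optional"; 1 is illustrative.
    FailureClass.INVALID_IMAGE: RetryPolicy(max_retries=1, precheck_image=True),
}


def should_retry(failure: FailureClass, retries_so_far: int) -> bool:
    """Return True if the class's policy still allows another retry."""
    policy = POLICIES[failure]
    return policy.max_retries is None or retries_so_far < policy.max_retries
```

Keeping the policy as data rather than as branches in the retry loop means that adding a fifth class, or tuning a cap, touches one dictionary entry only.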

Two axes distinguish these classes:

  • Likelihood the next attempt succeeds: high for expired / rate-limited (transient, pure timing), low for refused (likely deterministic).
  • Preprocessing cost between retries: zero for expired / rate-limited, non-trivial for invalid-image (HEAD-check each URL). Deferring the check until retry #2 lets the common case (URL was fine, transient glitch) proceed without the overhead.

Why taxonomy matters

Without the taxonomy, callers default to either:

  • Exponential backoff with a retry cap — cheap to build, but gives up too early on transient capacity issues (rate-limits can take minutes-to-hours to clear) and wastes budget on deterministic refusals.
  • Infinite retry across all errors — burns money on refusals that will never succeed.

The taxonomy lets each class get the policy it deserves. It's the error-handling equivalent of "don't retry 4xx, do retry 5xx" on HTTP, but with more structure: the same HTTP 429 can be a provider-side capacity issue (infinite retry appropriate) or a caller-side misconfiguration (bounded retry).
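A toy calculation makes the trade-off concrete. The workload below is invented purely for illustration (it is not data from the Maple post): each task has a failure class and the retry number on which it would finally succeed, with `None` meaning it never succeeds. The function names are hypothetical.

```python
import math
from typing import Callable, List, Optional, Tuple

# (failure class, retry number on which the task succeeds; None = never)
Task = Tuple[str, Optional[int]]

WORKLOAD: List[Task] = [
    ("rate_limited", 6),
    ("rate_limited", 2),
    ("expired", 4),
    ("refused", None),
    ("refused", None),
]


def run(workload: List[Task],
        cap_for: Callable[[str], Optional[int]]) -> Tuple[int, float]:
    """Return (tasks resolved, retries spent on tasks that never resolve)."""
    resolved, wasted = 0, 0.0
    for failure_class, succeeds_on in workload:
        cap = cap_for(failure_class)  # max retries; None = unbounded
        if succeeds_on is not None and (cap is None or succeeds_on <= cap):
            resolved += 1
        elif cap is None:
            wasted = math.inf  # unbounded retries on a hopeless task
        else:
            wasted += cap      # every allowed retry was spent for nothing
    return resolved, wasted


def uniform_cap(_: str) -> Optional[int]:
    return 3      # same cap for every class


def infinite(_: str) -> Optional[int]:
    return None   # never give up on anything


def taxonomy(failure_class: str) -> Optional[int]:
    return 2 if failure_class == "refused" else None
```

On this workload `run(WORKLOAD, uniform_cap)` gives `(1, 12.0)`, `run(WORKLOAD, infinite)` gives `(3, inf)`, and `run(WORKLOAD, taxonomy)` gives `(3, 4.0)`: the class-aware policy resolves every recoverable task while bounding the spend on refusals.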

Generalisation beyond LLMs

The pattern is reusable anywhere a provider API returns typed failures. Examples where a similar taxonomy would help:

  • Email delivery: temporary bounce (retry) vs permanent bounce (don't) vs soft-bounce-quota (infinite retry with backoff).
  • Payment APIs: insufficient funds (don't retry) vs issuer timeout (infinite retry).
  • Image CDN: 404 on uncached object (backend might catch up) vs 403 (don't retry).
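For the email case, the first digit of the SMTP reply code already gives a coarse version of this taxonomy: 4yz replies are transient and 5yz replies are permanent. The sketch below layers the quota class on top using the enhanced status code `4.2.2` (mailbox full). The function name, the class labels, and the choice of `4.2.2` as the quota signal are assumptions for illustration.

```python
from typing import Optional


def classify_bounce(reply_code: int, enhanced_status: Optional[str] = None) -> str:
    """Map an SMTP reply to one of the three bounce classes listed above."""
    if enhanced_status == "4.2.2":
        # Mailbox full: assumed here to be the "soft-bounce-quota" class.
        return "soft_bounce_quota"
    if 400 <= reply_code < 500:
        return "temporary"   # transient negative reply: retry
    if 500 <= reply_code < 600:
        return "permanent"   # permanent negative reply: do not retry
    raise ValueError(f"not a failure reply code: {reply_code}")
```

The returned label would then index a policy table in the same way as the LLM-batch classes.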

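The LLM-batch case is notable because the classes differ not only in retry budget but also in what happens between retries: the invalid-image class adds a preprocessing step (the image check on retry #2) that the other classes do not need. That's the second-order sophistication the taxonomy unlocks.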
