CONCEPT Cited by 1 source

Provider failure taxonomy

Definition

A provider failure taxonomy is a classification of the distinct failure modes an external API can return, so that the caller can choose a class-specific retry policy instead of a one-size-fits-all retry loop. The caller gains two things: (a) the ability to set the retry budget per class (some classes are worth retrying indefinitely; others will "probably return the same result" and deserve a small cap); (b) the ability to run class-specific preprocessing between retries (e.g. checking that an image URL still exists before resubmitting).

Canonical LLM-batch-API taxonomy (Instacart Maple)

The 2025-08-27 Maple post enumerates four task-level failure classes an LLM batch API can return, each with its own retry policy (Source):

Class	Cause	Default policy	Rationale
Expired	Provider fails to return within 24 h SLA; returns an error / expired message with no result	Retry infinitely in a new batch	Transient provider-side issue; resubmission is the entire remedy
Rate-limited	Caller hit the provider's token-per-minute ceiling	Retry infinitely	Purely a capacity-scheduling issue; backoff resolves it
Refused	Provider refuses to execute — bad params, content-filtered prompt/image	Retry max 2×	"probably return the same result" — no point burning budget on deterministic-looking refusals
Invalid image	Image URL dead or unreachable at provider	Retry optional; on retry, Maple checks image existence before resubmitting	Per-URL validation has cost; defer to retry #2 only — "checking each image in a large batch can add significant overhead"
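The table can be written down as data, which makes the per-class budgets explicit. The following is a minimal sketch, not Maple's implementation: the names `FailureClass`, `RetryPolicy`, `POLICIES`, and `should_retry` are illustrative, and the cap of 2 for invalid-image failures is a placeholder assumption, since the table only says retry is optional.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    EXPIRED = "expired"
    RATE_LIMITED = "rate_limited"
    REFUSED = "refused"
    INVALID_IMAGE = "invalid_image"


@dataclass(frozen=True)
class RetryPolicy:
    max_retries: Optional[int]    # None means no cap ("retry infinitely")
    precheck_image: bool = False  # validate image URLs before resubmitting


# One policy per class, mirroring the "Default policy" column.
POLICIES = {
    FailureClass.EXPIRED: RetryPolicy(max_retries=None),
    FailureClass.RATE_LIMITED: RetryPolicy(max_retries=None),
    FailureClass.REFUSED: RetryPolicy(max_retries=2),
    # Assumption: the source gives no cap here; 2 is a placeholder.
    FailureClass.INVALID_IMAGE: RetryPolicy(max_retries=2, precheck_image=True),
}


def should_retry(failure: FailureClass, retries_so_far: int) -> bool:
    """Return True if this failure class still has retry budget left."""
    policy = POLICIES[failure]
    return policy.max_retries is None or retries_so_far < policy.max_retries
```

With this shape, a refused task stops after its second retry, while an expired task keeps being resubmitted regardless of how many attempts have already failed.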

Two axes distinguish these classes:

Likelihood the next attempt succeeds: high for expired / rate-limited (transient, pure timing), low for refused (likely deterministic).
Preprocessing cost between retries: zero for expired / rate-limited, non-trivial for invalid-image (HEAD-check each URL). Deferring the check until retry #2 lets the common case (URL was fine, transient glitch) proceed without the overhead.
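The deferred check on the second axis can be sketched as follows. This is an illustration under stated assumptions rather than Maple's code: `urls_to_resubmit`, `head_ok`, and the `check_from_retry` parameter are hypothetical names, and "retry #2" is read here as "the second retry of a task", which the source does not spell out.

```python
import http.client
import urllib.request
from typing import Callable, Iterable, List


def head_ok(url: str, timeout: float = 5.0) -> bool:
    """Hypothetical existence check: True if a HEAD request returns 2xx."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP 4xx/5xx, DNS and connection errors, timeouts, bad URLs.
        return False


def urls_to_resubmit(
    failed_urls: Iterable[str],
    retry_number: int,
    image_exists: Callable[[str], bool] = head_ok,
    check_from_retry: int = 2,  # assumption: validation starts at retry #2
) -> List[str]:
    """Return the image URLs worth resubmitting on this retry."""
    urls = list(failed_urls)
    if retry_number < check_from_retry:
        # Early retry: skip per-URL validation and resubmit everything.
        return urls
    # Later retry: pay the validation cost and drop URLs that are gone.
    return [u for u in urls if image_exists(u)]
```

Passing the checker as a parameter keeps the network call out of the decision logic, so the deferral rule can be exercised with a stub instead of live HEAD requests.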

Why taxonomy matters

Without the taxonomy, callers default to either:

Exponential backoff with a retry cap — cheap to build, but gives up too early on transient capacity issues (rate-limits can take minutes-to-hours to clear) and wastes budget on deterministic refusals.
Infinite retry across all errors — burns money on refusals that will never succeed.
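A quick calculation shows how little time a capped backoff actually covers. The sketch below assumes a hypothetical 1-second base delay that doubles on every retry, with no jitter; the function name is illustrative.

```python
def total_backoff_seconds(base: float, factor: float, retries: int) -> float:
    """Total time spent waiting across `retries` exponentially growing delays."""
    return sum(base * factor ** i for i in range(retries))


print(total_backoff_seconds(1.0, 2.0, 5))   # 31.0
print(total_backoff_seconds(1.0, 2.0, 10))  # 1023.0
```

Five retries give up after 31 seconds of waiting and ten after about 17 minutes, so a rate limit that takes an hour to clear outlasts both caps.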

The taxonomy lets each class get the policy it deserves. It's the error-handling equivalent of "don't retry 4xx, do retry 5xx" on HTTP, but with more structure: the same HTTP 429 can be a provider-side capacity issue (infinite retry appropriate) or a caller-side misconfiguration (bounded retry).
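For comparison, the HTTP rule of thumb fits in one line, which is also why it is too coarse. A minimal sketch (the function name is illustrative):

```python
def naive_should_retry(status: int) -> bool:
    """Status-class rule: retry server errors (5xx), never client errors (4xx)."""
    return 500 <= status <= 599


print(naive_should_retry(503))  # True
print(naive_should_retry(429))  # False, although a 429 often clears after waiting
```

The rule rejects every 429 because it looks only at the status class; a taxonomy first decides which kind of 429 it is and then picks the budget.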

Generalisation beyond LLMs

The pattern is reusable anywhere a provider API returns typed failures. Examples where a similar taxonomy would help:

Email delivery: temporary bounce (retry) vs permanent bounce (don't) vs soft-bounce-quota (infinite retry with backoff).
Payment APIs: insufficient funds (don't retry) vs issuer timeout (infinite retry).
Image CDN: 404 on uncached object (backend might catch up) vs 403 (don't retry).

The LLM-batch case is notable because a class can carry its own preprocessing step (the image-existence check on retry #2 for invalid-image failures), not just its own retry budget. That is the second-order sophistication the taxonomy unlocks.

Seen in

sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple — canonical wiki instance. Maple's four-class taxonomy with per-class policies including a deliberately-deferred preprocessing step (image-existence check on retry #2 only).

concepts/llm-batch-api — the API surface that forces this failure taxonomy.
concepts/durable-execution — the substrate that makes multi-batch retry loops safe.
patterns/infinite-retry-by-failure-class — the pattern Maple canonicalises from this taxonomy.
patterns/llm-batch-processing-service — the consolidation service that owns the taxonomy on behalf of callers.
systems/maple-instacart — canonical production instance.