CONCEPT Cited by 1 source
Provider failure taxonomy
Definition
A provider failure taxonomy is a classification of the distinct failure modes an external API can return, which lets the caller choose a class-specific retry policy instead of a one-size-fits-all retry loop. The caller gains two things: (a) the ability to cap the retry budget per class (some classes are worth retrying forever; others will probably return the same result on every attempt); (b) the ability to run class-specific preprocessing between retries (e.g. check whether an image URL still exists before resubmitting).
Canonical LLM-batch-API taxonomy (Instacart Maple)
The 2025-08-27 Maple post enumerates four task-level failure classes an LLM batch API can return, each with its own retry policy (Source):
| Class | Cause | Default policy | Rationale |
|---|---|---|---|
| Expired | Provider fails to return within 24 h SLA; returns an error / expired message with no result | Retry infinitely in a new batch | Transient provider-side issue; resubmission is the entire remedy |
| Rate-limited | Caller hit the provider's token-per-minute ceiling | Retry infinitely | Purely a capacity-scheduling issue; backoff resolves it |
| Refused | Provider refuses to execute — bad params, content-filtered prompt/image | Retry max 2× | "probably return the same result" — no point burning budget on deterministic-looking refusals |
| Invalid image | Image URL dead or unreachable at provider | Retry optional; from retry #2, Maple checks that the image exists before resubmitting | Per-URL validation has a cost, so the check is deferred to retry #2: "checking each image in a large batch can add significant overhead" |
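The table above can be sketched as a lookup from failure class to retry policy. This is a minimal illustration, not Maple's implementation: the names `FailureClass`, `RetryPolicy`, `POLICIES`, and `should_retry` are hypothetical, and the cap of 2 for invalid images is an assumption (the source only says the retry is optional).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    EXPIRED = "expired"
    RATE_LIMITED = "rate_limited"
    REFUSED = "refused"
    INVALID_IMAGE = "invalid_image"


@dataclass(frozen=True)
class RetryPolicy:
    # None means "no cap": keep resubmitting until the task succeeds.
    max_retries: Optional[int]
    # Retry number from which image URLs are validated; None means never.
    check_image_from_retry: Optional[int] = None


POLICIES = {
    FailureClass.EXPIRED: RetryPolicy(max_retries=None),
    FailureClass.RATE_LIMITED: RetryPolicy(max_retries=None),
    FailureClass.REFUSED: RetryPolicy(max_retries=2),
    # Assumption: the cap of 2 is illustrative; the source says "retry optional".
    FailureClass.INVALID_IMAGE: RetryPolicy(max_retries=2, check_image_from_retry=2),
}


def should_retry(failure: FailureClass, retries_so_far: int) -> bool:
    """Return True if the class policy still allows another retry."""
    policy = POLICIES[failure]
    return policy.max_retries is None or retries_so_far < policy.max_retries
```

Keeping the policy in data rather than in branching code means a new failure class is one more dictionary entry, not a new code path in the retry loop.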
Two axes distinguish these classes:
- Likelihood the next attempt succeeds: high for expired / rate-limited (transient, pure timing), low for refused (likely deterministic).
- Preprocessing cost between retries: zero for expired / rate-limited, non-trivial for invalid-image (HEAD-check each URL). Deferring the check until retry #2 lets the common case (URL was fine, transient glitch) proceed without the overhead.
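The deferred image check could look roughly like the following. It is a sketch under stated assumptions: `head_ok` and `split_for_retry` are hypothetical helpers, the HEAD request via `urllib` is one plausible way to test existence (the source does not say how Maple does it), and the `check_from_retry=2` default mirrors the "retry #2" threshold described above.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Tuple


def head_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to `url` succeeds."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # Covers HTTP 4xx/5xx, DNS failures, timeouts, and malformed URLs.
        return False


def split_for_retry(
    image_urls: Iterable[str],
    retry_number: int,
    url_exists: Callable[[str], bool] = head_ok,
    check_from_retry: int = 2,
) -> Tuple[List[str], List[str]]:
    """Return (resubmit, dropped) for the next batch.

    Before `check_from_retry`, every URL is resubmitted without validation,
    so the common transient case pays no per-URL overhead.
    """
    urls = list(image_urls)
    if retry_number < check_from_retry:
        return urls, []
    resubmit: List[str] = []
    dropped: List[str] = []
    for url in urls:
        (resubmit if url_exists(url) else dropped).append(url)
    return resubmit, dropped
```

Passing `url_exists` as a parameter keeps the network call swappable, which also makes the deferral logic easy to test without touching real URLs.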
Why taxonomy matters
Without the taxonomy, callers default to either:
- Exponential backoff with a retry cap: cheap to build, but it gives up too early on transient capacity issues (rate limits can take minutes to hours to clear) and wastes budget on deterministic refusals.
- Infinite retry across all errors: burns money on refusals that will never succeed.
The taxonomy lets each class get the policy it deserves. It's the error-handling equivalent of "don't retry 4xx, do retry 5xx" on HTTP, but with more structure: the same HTTP 429 can be a provider-side capacity issue (infinite retry appropriate) or a caller-side misconfiguration (bounded retry).
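To illustrate why the status code alone is not enough, here is a small sketch that picks a retry cap from the status plus a reason field. Everything specific in it is an assumption for illustration: the reason strings, the function name `retry_cap_for`, and the numeric caps are hypothetical and do not come from any particular provider.

```python
from typing import Optional

# Hypothetical reason values a caller might derive from the error body.
PROVIDER_CAPACITY = "provider_capacity"
CALLER_MISCONFIGURED = "caller_misconfigured"


def retry_cap_for(status: int, reason: str) -> Optional[int]:
    """Return the retry cap: None for unbounded, 0 for "do not retry"."""
    if status == 429:
        # Same status code, two different classes with different budgets.
        if reason == PROVIDER_CAPACITY:
            return None
        return 2  # assumed bounded budget for caller-side problems
    if 500 <= status < 600:
        return 3  # assumed cap; the plain HTTP rule only says "do retry"
    return 0  # other 4xx and anything unrecognised: do not retry
```

For example, `retry_cap_for(429, PROVIDER_CAPACITY)` yields `None` (retry without a cap), while `retry_cap_for(429, CALLER_MISCONFIGURED)` yields `2`.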
Generalisation beyond LLMs
The pattern is reusable anywhere a provider API returns typed failures. Examples where a similar taxonomy would help:
- Email delivery: temporary bounce (retry) vs permanent bounce (don't) vs soft-bounce-quota (infinite retry with backoff).
- Payment APIs: insufficient funds (don't retry) vs issuer timeout (infinite retry).
- Image CDN: 404 on uncached object (backend might catch up) vs 403 (don't retry).
The LLM-batch case is notable because a class can carry its own preprocessing step (the image check on retry #2 for invalid images), not just its own retry budget. That's the second-order sophistication the taxonomy unlocks.
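As a sketch of how the same idea transfers, the email example above can be driven by a generic loop that tracks retries per failure class. The class names, the cap of 3 for temporary bounces, the `run_with_policy` helper, and the `safety_limit` guard are all assumptions for illustration; backoff delays are omitted to keep the example short.

```python
from typing import Callable, Dict, Optional, Tuple

# None = unbounded retries, 0 = never retry. Values are illustrative.
EMAIL_POLICY: Dict[str, Optional[int]] = {
    "temporary_bounce": 3,
    "permanent_bounce": 0,
    "soft_bounce_quota": None,
}


def run_with_policy(
    attempt: Callable[[], Optional[str]],
    policy: Dict[str, Optional[int]],
    safety_limit: int = 1000,
) -> Tuple[bool, int]:
    """Call `attempt` until it succeeds or a class budget is exhausted.

    `attempt` returns None on success, or the failure class name.
    Returns (succeeded, attempts_made).
    """
    retries_by_class: Dict[str, int] = {}
    for attempts_made in range(1, safety_limit + 1):
        failure = attempt()
        if failure is None:
            return True, attempts_made
        cap = policy[failure]
        used = retries_by_class.get(failure, 0)
        if cap is not None and used >= cap:
            return False, attempts_made
        retries_by_class[failure] = used + 1
    # Guard so the sketch always terminates; a real service would instead
    # rely on durable scheduling for the unbounded classes.
    return False, safety_limit
```

Tracking retries per class rather than per task means a task that first hits a quota failure and later a temporary bounce does not have one class consume the other's budget.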
Seen in
- sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple — canonical wiki instance. Maple's four-class taxonomy with per-class policies including a deliberately-deferred preprocessing step (image-existence check on retry #2 only).
Related
- concepts/llm-batch-api — the API surface that forces this failure taxonomy.
- concepts/durable-execution — the substrate that makes multi-batch retry loops safe.
- patterns/infinite-retry-by-failure-class — the pattern Maple canonicalises from this taxonomy.
- patterns/llm-batch-processing-service — the consolidation service that owns the taxonomy on behalf of callers.
- systems/maple-instacart — canonical production instance.