# LLM batch API

## Definition
An LLM batch API is a provider-side inference endpoint that accepts a bulk file of prompts (typically JSONL) and returns bulk results asynchronously, often with a 24-hour SLA and ~50% cost discount versus real-time APIs.
Canonical example — the shape disclosed by Instacart's Maple post (sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple, 2025-08-27):
- Per-batch ceiling: 50,000 prompts OR 200 MB — whichever comes first.
- SLA: 24 hours to return all results.
- Workflow: encode in provider-specific format → upload → poll for completion → download results → parse → retry failed prompts in a new batch.
- Cost: billed on submission, not completion — a partially failed batch costs what it would have cost to succeed.
- Cost savings: up to 50% off vs real-time per-call pricing.
- Failure modes: tasks can individually fail with classes like expired / rate-limited / refused / invalid-image (concepts/provider-failure-taxonomy).
This shape fits OpenAI's Batch API and Anthropic's Message Batches API (both have 24 h SLA + ~50% discount + per-batch limits); Google's Gemini Batch API ships a similar shape; Maple abstracts over an unnamed provider's API of this shape.
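The encode → upload → poll → download → parse → retry loop above can be sketched as a minimal caller. This is an illustrative sketch against an in-memory stub, not any real provider SDK — `FakeBatchProvider`, `run_batches`, and the stub's method names are all hypothetical:

```python
import json

# Hypothetical in-memory stub standing in for a provider's batch endpoint;
# a real provider exposes equivalent upload / poll / download operations.
class FakeBatchProvider:
    def __init__(self, fail_once):
        self.fail_once = set(fail_once)  # custom_ids that fail on first attempt
        self.batches = {}

    def upload(self, jsonl: str) -> str:
        batch_id = f"batch-{len(self.batches)}"
        self.batches[batch_id] = [json.loads(line) for line in jsonl.splitlines()]
        return batch_id

    def poll(self, batch_id: str) -> str:
        return "completed"  # a real batch may sit "in_progress" for up to 24 h

    def download(self, batch_id: str) -> list[dict]:
        out = []
        for task in self.batches[batch_id]:
            cid = task["custom_id"]
            if cid in self.fail_once:
                self.fail_once.discard(cid)
                out.append({"custom_id": cid, "error": "rate-limited"})
            else:
                out.append({"custom_id": cid, "result": task["prompt"].upper()})
        return out

# Per-class retry policy: expired / rate-limited retry; refused /
# invalid-image are treated as permanent (illustrative assumption).
RETRYABLE = {"expired", "rate-limited"}

def run_batches(prompts: dict[str, str], provider) -> dict[str, str]:
    """Encode -> upload -> poll -> download -> parse -> retry failed in a new batch."""
    results, pending = {}, dict(prompts)
    while pending:
        jsonl = "\n".join(json.dumps({"custom_id": k, "prompt": v})
                          for k, v in pending.items())
        batch_id = provider.upload(jsonl)            # billed at submission
        while provider.poll(batch_id) != "completed":
            pass                                      # real code sleeps between polls
        retry = {}
        for task in provider.download(batch_id):
            cid = task["custom_id"]
            if "error" in task:
                if task["error"] in RETRYABLE:        # per-class retry decision
                    retry[cid] = pending[cid]
            else:
                results[cid] = task["result"]
        pending = retry                               # failures go in a new batch
    return results
```

Note that every loop iteration is a fresh billed submission — exactly the cost property the Definition lists.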
## Why it exists
Real-time LLM inference APIs are optimised for low latency per call. Batch APIs trade latency (minutes to 24 h) for throughput + cost — the provider can fit batched requests into spare capacity cycles, aggregate similar prompts for batching optimisations, and avoid per-request rate-limit handling. The cost discount reflects that trade.
## Design consequences for callers
The batch workflow is operationally hostile compared to a single HTTP POST:
- Batch-count amplification — a 1M-prompt job becomes at least 20 separate batches at the 50K ceiling; more if size-limited.
- Multi-step state machine — encode / upload / poll / download / decode / merge / retry-failed-tasks-in-new-batch. Every step is a failure point.
- Durable workflow required — a caller process that crashes mid-batch risks paying for the already-submitted batch and losing the state needed to collect its results. The canonical concepts/durable-execution motivator.
- Partial-failure handling at task granularity — individual prompts within a batch can fail; caller must decide per-class how to retry (patterns/infinite-retry-by-failure-class).
- Input-size pressure — for jobs that won't fit in memory, caller needs stream-based processing; row-at-a-time CSV parsing becomes load-bearing.
Every team that uses the batch API independently re-implements this state machine. The consolidation play is a shared service — patterns/llm-batch-processing-service — that abstracts the batch API behind a CSV/Parquet-in, CSV/Parquet-out interface.
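Batch-count amplification and input-size pressure together suggest a streaming splitter that never holds the full job in memory. A minimal sketch, assuming JSONL-encoded prompts and the 50K / 200 MB ceilings above — `split_batches` is a hypothetical helper, not part of any provider SDK:

```python
import json

MAX_PROMPTS = 50_000        # per-batch prompt ceiling from the Maple post
MAX_BYTES = 200 * 1024**2   # 200 MB per-batch size ceiling

def split_batches(lines, max_prompts=MAX_PROMPTS, max_bytes=MAX_BYTES):
    """Yield lists of JSONL lines, closing a batch at whichever ceiling hits first.

    Accepts any iterable of encoded prompt lines, so the input can be
    streamed row-at-a-time from CSV/Parquet rather than loaded whole.
    """
    batch, size = [], 0
    for line in lines:
        n = len(line.encode("utf-8")) + 1  # +1 for the trailing newline
        if batch and (len(batch) >= max_prompts or size + n > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += n
    if batch:
        yield batch
```

At the defaults, a 1M-prompt job yields at least 20 batches — the amplification the first bullet describes.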
## Batch vs real-time choice
| Dimension | Batch API | Real-time API |
|---|---|---|
| Cost | ~50% off | baseline |
| Latency | hours to 24h | seconds |
| Per-call rate limit | aggregated across batch | per-request |
| Error handling | per-task failure classes | per-call |
| Best for | offline ML-training data gen, catalog enrichment, non-interactive pipelines | interactive UX, latency-critical paths, iterative ops |
Platforms that want to unify the caller interface across both — regardless of whether a specific provider ships batch — adopt patterns/batch-then-real-time-fallback.
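A minimal sketch of that fallback dispatch, assuming the caller supplies a latency budget; `dispatch`, `batch_fn`, and `realtime_fn` are hypothetical names for illustration, not a published pattern API:

```python
from typing import Callable

def dispatch(prompts: list[str],
             max_latency_s: float,
             provider_has_batch: bool,
             batch_fn: Callable[[list[str]], list[str]],
             realtime_fn: Callable[[str], str]) -> list[str]:
    """Route to batch when the provider ships it and the budget tolerates 24 h;
    otherwise fan out real-time calls behind the same interface."""
    if provider_has_batch and max_latency_s >= 24 * 3600:
        return batch_fn(prompts)              # ~50% cheaper, hours-scale latency
    return [realtime_fn(p) for p in prompts]  # baseline price, seconds-scale
```

Callers see one interface either way; only cost and latency change underneath.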
## Seen in
- sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple — Instacart's Maple service abstracts an unnamed provider's batch API matching the shape above. Canonical disclosure of what teams actually face when consuming an LLM batch API at 10M+ prompt scale: the 50K / 200 MB ceiling, 24 h SLA, four-class failure taxonomy, ~2.6 prompts/sec mean throughput, "up to 50% cost savings vs real-time."
## Related
- concepts/provider-failure-taxonomy — the error-class vocabulary batch-API consumers navigate.
- concepts/durable-execution — the caller-side property batch workflows force.
- concepts/cost-tracking-per-team — governance primitive that batch usage amplifies.
- concepts/stream-based-file-processing — the memory-safety primitive large-batch input requires.
- patterns/llm-batch-processing-service — the consolidation pattern built on top of this API shape.
- patterns/batch-then-real-time-fallback — unified-interface pattern when provider capability varies.
- patterns/infinite-retry-by-failure-class — per-class retry policy for task-level failures.
- systems/maple-instacart — canonical wiki instance.