
CONCEPT

LLM batch API

Definition

An LLM batch API is a provider-side inference endpoint that accepts a bulk file of prompts (typically JSONL) and returns bulk results asynchronously, often with a 24-hour SLA and ~50% cost discount versus real-time APIs.

Canonical example — the shape disclosed by Instacart's Maple post (sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple|2025-08-27):

  • Per-batch ceiling: 50,000 prompts OR 200 MB — whichever comes first.
  • SLA: 24 hours to return all results.
  • Workflow: encode in provider-specific format → upload → poll for completion → download results → parse → retry failed prompts in a new batch.
  • Cost: billed on submission, not completion — a partially failed batch costs what it would have cost to succeed.
  • Cost savings: up to 50% off vs real-time per-call pricing.
  • Failure modes: tasks can individually fail with classes like expired / rate-limited / refused / invalid-image (concepts/provider-failure-taxonomy).
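The encode step in this workflow is mechanical but provider-specific. A minimal sketch in Python, assuming a JSONL request shape like OpenAI's Batch API (`custom_id`, `method`, `url`, `body`); other providers use different field names:

```python
import json

def encode_batch(prompts, model="gpt-4o-mini"):
    """Encode prompts as JSONL, one request object per line.

    Field names follow OpenAI's Batch API request format; the model name
    here is only a placeholder. The caller-chosen custom_id is what lets
    results (and per-task failures) be joined back to input rows later.
    """
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"task-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines)
```

The `custom_id` matters more than it looks: because individual tasks fail with distinct classes, it is the only stable key for routing a failed prompt into the retry batch.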

This shape fits OpenAI's Batch API and Anthropic's Message Batches API (both have 24 h SLA + ~50% discount + per-batch limits); Google's Gemini Batch API ships a similar shape; Maple abstracts over an unnamed provider's API of this shape.

Why it exists

Real-time LLM inference APIs are optimised for low per-call latency. Batch APIs trade latency (minutes to 24 h) for throughput and cost: the provider can schedule batched requests into spare capacity, aggregate similar prompts for batching optimisations, and skip per-request rate-limit handling. The cost discount reflects that trade.

Design consequences for callers

The batch workflow is operationally hostile compared to a single HTTP POST:

  1. Batch-count amplification — a 1M-prompt job becomes at least 20 separate batches at the 50K ceiling; more if size-limited.
  2. Multi-step state machine — encode / upload / poll / download / decode / merge / retry-failed-tasks-in-new-batch. Every step is a failure point.
  3. Durable workflow required — a caller that crashes mid-batch risks paying for the already-submitted batch while losing the state needed to collect its results. A canonical concepts/durable-execution motivator.
  4. Partial-failure handling at task granularity — individual prompts within a batch can fail; caller must decide per-class how to retry (patterns/infinite-retry-by-failure-class).
  5. Input-size pressure — for jobs that won't fit in memory, caller needs stream-based processing; row-at-a-time CSV parsing becomes load-bearing.
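The state machine in points 2–4 can be sketched as a loop. This is a toy illustration, not Maple's implementation: `client` is a hypothetical provider SDK handle, and the retryable failure classes are an assumed policy. In production each step would checkpoint durably rather than hold state in local variables:

```python
import json
import time

def run_batch_job(client, prompts, poll_interval=0.0):
    """Encode -> upload -> poll -> download -> parse -> retry loop.

    `client` is a hypothetical adapter exposing upload/status/download.
    Every step here is a distinct failure point; a crash after upload()
    still incurs the cost of the submitted batch (billed on submission).
    """
    pending = {f"task-{i}": p for i, p in enumerate(prompts)}
    results = {}
    while pending:
        # Provider-specific JSONL encoding (simplified here).
        payload = "\n".join(
            json.dumps({"custom_id": cid, "prompt": p})
            for cid, p in pending.items())
        batch_id = client.upload(payload)          # billed now, not on completion
        while client.status(batch_id) != "completed":
            time.sleep(poll_interval)              # 24 h SLA: poll, don't block a request thread
        # download() returns per-task successes plus failures keyed by class.
        ok, failed = client.download(batch_id)
        results.update(ok)
        retryable = {"expired", "rate-limited"}    # assumed per-class retry policy
        pending = {cid: pending[cid]
                   for cid, cls in failed.items() if cls in retryable}
    return results
```

The inner retry step is where the per-class decision from point 4 lives: refused or invalid-image tasks would be dropped or routed elsewhere, while expired and rate-limited tasks go into a fresh batch.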

Every team that uses the batch API independently re-implements this state machine. The consolidation play is a shared service — patterns/llm-batch-processing-service — that abstracts the batch API behind a CSV/Parquet-in, CSV/Parquet-out interface.
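Batch-count amplification follows directly from the dual ceiling: a splitter has to honour both limits at once, flushing whenever either would be exceeded. A sketch under the limits stated above (50,000 prompts or 200 MB, whichever comes first):

```python
def split_batches(lines, max_prompts=50_000, max_bytes=200 * 1024 * 1024):
    """Split encoded JSONL lines into batches under both per-batch ceilings.

    Flushes the current batch when adding one more line would exceed
    either the prompt-count limit or the byte limit.
    """
    batches, current, size = [], [], 0
    for line in lines:
        n = len(line.encode()) + 1  # +1 for the newline separator
        if current and (len(current) >= max_prompts or size + n > max_bytes):
            batches.append(current)
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        batches.append(current)
    return batches
```

At these defaults a 1M-prompt job yields at least 20 batches, and more if the prompts are large enough for the 200 MB limit to bind first — which is exactly the fan-out a shared service can hide behind a file-in, file-out interface.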

Batch vs real-time choice

| Dimension | Batch API | Real-time API |
| --- | --- | --- |
| Cost | ~50% off | baseline |
| Latency | hours to 24 h | seconds |
| Rate limits | aggregated across batch | per-request |
| Error handling | per-task failure classes | per-call |
| Best for | offline ML-training data gen, catalog enrichment, non-interactive pipelines | interactive UX, latency-critical paths, iterative ops |

Platforms that want to unify the caller interface across both — regardless of whether a specific provider ships batch — adopt patterns/batch-then-real-time-fallback.
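The unified interface reduces to a dispatch decision at the adapter boundary. A minimal sketch, assuming a hypothetical per-provider adapter that advertises whether a batch endpoint exists:

```python
def submit(prompts, provider):
    """Batch-then-real-time fallback behind one caller interface.

    `provider` is a hypothetical adapter: providers that ship a batch
    endpoint take the discounted, high-latency path; the rest fall back
    to per-call real-time requests at baseline pricing.
    """
    if provider.supports_batch:
        return provider.run_batch(prompts)           # ~50% off, up to 24 h latency
    return [provider.complete(p) for p in prompts]   # real-time path, per-call pricing
```

Callers see one signature either way; the latency and cost characteristics of the table above become a property of the chosen provider, not of the calling code.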

Seen in

  • sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple — Instacart's Maple service abstracts an unnamed provider's batch API matching the shape above. Canonical disclosure of what teams actually face when consuming an LLM batch API at 10M+ prompt scale: the 50K / 200 MB ceiling, 24 h SLA, four-class failure taxonomy, ~2.6 prompts/sec mean throughput, "up to 50% cost savings vs real-time."