
CONCEPT

LLM batch API

Definition

An LLM batch API is a provider-side inference endpoint that accepts a bulk file of prompts (typically JSONL) and returns bulk results asynchronously, often with a 24-hour SLA and ~50% cost discount versus real-time APIs.

Canonical example — the shape disclosed by Instacart's Maple post (sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple|2025-08-27):

  • Per-batch ceiling: 50,000 prompts OR 200 MB — whichever comes first.
  • SLA: 24 hours to return all results.
  • Workflow: encode in provider-specific format → upload → poll for completion → download results → parse → retry failed prompts in a new batch.
  • Cost: billed on submission, not completion — a partially failed batch costs what it would have cost to succeed.
  • Cost savings: up to 50% off vs real-time per-call pricing.
  • Failure modes: tasks can individually fail with classes like expired / rate-limited / refused / invalid-image (concepts/provider-failure-taxonomy).
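The encode step in this workflow is mechanical but provider-specific. A minimal sketch in Python, assuming a JSONL request shape like OpenAI's Batch API (`custom_id`, `method`, `url`, `body`); other providers use different field names:

```python
import json

def encode_batch(prompts, model="gpt-4o-mini"):
    """Encode prompts as JSONL, one request object per line.

    Field names follow OpenAI's Batch API request format; the model name
    here is only a placeholder. The caller-chosen custom_id is what lets
    results (and per-task failures) be joined back to input rows later.
    """
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"task-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines)
```

The `custom_id` matters more than it looks: because individual tasks fail with distinct classes, it is the only stable key for routing a failed prompt into the retry batch.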

This shape fits OpenAI's Batch API and Anthropic's Message Batches API (both have 24 h SLA + ~50% discount + per-batch limits); Google's Gemini Batch API ships a similar shape; Maple abstracts over an unnamed provider's API of this shape.

Why it exists

Real-time LLM inference APIs are optimised for low per-call latency. Batch APIs trade latency (minutes to 24 h) for throughput and cost: the provider can schedule batched requests into spare capacity, aggregate similar prompts for batching optimisations, and skip per-request rate-limit handling. The cost discount reflects that trade.

Design consequences for callers

The batch workflow is operationally hostile compared to a single HTTP POST:

  1. Batch-count amplification — a 1M-prompt job becomes at least 20 separate batches at the 50K ceiling; more if size-limited.
  2. Multi-step state machine — encode / upload / poll / download / decode / merge / retry-failed-tasks-in-new-batch. Every step is a failure point.
  3. Durable workflow required — a caller that crashes mid-batch risks paying for the already-submitted batch while losing the state needed to collect its results. A canonical concepts/durable-execution motivator.
  4. Partial-failure handling at task granularity — individual prompts within a batch can fail; caller must decide per-class how to retry (patterns/infinite-retry-by-failure-class).
  5. Input-size pressure — for jobs that won't fit in memory, caller needs stream-based processing; row-at-a-time CSV parsing becomes load-bearing.
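The state machine in points 2–4 can be sketched as a loop. This is a toy illustration, not Maple's implementation: `client` is a hypothetical provider SDK handle, and the retryable failure classes are an assumed policy. In production each step would checkpoint durably rather than hold state in local variables:

```python
import json
import time

def run_batch_job(client, prompts, poll_interval=0.0):
    """Encode -> upload -> poll -> download -> parse -> retry loop.

    `client` is a hypothetical adapter exposing upload/status/download.
    Every step here is a distinct failure point; a crash after upload()
    still incurs the cost of the submitted batch (billed on submission).
    """
    pending = {f"task-{i}": p for i, p in enumerate(prompts)}
    results = {}
    while pending:
        # Provider-specific JSONL encoding (simplified here).
        payload = "\n".join(
            json.dumps({"custom_id": cid, "prompt": p})
            for cid, p in pending.items())
        batch_id = client.upload(payload)          # billed now, not on completion
        while client.status(batch_id) != "completed":
            time.sleep(poll_interval)              # 24 h SLA: poll, don't block a request thread
        # download() returns per-task successes plus failures keyed by class.
        ok, failed = client.download(batch_id)
        results.update(ok)
        retryable = {"expired", "rate-limited"}    # assumed per-class retry policy
        pending = {cid: pending[cid]
                   for cid, cls in failed.items() if cls in retryable}
    return results
```

The inner retry step is where the per-class decision from point 4 lives: refused or invalid-image tasks would be dropped or routed elsewhere, while expired and rate-limited tasks go into a fresh batch.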

Every team that uses the batch API independently re-implements this state machine. The consolidation play is a shared service — patterns/llm-batch-processing-service — that abstracts the batch API behind a CSV/Parquet-in, CSV/Parquet-out interface.
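Batch-count amplification follows directly from the dual ceiling: a splitter has to honour both limits at once, flushing whenever either would be exceeded. A sketch under the limits stated above (50,000 prompts or 200 MB, whichever comes first):

```python
def split_batches(lines, max_prompts=50_000, max_bytes=200 * 1024 * 1024):
    """Split encoded JSONL lines into batches under both per-batch ceilings.

    Flushes the current batch when adding one more line would exceed
    either the prompt-count limit or the byte limit.
    """
    batches, current, size = [], [], 0
    for line in lines:
        n = len(line.encode()) + 1  # +1 for the newline separator
        if current and (len(current) >= max_prompts or size + n > max_bytes):
            batches.append(current)
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        batches.append(current)
    return batches
```

At these defaults a 1M-prompt job yields at least 20 batches, and more if the prompts are large enough for the 200 MB limit to bind first — which is exactly the fan-out a shared service can hide behind a file-in, file-out interface.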

Batch vs real-time choice

| Dimension | Batch API | Real-time API |
| --- | --- | --- |
| Cost | ~50% off | baseline |
| Latency | hours to 24 h | seconds |
| Rate limits | aggregated across batch | per-request |
| Error handling | per-task failure classes | per-call |
| Best for | offline ML-training data gen, catalog enrichment, non-interactive pipelines | interactive UX, latency-critical paths, iterative ops |

Platforms that want to unify the caller interface across both — regardless of whether a specific provider ships batch — adopt patterns/batch-then-real-time-fallback.
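The unified interface reduces to a dispatch decision at the adapter boundary. A minimal sketch, assuming a hypothetical per-provider adapter that advertises whether a batch endpoint exists:

```python
def submit(prompts, provider):
    """Batch-then-real-time fallback behind one caller interface.

    `provider` is a hypothetical adapter: providers that ship a batch
    endpoint take the discounted, high-latency path; the rest fall back
    to per-call real-time requests at baseline pricing.
    """
    if provider.supports_batch:
        return provider.run_batch(prompts)           # ~50% off, up to 24 h latency
    return [provider.complete(p) for p in prompts]   # real-time path, per-call pricing
```

Callers see one signature either way; the latency and cost characteristics of the table above become a property of the chosen provider, not of the calling code.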

Seen in

  • sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple — Instacart's Maple service abstracts an unnamed provider's batch API matching the shape above. Canonical disclosure of what teams actually face when consuming an LLM batch API at 10M+ prompt scale: the 50K / 200 MB ceiling, 24 h SLA, four-class failure taxonomy, ~2.6 prompts/sec mean throughput, "up to 50% cost savings vs real-time."