
PATTERN

LLM batch processing service

Intent

Consolidate bulk LLM inference workloads — jobs of millions of prompts run offline against an LLM provider's batch API — into a single internal service that exposes a file-in / file-out RPC. The service hides the provider's batch-API workflow (encode → upload → poll → download → parse → retry-failed-in-new-batch), letting internal teams ship new LLM-driven pipelines without becoming LLM infrastructure experts.

The pattern replaces the common prior approach — every team writes its own Python script that calls the batch API — with a shared platform that owns the reliability story, the failure-handling policy, the storage shape, and the cost accounting.

When to use

  • Organisation is running ≥10M-prompt LLM jobs (or aggregate across teams) offline — catalog cleaning, attribute extraction, ML training data gen, ranking-model training, classification at scale.
  • Cost matters — batch APIs cost roughly half of real-time pricing; at 10M-prompt scale that's hundreds of thousands of dollars.
  • Multiple internal teams with similar-shape workloads are each writing their own batch-API-plumbing code — duplication signal.
  • Workloads are heterogeneous in provider choice — different teams prefer different models / providers for their use case; the service can mask provider-specific batch API quirks.
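The cost arithmetic behind the second bullet, with illustrative per-token prices and token counts (assumed for the sketch, not from the source):

```python
# Rough batch-discount arithmetic. All prices/volumes are illustrative assumptions.
PROMPTS = 10_000_000          # 10M-prompt job
TOKENS_PER_PROMPT = 1_500     # assumed average input + output tokens
REALTIME_PER_MTOK = 10.0      # assumed real-time price, $ per 1M tokens
BATCH_DISCOUNT = 0.5          # batch APIs commonly charge ~50% of real-time

realtime_cost = PROMPTS * TOKENS_PER_PROMPT / 1e6 * REALTIME_PER_MTOK
batch_cost = realtime_cost * BATCH_DISCOUNT
print(f"real-time: ${realtime_cost:,.0f}  batch: ${batch_cost:,.0f}  "
      f"saved: ${realtime_cost - batch_cost:,.0f}")
```

Even at these modest assumed prices the delta per job is tens of thousands of dollars, which is how recurring jobs reach the "hundreds of thousands per year" range.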

When not to use

  • Workload is entirely real-time / interactive — batch APIs don't help (latency is wrong).
  • Single team with a single workload — pay the platform cost on day 2, not day 1.
  • Jobs small enough that the 50K-prompt batch ceiling never bites (< 100K prompts per job, no team duplication).

Mechanics

The canonical realisation (Maple at Instacart, sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple|2025-08-27):

  1. Single RPC API — caller submits a CSV or Parquet file plus a prompt template; gets back a job ID.
  2. Split + encode — input is streamed from S3 (or similar blob store), split into batches respecting the provider's 50K-prompt / 200 MB limit, encoded into the provider's batch file format. Intermediate batches stored as Parquet for compression + random-access reads.
  3. Submit + poll — each batch uploaded to the provider through the AI Gateway; polled for completion (hours to 24h).
  4. Download + merge — per-batch results downloaded as they complete, matched to input rows by task ID, written as per-batch result Parquets. All per-batch results merged into a single output file mirroring the input format.
  5. Per-class retry — task-level failures classified by concepts/provider-failure-taxonomy and retried per patterns/infinite-retry-by-failure-class.
  6. Durable workflow substrate — entire pipeline runs as a Temporal workflow so crashes, deploys, and platform restarts don't lose work or re-spend on already-submitted batches (concepts/durable-execution).
  7. Stream-based processing throughout — large inputs never fully materialised in RAM (concepts/stream-based-file-processing).
  8. Cost tracking — every provider call routed through the AI Gateway, which logs per-team cost attribution.
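Step 2 above can be sketched as a streaming splitter that closes a batch whenever either provider ceiling would be exceeded (encoding shape and field names are assumptions; the real format is provider-specific):

```python
import json
from typing import Iterable, Iterator

MAX_TASKS = 50_000               # provider's per-batch prompt ceiling
MAX_BYTES = 200 * 1024 * 1024    # provider's per-batch file-size ceiling

def encode_batches(rows: Iterable[dict], template: str) -> Iterator[list[bytes]]:
    """Stream input rows into provider-format batches, never materialising
    the whole input: a batch is closed when adding the next line would
    exceed the task-count or byte ceiling."""
    batch: list[bytes] = []
    size = 0
    for i, row in enumerate(rows):
        line = json.dumps({
            "custom_id": f"task-{i}",  # later used to match results back to input rows
            "body": {"prompt": template.format(**row)},
        }).encode() + b"\n"
        if batch and (len(batch) >= MAX_TASKS or size + len(line) > MAX_BYTES):
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += len(line)
    if batch:
        yield batch
```

Because the splitter is a generator over a row iterator, it composes directly with streamed reads from S3/Parquet (steps 2 and 7) — RAM stays bounded by one batch, not the input file.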

Extensions

  • patterns/batch-then-real-time-fallback — wrap real-time-only providers behind the same CSV interface with auto-parallelisation and exponential backoff; when the provider eventually ships a batch API, switch at the platform layer without user-visible change.
  • Prompt-template library — share few-shot exemplars across teams (patterns/prompt-template-library).
  • Concurrent batch submission — within a single job, submit multiple batches in parallel to cut end-to-end completion time (bounded by provider's concurrent-batch limit).
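The concurrent-submission extension amounts to bounded parallelism over per-batch submit-and-poll work. A sketch using an asyncio semaphore (the `submit_and_poll` coroutine and the concurrency limit of 4 are assumptions; real limits come from the provider):

```python
import asyncio

MAX_CONCURRENT_BATCHES = 4  # assumed provider concurrent-batch limit

async def submit_all(batches, submit_and_poll):
    """Run submit_and_poll(batch) for every batch, at most
    MAX_CONCURRENT_BATCHES in flight at once; results keep input order."""
    sem = asyncio.Semaphore(MAX_CONCURRENT_BATCHES)

    async def bounded(batch):
        async with sem:
            return await submit_and_poll(batch)

    return await asyncio.gather(*(bounded(b) for b in batches))
```

`asyncio.gather` preserves ordering, so merged results still line up with input batches even though completion order varies.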

Outcomes (reported by Maple)

  • Scale: 10M+ prompt jobs routinely handled.
  • Cost: "Many processes have been reduced from hundreds of thousands of dollars per year to just thousands of dollars per year."
  • Platform leverage: Catalog, Fulfillment, Search, and ML-training teams all share one service.

Contrast

Related to and distinct from:

  • patterns/llm-attribute-extraction-platform (Instacart PARSE) — extraction-specific platform above LLM inference. Plausibly a caller of Maple when extraction scales to full-catalog jobs.
  • patterns/unified-image-generation-platform (Instacart PIXEL) — image-gen-specific platform with unified-parameter-protocol + VLM-evaluator quality gate. Same company, same consolidation stance, different modality.
  • patterns/centralized-embedding-platform (Expedia) — embedding-specific platform with similar "stop every team from DIY'ing this" framing at a different layer of the ML stack.
  • patterns/ai-gateway-provider-abstraction (Cloudflare AI Gateway, Databricks Unity AI Gateway) — the tier below an LLM batch processing service. AI Gateway handles provider routing, key injection, cost tracking. LLM batch service handles workflow orchestration, retry policy, file I/O.

The pattern stack, top to bottom:

Caller team (Catalog, Fulfillment, etc.)
LLM batch processing service (Maple)     ← this pattern
    │  CSV/Parquet in, CSV/Parquet out
AI Gateway (provider abstraction)        ← patterns/ai-gateway-provider-abstraction
    │  unified endpoint, key injection, cost tracking
External LLM provider (OpenAI, Anthropic, etc.)
    │  native batch API (50K/200MB/24h)

Seen in
