
PATTERN

Batch-then-real-time fallback

Intent

Present a single CSV/Parquet-in, CSV/Parquet-out batch interface to callers regardless of whether the underlying LLM provider offers a native batch API. When a provider has a batch API, use it (for the ~50% cost discount). When a provider is real-time-only, wrap its real-time API behind the same interface with automatic parallelisation + exponential backoff + intelligent retry — so callers never have to know or care.

The pattern is a platform-layer masking of provider-capability heterogeneity. Provider gets a batch API later? Switch at the platform layer; callers don't change.

When to use

  • Running a shared LLM batch processing service that routes to multiple providers.
  • Internal clients want provider choice (different models / vendors for different workloads) but should not have to pick based on "does this vendor support batch?"
  • Small batches are common enough that real-time parallelisation is actually faster than batch APIs for iterating on ops tasks.

Mechanics

The canonical realisation (Maple at Instacart, sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple|2025-08-27):

  • Single caller interface — CSV or Parquet file plus prompt template. No mode flag; no batch-vs-real-time option surfaced upward.
  • Provider-capability table internal to the service — each provider is tagged with its batch-API availability.
  • Batch-capable providers — Maple's usual pipeline (encode → upload → poll → download → merge).
  • Real-time-only providers — Maple's real-time wrapper runs:
      • Automatic parallelisation across concurrent requests (bounded by provider rate limit).
      • Exponential backoff on rate-limited responses.
      • Intelligent retry policies.
      • Failure tracking.
  • Seamless upgrade path — if a provider later ships a batch API, Maple switches the routing at the platform layer; "users don't need to do anything."
  • Small-batch latency benefit — real-time routing "made small batches complete more quickly, which is important for ops-related tasks when they are iterating on a problem."
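The routing decision above can be sketched minimally. All names here (`PROVIDER_SUPPORTS_BATCH`, `submit`, the stub pipelines) are hypothetical illustrations of the mechanics, not Maple's actual API:

```python
from dataclasses import dataclass

# Illustrative provider-capability table, internal to the service:
# each provider is tagged with its batch-API availability.
PROVIDER_SUPPORTS_BATCH = {
    "provider_a": True,   # native batch API -> discounted batch pipeline
    "provider_b": False,  # real-time only -> wrapped behind the same interface
}

@dataclass
class BatchJob:
    provider: str
    input_path: str       # CSV or Parquet file of rows to process
    prompt_template: str

def submit(job: BatchJob) -> str:
    """Single caller-facing entry point: no batch-vs-real-time flag."""
    if PROVIDER_SUPPORTS_BATCH[job.provider]:
        return run_batch_pipeline(job)   # encode -> upload -> poll -> download -> merge
    return run_realtime_wrapper(job)     # parallel real-time calls + backoff + retry

def run_batch_pipeline(job: BatchJob) -> str:
    return f"batch:{job.provider}"       # stub for the native batch path

def run_realtime_wrapper(job: BatchJob) -> str:
    return f"realtime:{job.provider}"    # stub for the real-time wrapper
```

The seamless-upgrade property falls out of this shape: flipping `provider_b` to `True` in the capability table changes the routing without any caller-side change.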

From the post:

"Teams no longer need to write custom scripts or pipelines to handle bulk real-time calls. Instead, they can use the same Maple interface, and the underlying platform will handle the complexities of interacting with real-time APIs at scale." (Source: sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple)

Why this pattern

The alternatives, all worse:

  1. Force callers to pick based on batch availability — couples application choice to operational detail; forces caller code change every time a provider ships / drops batch.
  2. Only support batch-capable providers — narrows the provider roster; blocks teams from using best-performing models for their workload if those models happen to be real-time-only.
  3. Offer two separate interfaces (batch + real-time) — every caller writes two code paths; code duplication of the sort this pattern is supposed to prevent.

The pattern preserves unified-interface semantics at the batch-processing layer — sibling of patterns/unified-inference-binding (Cloudflare Workers AI at SDK level) and patterns/automatic-provider-failover (Cloudflare AI Gateway at request level). Each operates at a different layer:

  • Unified-inference-binding — one SDK surface, model string selects provider (@cf/meta/llama-3.1-8b-instruct → model string IS the provider selector).
  • Automatic-provider-failover — gateway reroutes on upstream failure across providers that share a model.
  • Batch-then-real-time-fallback — the batch platform routes based on provider capability (does it support batch?), not on availability.

Contrast with "batch discount when possible"

A weaker pattern — "try batch; fall back to real-time if batch fails or times out" — exists but is different:

  • Weaker pattern: the caller is still aware of both modes; fallback is an error-handling clause.
  • This pattern: the caller sees one interface; the mode choice is a provider-capability-based routing decision at the platform layer, invisible upward.

Caveats

  • Real-time wrapper costs more than batch — you pay full real-time rates to deliver the batch-shaped interface when the provider doesn't support batch. Platform choice to absorb that cost for interface uniformity.
  • Real-time rate limits are lower than batch throughput; a 10M-prompt job via real-time-only wrapping will be far slower than via batch (minutes per 1K prompts on the real-time path vs 50K prompts per batch at provider throughput on the batch path). The "small batches complete more quickly" property is genuine only for small jobs.
  • Provider rate limits are the real-time wrapper's scaling ceiling; the platform ends up owning the back-off / parallelism strategy.
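A minimal sketch of the backoff/parallelism strategy the platform ends up owning — bounded concurrency plus exponential backoff on rate-limited responses. `RateLimited`, the retry counts, and the delays are assumptions for illustration, not Maple's actual tuning:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimited(Exception):
    """Illustrative stand-in for a provider's 429-style response."""

def call_with_backoff(call, prompt, max_retries=5, base_delay=0.01):
    for attempt in range(max_retries):
        try:
            return call(prompt)
        except RateLimited:
            # Exponential backoff with jitter before retrying.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    # A real wrapper would record this in failure tracking.
    raise RuntimeError("exhausted retries")

def run_parallel(call, prompts, max_workers=8):
    # Concurrency bound chosen to stay under the provider's rate limit.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: call_with_backoff(call, p), prompts))
```

The ceiling noted above shows up directly here: `max_workers` cannot exceed what the provider's real-time rate limit allows, so large jobs serialize behind it.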

Seen in
