# Instacart — Simplifying Large-Scale LLM Processing across Instacart with Maple

## Summary
Instacart Engineering post (2025-08-27) describing Maple, an internal batch-LLM-processing service that turns jobs of millions of prompts into a CSV/Parquet-in, CSV/Parquet-out interface, abstracting the LLM provider's batch API (50K prompts / 200 MB per batch) into a single RPC. Maple runs on Temporal for durable execution; stores inputs, intermediate batches, and outputs on S3 as Parquet (25× compression vs CSV plus columnar random access); proxies through an Instacart AI Gateway (distinct from the LLM provider) that integrates with a Cost Tracker for per-team usage accounting; and implements failure-class-specific retry policies (infinite for rate-limited and expired, bounded for refused, image-URL-check-on-retry for invalid-image). Reported outcomes: ~50% cost reduction vs real-time LLM calls, scale to 10M+ prompt jobs, and batch throughput of ~2.6 prompts/sec on average with most batches completing in under 12 hours, measured across a sample of ~580 batches at 40–50K tasks/batch. Maple was later extended to wrap non-batch (real-time-only) providers behind the same CSV interface with automatic parallelisation and exponential backoff — useful for ops-iteration-friendly small batches. Canonical batch-LLM-platform sibling of the text-AI-Gateway (Cloudflare / Databricks) and image-AI-platform (PIXEL) pattern graph: the same "stop every team from DIY'ing this" ML-platform consolidation play, applied at the batch-inference layer.
## Key takeaways
- Batch inference APIs are economically transformative but operationally hostile. LLM provider batch endpoints promise "up to 50% cost reduction vs real-time" but expose a 50K-prompt / 200 MB per-batch ceiling — a 1M-prompt job means at least 20 separate batches, each requiring encode → upload → status-poll → download → parse → retry-failed → repeat. Every team that tried to use them independently rewrote this workflow. Maple consolidates it into a CSV/Parquet-in, merged-output-out RPC. (Source: Maple)
- Temporal is the load-bearing substrate. Every activity in the pipeline (encode, upload, poll, download, decode, merge) is a Temporal activity; the overall job is a Temporal workflow (see the workflow sketch after this list). This is a canonical production instance of concepts/durable-execution applied to long-running batch pipelines: "Even if exceptions occur, Temporal's fault tolerance safeguards data integrity and guarantees job completion." Instacart's specific payoff: "protects against data loss but also avoids wasting money on partially completed jobs" — the cost angle is load-bearing because LLM batch inference is paid at batch-submit time, not at batch-complete time.
- S3 Parquet, not a database, for intermediate state. Inputs, per-batch splits, per-batch outputs, and final merged outputs all live on S3 as Parquet files. Stated rationale: "avoiding costly database operations … not only cheaper but also allows handling large datasets." Parquet-specific wins disclosed: up to 25× size reduction vs CSV (per-column compression), and non-linear (random-access) reads into the file — the columnar property is used at merge time, not just at archival. patterns/metadata-plus-chunk-storage-stack at the batch-job granularity.
- The AI-Gateway layer is two-tier, not one-tier. Maple proxies all its LLM calls through an Instacart AI Gateway (internal service), which in turn routes to the external LLM provider and logs usage to a Cost Tracker. This is the classical patterns/ai-gateway-provider-abstraction pattern (Cloudflare / Databricks are the text-LLM siblings), but Maple is a consumer of the AI Gateway rather than the AI Gateway itself — the batch-processing layer sits above the provider-abstraction layer, which sits above the provider. Each concern lives at the layer where it is easiest to build once: batch pipeline = Maple, provider routing + cost tracking = AI Gateway, inference = LLM provider.
- Failure-class-specific retry policy is the heart of the reliability story. The post enumerates four task-level failure modes, each with its own policy (see the retry-table sketch after this list): (a) Expired (provider fails to return within 24 h) → retry infinitely by default (construct a new batch from the failed tasks); (b) Rate-limited (provider token limit exceeded) → retry infinitely by default; (c) Refused (bad params, filtered image/prompt) → retry at most 2× by default (further attempts would "probably return the same result"); (d) Invalid image (image URL dead or unreachable) → a retry option that checks image existence before resubmitting, but only from the second attempt onward (checking every URL on the first pass "can add significant overhead"). Canonical instance of patterns/infinite-retry-by-failure-class — the retry policy is a function of which failure occurred, not one-size-fits-all.
- Performance disclosures, from a ~580-batch / 40–50K-tasks-each sample — real production data, not vendor-quoted: mean throughput ~2.6 prompts/sec per batch (histogram clustered at 1–4 prompts/sec), most batches complete in under 12 hours (some take nearly the full 24 h SLA), and completion time increases with job size (positive slope on a log-y-axis scatter plot). "Processing time can vary based on the prompt, especially when including images with the prompt, which is a common case for us." This is the wiki's first ingested benchmark of production LLM batch-API latency at scale.
- Scale optimizations were not theoretical; they were forced. Three specific upgrades named: (a) moved task data from DB to S3 Parquet as input sizes grew; (b) adopted stream-based processing to bound memory consumption (classic concepts/stream-based-file-processing move — don't load a 1M-prompt CSV into RAM; see the streaming sketch after this list); (c) replaced Python's json stdlib with orjson, a "faster and more memory-efficient alternative." Each swap is small, but the canonical lesson is: "as our internal clients sent larger and larger input files, we hit storage, memory, and processing limitations." Together these optimizations "allowed Maple to scale efficiently to 10M+ prompt jobs."
- Batch-then-real-time-fallback unifies the interface across provider capabilities. Not all LLM providers offer batch; some are real-time-only. Rather than force teams to pick a provider based on batch availability, Maple wraps real-time-only providers behind the same CSV/Parquet interface with automatic parallelisation, exponential backoff on rate limits, intelligent retry policies, and failure tracking (see the fallback sketch after this list). "If a provider starts offering a batch interface, we can switch it over seamlessly without our users needing to do anything." Canonical patterns/batch-then-real-time-fallback — the caller interface is batch-shaped regardless of what the underlying provider supports, and small batches complete faster under real-time routing (useful for iterative ops tasks). The platform hides provider-capability heterogeneity behind a stable CSV contract — the batch-layer sibling of concepts/unified-parameter-protocol (PIXEL's image-gen version at parameter level) and patterns/unified-inference-binding (Cloudflare Workers AI's text-LLM version at SDK level).
- Platform-level investments compound across teams. The post names the Catalog, Fulfillment, and Search teams as distinct Maple consumers with different workloads (catalog data-cleaning + attribute enrichment; perishable-item routing; ranking-model training). "Many processes have been reduced from hundreds of thousands of dollars per year to just thousands of dollars per year." Canonical concepts/model-agnostic-ml-platform consolidation claim: "Maple democratises access to bulk LLM prompt processing at Instacart. Teams can now explore new ideas, automate repetitive work, and ship faster — without becoming LLM infrastructure experts."
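
A minimal workflow sketch, assuming the Temporal Python SDK and invented activity names (`split_and_encode`, `submit_batch`, `batch_done`, `download_and_decode`, `merge_results`); the post confirms the activity/workflow split but discloses neither the actual decomposition nor batch concurrency, so batches are submitted serially here for clarity:

```python
import asyncio
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def split_and_encode(input_uri: str) -> list[str]:
    """Split the input file into <=50K-prompt / <=200 MB encoded batches on S3."""
    ...

@activity.defn
async def submit_batch(batch_uri: str) -> str:
    """Upload one encoded batch via the AI Gateway; return the provider batch id."""
    ...

@activity.defn
async def batch_done(batch_id: str) -> bool:
    """Ask the provider (via the AI Gateway) whether the batch has completed."""
    ...

@activity.defn
async def download_and_decode(batch_id: str) -> str:
    """Download results, write a per-batch result Parquet, return its S3 URI."""
    ...

@activity.defn
async def merge_results(result_uris: list[str]) -> str:
    """Join per-batch outputs by task id into the final CSV/Parquet output."""
    ...


@workflow.defn
class MapleJob:
    @workflow.run
    async def run(self, input_uri: str) -> str:
        opts = {"start_to_close_timeout": timedelta(minutes=30)}
        result_uris: list[str] = []
        for uri in await workflow.execute_activity(split_and_encode, input_uri, **opts):
            batch_id = await workflow.execute_activity(submit_batch, uri, **opts)
            # Durable wait: asyncio.sleep runs on Temporal's persisted timers,
            # so a worker crash mid-poll resumes here rather than resubmitting
            # a batch that was already paid for at submit time.
            while not await workflow.execute_activity(batch_done, batch_id, **opts):
                await asyncio.sleep(600)
            result_uris.append(
                await workflow.execute_activity(download_and_decode, batch_id, **opts)
            )
        return await workflow.execute_activity(merge_results, result_uris, **opts)
```

The design point the post leans on is that durable wait: a multi-hour poll survives process restarts, which is exactly the "avoids wasting money on partially completed jobs" claim.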
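A minimal sketch of the four-class retry table, assuming an invented data model and class names (the post specifies the policies, not Maple's code):

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RetryRule:
    max_attempts: Optional[int]        # None = retry indefinitely
    check_image_url_first: bool = False


# Defaults as described in the post; the string keys are invented.
RETRY_RULES = {
    "expired":       RetryRule(max_attempts=None),  # provider blew the 24 h SLA
    "rate_limited":  RetryRule(max_attempts=None),  # provider token limit hit
    "refused":       RetryRule(max_attempts=2),     # same input, same refusal
    "invalid_image": RetryRule(max_attempts=None, check_image_url_first=True),
}


def should_resubmit(failure_class: str, attempt: int,
                    image_url_alive: Callable[[], bool]) -> bool:
    """Decide whether a failed task goes into the next retry batch."""
    rule = RETRY_RULES[failure_class]
    if rule.max_attempts is not None and attempt >= rule.max_attempts:
        return False
    # Defer the URL existence check to the second attempt onward: checking
    # every image URL on the first pass "can add significant overhead".
    if rule.check_image_url_first and attempt >= 2:
        return image_url_alive()
    return True
```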
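A sketch of the stream-based move under stated assumptions (illustrative file paths; PyArrow's incremental CSV reader standing in for whatever Maple actually uses, with orjson as the drop-in json replacement):

```python
import orjson                        # faster, leaner json.loads/dumps
import pyarrow.csv as pacsv
import pyarrow.parquet as pq


def csv_to_parquet_streaming(csv_path: str, parquet_path: str) -> None:
    """Convert a large CSV to Parquet while holding ~one record batch in RAM."""
    reader = pacsv.open_csv(csv_path)        # incremental reader, not read_csv()
    writer = None
    for batch in reader:                     # one RecordBatch at a time
        if writer is None:
            writer = pq.ParquetWriter(parquet_path, batch.schema)
        writer.write_batch(batch)
    if writer is not None:
        writer.close()


def iter_provider_results(jsonl_path: str):
    """Stream provider result lines; orjson.loads replaces json.loads."""
    with open(jsonl_path, "rb") as f:
        for line in f:
            yield orjson.loads(line)
```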
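And a sketch of the real-time fallback path, assuming a hypothetical async provider client; the post names parallelisation and exponential backoff but gives no implementation details:

```python
import asyncio
import random


class RateLimitError(Exception):
    """Stand-in for whatever the provider client raises on HTTP 429."""


async def call_with_backoff(client, prompt: str, max_backoff: float = 60.0):
    delay = 1.0
    while True:
        try:
            return await client.complete(prompt)          # hypothetical method
        except RateLimitError:
            await asyncio.sleep(delay + random.random())  # jittered wait
            delay = min(delay * 2, max_backoff)           # exponential growth


async def run_as_batch(client, prompts: list[str], concurrency: int = 16):
    """Batch-shaped contract over a real-time-only provider: prompts in,
    ordered results out, parallelism and backoff hidden from the caller."""
    sem = asyncio.Semaphore(concurrency)

    async def one(prompt: str):
        async with sem:
            return await call_with_backoff(client, prompt)

    return await asyncio.gather(*(one(p) for p in prompts))
```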
## Architectural shape
```
┌──────────────────────────────────────────┐
│ Internal client (Catalog / Fulfillment   │
│ / Search / Ads team)                     │
└─────────────────────┬────────────────────┘
                      │ CSV or Parquet file + prompt template
                      │ (RPC API)
                      ▼
┌──────────────────────────────────────────┐
│ Maple service (Python, PyArrow)          │
│  — Temporal workflow                     │
│  — S3 Parquet intermediates              │
│  — per-batch 50K-prompt / 200 MB split   │
│  — stream-based processing               │
│  — orjson for JSON parsing               │
│  — failure-class retry policy            │
└─────────────────────┬────────────────────┘
                      │ encoded batch file (LLM provider batch format)
                      ▼
┌──────────────────────────────────────────┐
│ Instacart AI Gateway                     │
│  — provider routing                      │
│  — Cost Tracker integration              │
│  — per-team spend attribution            │
└─────────────────────┬────────────────────┘
                      │ provider-specific batch or real-time API
                      ▼
┌──────────────────────────────────────────┐
│ External LLM provider                    │
│  — 50K-prompt / 200 MB batch cap         │
│  — 24 h SLA                              │
│  — ~2.6 tasks/sec mean throughput        │
└──────────────────────────────────────────┘
```
Intermediate storage shape:

```
Input CSV (client-provided)
   ↓ split + encode
Batch 1 Parquet (≤ 50K prompts)  → provider batch upload
Batch 2 Parquet                  → provider batch upload
...                              → ...
Batch N Parquet                  → provider batch upload
   ↓ poll for completion
Download results
   ↓ decode + join by task ID
Per-batch result Parquet 1
Per-batch result Parquet 2
...
Per-batch result Parquet N
   ↓ merge
Final output file (CSV or Parquet, mirrors input format)
```
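
A hedged sketch of the final merge step above, assuming invented column names (`task_id`, `response`); it leans on the columnar property called out in the takeaways by reading only the join columns from each per-batch result file:

```python
import pyarrow as pa
import pyarrow.parquet as pq


def merge_outputs(input_path: str, result_paths: list[str], out_path: str) -> None:
    # Columnar read: pull only the join key and the model response from each
    # per-batch result file instead of decoding whole rows.
    results = pa.concat_tables(
        [pq.read_table(p, columns=["task_id", "response"]) for p in result_paths]
    )
    inputs = pq.read_table(input_path)
    merged = inputs.join(results, keys="task_id")  # join outputs back by task id
    pq.write_table(merged, out_path)
```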
## Numbers disclosed
- Batch ceiling: 50,000 prompts OR 200 MB per batch (LLM provider constraint).
- Scale: 10M+ prompt jobs handled by Maple; at least 20 batches needed for a 1M-prompt job.
- Cost: "up to 50% on LLM costs compared to standard real-time calls" — savings from batch vs real-time, not from Maple per se.
- Production sample: ~580 batches, 40–50K tasks per batch (most at 50K).
- Throughput: mean 2.6 prompts/sec per batch, distribution clustered 1–4 prompts/sec.
- Batch completion time: most batches complete in < 12 h; some approach the 24 h SLA.
- Provider SLA: 24 hours per batch.
- Size reduction: Parquet up to 25× smaller than CSV (Instacart's number; per-column compression is an intrinsic Parquet property).
- Cost savings per workload: "hundreds of thousands of dollars per year to just thousands of dollars per year" on specific processes.
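
Derived, not disclosed: at the mean 2.6 prompts/sec, a full 50,000-prompt batch works out to 50,000 / 2.6 ≈ 19,200 s ≈ 5.3 h, which sits inside the under-12-hour mode and well under the 24 h SLA, so the throughput and completion-time disclosures are mutually consistent.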
## Numbers not disclosed
- Maple's own service latency / throughput / QPS.
- Temporal workflow / activity counts per job.
- Concurrency across batches (serial vs parallel batch submission).
- Real-time fallback path concurrency limits + per-provider rate-limit ceilings.
- Per-team Cost Tracker numbers (absolute or relative).
- AI-Gateway fan-out factor / provider count / which providers are integrated.
- Specific LLM model(s) used (no OpenAI / Anthropic / Google / Mistral vendor name).
- Error-rate breakdown across the four failure classes.
- Image-check cost — the "significant overhead" rationale for deferring to retry #2 is qualitative.
- orjson switchover impact (memory or latency delta).
- DB-to-Parquet migration impact (costs before/after).
- Temporal-specific ops datapoints (worker count, activity retry budget, durable-timer usage for 24h polling).
## Caveats
- Announcement-voice post — architecture section is solid but many implementation details gestured at rather than specified (Temporal activity decomposition, error-handling state machine, polling cadence, parallel-batch submission strategy).
- LLM provider is unnamed — the 50K-prompt / 200 MB / 24 h SLA fits OpenAI's Batch API closely, but Anthropic's Message Batches API has similar constraints; the post avoids vendor commitment. This matters because some of Maple's design choices (e.g. 50% savings, 24 h SLA, batch-vs-real-time capability split) are inherited from a specific provider's API surface, not general to all LLM providers.
- "AI Gateway" is underspecified — the post names it, describes it as proxying requests and integrating with Cost Tracker, but does not disclose its architecture (is it a single worker? a fleet? per-region? does it do caching / semantic caching / model fallback like systems/cloudflare-ai-gateway?). The PIXEL post (sources/2025-07-17-instacart-introducing-pixel-instacarts-unified-image-generation-platform) also references "existing Instacart infra" without disclosing the AI Gateway level.
- No failure mode analysis on Temporal itself — if Temporal is partitioned / unavailable, Maple jobs stall. Post doesn't address this (may have been deemed out of scope for a feature-overview post).
- Sample size caveats — the ~580-batch sample is real but the prompt mix isn't characterised (how many include images? what modalities? what model?); throughput numbers don't generalise beyond Instacart's prompt-image mix.
- Batch-API vs real-time choice is presented as either-or, but the real-time-fallback path is only for providers without a batch API — not a cost-optimisation fallback for small batches on providers with both. The "small batches complete more quickly" framing is about the real-time-only-providers path, not a Maple design knob.
- Python-specific scale ceiling not named — stream-processing + orjson + PyArrow handle most of the memory problem, but 10M+ prompt jobs at a single Maple Temporal worker's GIL boundary is plausibly a bottleneck. Post doesn't say how Maple scales horizontally.
## Source
- Original: https://tech.instacart.com/simplifying-large-scale-llm-processing-across-instacart-with-maple-63df4508d5be?source=rss----587883b5d2ee---4
- Raw markdown: raw/instacart/2025-08-27-simplifying-large-scale-llm-processing-across-instacart-with-7fe37df1.md
## Related
- companies/instacart — operator.
- systems/maple-instacart — the system this post introduces.
- systems/instacart-ai-gateway — proxy tier Maple calls through.
- systems/instacart-cost-tracker — per-team usage accounting system.
- systems/temporal — durable-execution substrate.
- systems/aws-s3 — intermediate + final storage.
- systems/apache-parquet — intermediate file format (25× vs CSV, columnar random access).
- concepts/durable-execution — Temporal's motivating property.
- concepts/llm-batch-api — the provider API Maple abstracts.
- concepts/provider-failure-taxonomy — Maple's four-class failure framing (expired / rate-limited / refused / invalid-image).
- concepts/stream-based-file-processing — memory-bound input handling.
- concepts/cost-tracking-per-team — AI-Gateway-level accounting primitive.
- patterns/llm-batch-processing-service — new canonical pattern this post introduces.
- patterns/batch-then-real-time-fallback — unified-interface pattern for heterogeneous provider capability.
- patterns/infinite-retry-by-failure-class — the class-dependent retry policy.
- patterns/ai-gateway-provider-abstraction — the tier below Maple.
- patterns/csv-in-parquet-intermediate-output-merge — the storage-shape pattern.
- sources/2025-07-17-instacart-introducing-pixel-instacarts-unified-image-generation-platform — sibling Instacart platform post — PIXEL is the image-generation ML-platform consolidation; Maple is the batch-inference ML-platform consolidation. Same company, same "stop every team from DIY'ing this" architectural stance, same AI-Gateway-below / platform-above layering.
- sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms — sibling Instacart platform post — PARSE is the attribute-extraction platform; explicitly names Catalog + Fulfillment as teams, same as Maple. PARSE + Maple likely compose in production (PARSE calls → millions of prompts → Maple batch dispatch).
- sources/2026-04-16-cloudflare-ai-platform-an-inference-layer-designed-for-agents — text-LLM sibling of the provider-abstraction shape Maple sits above.
- sources/2026-04-17-databricks-governing-coding-agent-sprawl-with-unity-ai-gateway — coding-agent sibling of the provider-abstraction shape.
- sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform — another canonical wiki instance of Temporal-as-automation-layer for a batch-pipeline provisioning workflow; different domain (CDC vs LLM batches), same substrate rationale.