INSTACART 2025-08-27 Tier 2

Instacart — Simplifying Large-Scale LLM Processing across Instacart with Maple

Summary

Instacart Engineering post (2025-08-27) describing Maple — an internal batch-LLM-processing service that turns millions-of-prompt jobs into a CSV/Parquet in / CSV/Parquet out interface, abstracting the LLM provider's 50K-prompt / 200 MB-per-batch batch API into a single RPC. Maple runs on Temporal for durable execution, stores inputs / intermediate batches / outputs on S3 as Parquet (25× compression vs CSV + columnar random access), proxies through an Instacart AI Gateway (distinct from the LLM provider) that integrates with a Cost Tracker for per-team usage accounting, and implements failure-class-specific retry policies (infinite for rate-limit + expired, bounded for refused, image-URL-check-on-retry for invalid-image). Reported outcomes: ~50% cost reduction vs real-time LLM calls, scale to 10M+ prompt jobs, batch throughput measured at ~2.6 prompts/sec avg with most batches completing in under 12 hours across a sample of ~580 batches at 40–50K tasks/batch. Maple was later extended to wrap non-batch (real-time-only) providers behind the same CSV interface with automatic parallelisation + exponential backoff — useful for ops-iteration-friendly small batches. Canonical batch-LLM-platform sibling of the text-AI-Gateway (Cloudflare / Databricks) + image-AI-platform (PIXEL) pattern graph: same "stop every team from DIY'ing this" ML-platform consolidation play, at the batch-inference layer.

Key takeaways

  1. Batch inference APIs are economically transformative but operationally hostile. LLM provider batch endpoints promise "up to 50% cost reduction vs real-time" but expose a 50K-prompt / 200 MB per-batch ceiling — a 1M-prompt job means at least 20 separate batches, each requiring encode → upload → status-poll → download → parse → retry-failed → repeat. Every team that tried to use them independently rewrote this workflow. Maple consolidates it into a CSV/Parquet in, merged-output out RPC. (Source: Maple)
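The split step above can be sketched in a few lines; the function name and the simple byte accounting are illustrative, not Instacart's actual implementation:

```python
# Sketch of the batch-split step (illustrative, not Maple's code): cut a job
# into batches that respect BOTH provider ceilings, 50K prompts per batch
# and 200 MB of encoded payload per batch.
MAX_PROMPTS = 50_000
MAX_BYTES = 200 * 1024 * 1024  # 200 MB

def split_into_batches(prompts):
    """Yield lists of encoded prompts, closing a batch when either cap is hit."""
    batch, batch_bytes = [], 0
    for prompt in prompts:
        encoded = prompt.encode("utf-8")
        if batch and (len(batch) >= MAX_PROMPTS
                      or batch_bytes + len(encoded) > MAX_BYTES):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(encoded)
        batch_bytes += len(encoded)
    if batch:
        yield batch

# A 1M-prompt job therefore needs at least 1_000_000 / 50_000 = 20 batches.
```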

  2. Temporal is the load-bearing substrate. Every activity in the pipeline (encode, upload, poll, download, decode, merge) is a Temporal activity; the overall job is a Temporal workflow. This is a canonical production instance of concepts/durable-execution applied to long-running batch pipelines: "Even if exceptions occur, Temporal's fault tolerance safeguards data integrity and guarantees job completion." Instacart's specific payoff: "protects against data loss but also avoids wasting money on partially completed jobs" — the cost angle is load-bearing because LLM batch inference is paid at batch-submit time, not at batch-complete time.

  3. S3-Parquet, not a database, for intermediate state. Inputs, per-batch splits, per-batch outputs, and final merged outputs all live on S3 as Parquet files. Stated rationale: "avoiding costly database operations … not only cheaper but also allows handling large datasets." Parquet-specific wins disclosed: up to 25× size reduction vs CSV (per-column compression), non-linear (random-access) reads into the file — the columnar property is used at merge time, not just at archival. patterns/metadata-plus-chunk-storage-stack at the batch-job granularity.

  4. The AI-Gateway layer is two-tier, not one-tier. Maple proxies all its LLM calls through an Instacart AI Gateway (internal service), which in turn routes to the external LLM provider + logs usage to a Cost Tracker. This is the classical patterns/ai-gateway-provider-abstraction pattern (Cloudflare / Databricks are the text-LLM siblings), but Maple is a consumer of the AI Gateway rather than the AI Gateway itself — the batch-processing layer sits above the provider-abstraction layer, which sits above the provider. Each concern lives at the layer where it is easiest to build once: batch pipeline = Maple, provider routing + cost tracking = AI Gateway, inference = LLM provider.

  5. Failure-class-specific retry policy is the heart of the reliability story. The post enumerates four task-level failure modes, each with its own policy: (a) Expired (provider fails to return within 24 h) → retry infinitely by default (construct a new batch with failed tasks); (b) Rate-limited (provider token-limit exceeded) → retry infinitely by default; (c) Refused (bad params, filtered image/prompt) → retry max 2× default ("probably return the same result" otherwise); (d) Invalid image (image URL dead or unreachable) → retry option that checks image existence before resubmitting, but only on the second attempt (checking every URL on the first pass "can add significant overhead"). Canonical instance of patterns/infinite-retry-by-failure-class — the retry policy is a function of which failure, not one-size-fits-all.
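The four policies can be sketched as a lookup from failure class to retry behavior. This is a plausible reading of the post, not Instacart's code; in particular, the post does not state a cap for invalid-image retries, so the infinite budget there is an assumption:

```python
# Sketch of failure-class-specific retry policy: the retry budget is a
# function of WHICH failure occurred, not a single global policy.
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: float              # math.inf = retry until job cancelled
    check_image_url: bool = False    # verify image URLs before resubmitting

POLICIES = {
    "expired":       RetryPolicy(max_attempts=math.inf),  # provider missed 24h SLA
    "rate_limited":  RetryPolicy(max_attempts=math.inf),  # provider token limit hit
    "refused":       RetryPolicy(max_attempts=2),         # bad params / filtered content
    # budget not disclosed in the post; assumed configurable, inf as default
    "invalid_image": RetryPolicy(max_attempts=math.inf, check_image_url=True),
}

def should_retry(failure_class: str, attempt: int) -> bool:
    return attempt < POLICIES[failure_class].max_attempts

def should_check_image(failure_class: str, attempt: int) -> bool:
    # Checking every URL up front "can add significant overhead", so the
    # existence check only runs from the second attempt onward.
    return POLICIES[failure_class].check_image_url and attempt >= 2
```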

  6. Performance disclosures, from a ~580-batch / 40–50K-tasks-each sample — real production data, not vendor-quoted: mean throughput ~2.6 prompts/sec per batch (histogram clustered 1–4 prompts/sec), most batches complete in under 12 hours (some take nearly the full 24 h SLA), completion time increases with job size (positive slope on a log-y-axis scatter plot). "Processing time can vary based on the prompt, especially when including images with the prompt, which is a common case for us." This is the wiki's first ingested benchmark of production LLM batch-API latency at scale.

  7. Scale optimizations were not theoretical; they were forced. Three specific upgrades named: (a) moved task data from DB to S3 Parquet as input sizes grew; (b) adopted stream-based processing to bound memory consumption (classic concepts/stream-based-file-processing move — don't load a 1M-prompt CSV into RAM); (c) replaced the Python json stdlib with orjson, a "faster and more memory-efficient alternative." Each swap is small, but the canonical lesson is: "as our internal clients sent larger and larger input files, we hit storage, memory, and processing limitations." These optimizations "allowed Maple to scale efficiently to 10M+ prompt jobs."
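The stream-based move in (b) amounts to holding one row in memory at a time. A minimal sketch, with stdlib json standing in for orjson (which exposes the same loads()/dumps() surface); the function and template names are illustrative:

```python
# Sketch of stream-based file processing: encode a multi-gigabyte input CSV
# row by row instead of loading it all into RAM.
import csv
import json  # Maple swapped this for orjson; stdlib keeps the sketch dependency-free

def stream_encode(csv_path, out_path, prompt_template):
    """Encode each CSV row into one JSON-lines request, holding only a
    single row in memory at a time."""
    with open(csv_path, newline="") as src, open(out_path, "w") as dst:
        for row in csv.DictReader(src):
            request = {
                "task_id": row["task_id"],
                "prompt": prompt_template.format(**row),
            }
            dst.write(json.dumps(request) + "\n")
```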

  8. Batch-then-real-time-fallback unifies the interface across provider capabilities. Not all LLM providers offer batch; some are real-time-only. Rather than force teams to pick a provider based on batch availability, Maple wraps real-time-only providers behind the same CSV/Parquet interface with automatic parallelisation, exponential backoff on rate limits, intelligent retry policies, and failure tracking. "If a provider starts offering a batch interface, we can switch it over seamlessly without our users needing to do anything." Canonical patterns/batch-then-real-time-fallback — the caller interface is batch-shaped regardless of what the underlying provider supports, and small batches complete faster under real-time routing (useful for iterative ops tasks). The platform hides provider-capability heterogeneity behind a stable CSV contract — the batch-layer sibling of concepts/unified-parameter-protocol (PIXEL's image-gen version at parameter level) and patterns/unified-inference-binding (Cloudflare Workers AI text-LLM version at SDK level).
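A sketch of what wrapping a real-time-only provider behind the batch interface could look like; `call_provider`, the concurrency level, and the backoff schedule are all hypothetical:

```python
# Sketch of the real-time-fallback path: a provider with no batch API is
# wrapped behind the same batch-shaped interface by fanning requests out
# over a thread pool, with exponential backoff on rate limits.
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimited(Exception):
    pass

def call_with_backoff(call_provider, prompt, max_attempts=5, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return call_provider(prompt)
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, 8s, ...

def run_batch_over_realtime(call_provider, prompts, concurrency=8):
    """Batch-shaped interface over a real-time-only provider."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda p: call_with_backoff(call_provider, p),
                             prompts))
```

Because the caller only ever sees the CSV-in / merged-output-out contract, a provider that later gains a batch endpoint can be switched over without any client change, exactly as the post describes.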

  9. Platform-level investments compound across teams. Post names the Catalog, Fulfillment, and Search teams as distinct Maple consumers with different workloads (catalog data-cleaning + attribute enrichment; perishable-item routing; ranking-model training). "Many processes have been reduced from hundreds of thousands of dollars per year to just thousands of dollars per year." Canonical concepts/model-agnostic-ml-platform consolidation claim: "Maple democratises access to bulk LLM prompt processing at Instacart. Teams can now explore new ideas, automate repetitive work, and ship faster — without becoming LLM infrastructure experts."

Architectural shape

┌─────────────────────────────────────────┐
│  Internal client (Catalog / Fulfillment │
│  / Search / Ads team)                   │
└──────────────────────┬──────────────────┘
          │ CSV or Parquet file + prompt template
          │ (RPC API)
  ┌──────────────────────────────────┐
  │  Maple service (Python, PyArrow) │
  │  — Temporal workflow             │
  │  — S3-parquet intermediates      │
  │  — per-batch 50K-prompt / 200 MB │
  │    split                         │
  │  — stream-based processing       │
  │  — orjson for JSON parsing       │
  │  — failure-class retry policy    │
  └───────────────┬──────────────────┘
          │ encoded batch file (LLM provider batch format)
  ┌──────────────────────────────────┐
  │  Instacart AI Gateway            │
  │  — provider routing              │
  │  — Cost Tracker integration      │
  │  — per-team spend attribution    │
  └───────────────┬──────────────────┘
          │ provider-specific batch or real-time API
  ┌──────────────────────────────────┐
  │  External LLM provider           │
  │  — 50K-prompt / 200 MB batch cap │
  │  — 24h SLA                       │
  │  — ~2.6 tasks/sec mean throughput│
  └──────────────────────────────────┘

Intermediate storage shape:

Input CSV (client-provided)
  ↓ split + encode
Batch 1 Parquet (≤ 50K prompts)  →  provider batch upload
Batch 2 Parquet                  →  provider batch upload
...                              →  ...
Batch N Parquet                  →  provider batch upload
                                    ↓ poll for completion
                                  Download results
                                    ↓ decode + join by task ID
Per-batch result Parquet 1
Per-batch result Parquet 2
...
Per-batch result Parquet N
  ↓ merge
Final output file (CSV or Parquet, mirrors input format)
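The decode + join step above hinges on matching out-of-order provider results back to input rows by task ID. A minimal sketch with illustrative row shapes:

```python
# Sketch of the join-by-task-ID step: provider results can come back in any
# order, so each output row is matched to its input row by task ID before
# the per-batch results are merged into the final output.
def join_results(input_rows, result_rows):
    """input_rows / result_rows: dicts keyed by 'task_id'. Returns input
    rows enriched with the model output, preserving input order; tasks with
    no result (failed / expired) get output=None for the retry pass."""
    by_id = {r["task_id"]: r["output"] for r in result_rows}
    return [{**row, "output": by_id.get(row["task_id"])} for row in input_rows]

def merge_batches(per_batch_joined):
    """Concatenate per-batch joined outputs into the final result set."""
    return [row for batch in per_batch_joined for row in batch]
```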

Numbers disclosed

  • Batch ceiling: 50,000 prompts OR 200 MB per batch (LLM provider constraint).
  • Scale: 10M+ prompt jobs handled by Maple; at least 20 batches needed for a 1M-prompt job.
  • Cost: "up to 50% on LLM costs compared to standard real-time calls" — savings from batch vs real-time, not from Maple per se.
  • Production sample: ~580 batches, 40–50K tasks per batch (most at 50K).
  • Throughput: mean 2.6 prompts/sec per batch, distribution clustered 1–4 prompts/sec.
  • Batch completion time: most batches complete in < 12 h; some approach the 24 h SLA.
  • Provider SLA: 24 hours per batch.
  • Size reduction: Parquet claims up to 25× smaller than CSV (Instacart's attribution; intrinsic Parquet property).
  • Cost savings per workload: "hundreds of thousands of dollars per year to just thousands of dollars per year" on specific processes.

Numbers not disclosed

  • Maple's own service latency / throughput / QPS.
  • Temporal workflow / activity counts per job.
  • Concurrency across batches (serial vs parallel batch submission).
  • Real-time fallback path concurrency limits + per-provider rate-limit ceilings.
  • Per-team Cost Tracker numbers (absolute or relative).
  • AI-Gateway fan-out factor / provider count / which providers are integrated.
  • Specific LLM model(s) used (no OpenAI / Anthropic / Google / Mistral vendor name).
  • Error-rate breakdown across the four failure classes.
  • Image-check cost — the "significant overhead" rationale for deferring to retry #2 is qualitative.
  • orjson switchover impact (memory or latency delta).
  • DB-to-Parquet migration impact (costs before/after).
  • Temporal-specific ops datapoints (worker count, activity retry budget, durable-timer usage for 24h polling).

Caveats

  • Announcement-voice post — architecture section is solid but many implementation details gestured at rather than specified (Temporal activity decomposition, error-handling state machine, polling cadence, parallel-batch submission strategy).
  • LLM provider is unnamed — the 50K-prompt / 200 MB / 24 h SLA fits OpenAI's Batch API closely, but Anthropic's Message Batches API has similar constraints; post avoids vendor commitment. Matters because some of Maple's design choices (e.g. 50% savings, 24 h SLA, batch-vs-real-time capability split) are inherited from a specific provider's API surface, not general to all LLM providers.
  • "AI Gateway" is underspecified — the post names it, describes it as proxying requests and integrating with Cost Tracker, but does not disclose its architecture (is it a single worker? a fleet? per-region? does it do caching / semantic caching / model fallback like systems/cloudflare-ai-gateway?). The PIXEL post (sources/2025-07-17-instacart-introducing-pixel-instacarts-unified-image-generation-platform) also references "existing Instacart infra" without disclosing the AI Gateway level.
  • No failure mode analysis on Temporal itself — if Temporal is partitioned / unavailable, Maple jobs stall. Post doesn't address this (may have been deemed out of scope for a feature-overview post).
  • Sample size caveats — the ~580-batch sample is real but the prompt mix isn't characterised (how many include images? what modalities? what model?); throughput numbers don't generalise beyond Instacart's prompt-image mix.
  • Batch-API vs real-time choice is presented as either-or, but the real-time-fallback path is only for providers without a batch API — not a cost-optimisation fallback for small batches on providers with both. The "small batches complete more quickly" framing is about the real-time-only-providers path, not a Maple design knob.
  • Python-specific scale ceiling not named — stream-processing + orjson + PyArrow handle most of the memory problem, but 10M+ prompt jobs at a single Maple Temporal worker's GIL boundary is plausibly a bottleneck. Post doesn't say how Maple scales horizontally.
