INSTACART 2025-08-27 Tier 2

Instacart — Simplifying Large-Scale LLM Processing across Instacart with Maple

Summary

Instacart Engineering post (2025-08-27) describing Maple — an internal batch-LLM-processing service that turns millions-of-prompt jobs into a CSV/Parquet in / CSV/Parquet out interface, abstracting the LLM provider's 50K-prompt / 200 MB-per-batch batch API into a single RPC. Maple runs on Temporal for durable execution, stores inputs / intermediate batches / outputs on S3 as Parquet (25× compression vs CSV + columnar random access), proxies through an Instacart AI Gateway (distinct from the LLM provider) that integrates with a Cost Tracker for per-team usage accounting, and implements failure-class-specific retry policies (infinite for rate-limit + expired, bounded for refused, image-URL-check-on-retry for invalid-image). Reported outcomes: ~50% cost reduction vs real-time LLM calls, scale to 10M+ prompt jobs, batch throughput measured at ~2.6 prompts/sec avg with most batches completing in under 12 hours across a sample of ~580 batches at 40–50K tasks/batch. Maple was later extended to wrap non-batch (real-time-only) providers behind the same CSV interface with automatic parallelisation + exponential backoff — useful for ops-iteration-friendly small batches. Canonical batch-LLM-platform sibling of the text-AI-Gateway (Cloudflare / Databricks) + image-AI-platform (PIXEL) pattern graph: same "stop every team from DIY'ing this" ML-platform consolidation play, at the batch-inference layer.

Key takeaways

  1. Batch inference APIs are economically transformative but operationally hostile. LLM provider batch endpoints promise "up to 50% cost reduction vs real-time" but expose a 50K-prompt / 200 MB per-batch ceiling — a 1M-prompt job means at least 20 separate batches, each requiring encode → upload → status-poll → download → parse → retry-failed → repeat. Every team that tried to use them independently rewrote this workflow. Maple consolidates it into a CSV/Parquet in, merged-output out RPC. (Source: Maple)
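The split step above can be sketched in a few lines; the function name and the simple byte accounting are illustrative, not Instacart's actual implementation:

```python
# Sketch of the batch-split step (illustrative, not Maple's code): cut a job
# into batches that respect BOTH provider ceilings, 50K prompts per batch
# and 200 MB of encoded payload per batch.
MAX_PROMPTS = 50_000
MAX_BYTES = 200 * 1024 * 1024  # 200 MB

def split_into_batches(prompts):
    """Yield lists of encoded prompts, closing a batch when either cap is hit."""
    batch, batch_bytes = [], 0
    for prompt in prompts:
        encoded = prompt.encode("utf-8")
        if batch and (len(batch) >= MAX_PROMPTS
                      or batch_bytes + len(encoded) > MAX_BYTES):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(encoded)
        batch_bytes += len(encoded)
    if batch:
        yield batch

# A 1M-prompt job therefore needs at least 1_000_000 / 50_000 = 20 batches.
```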

  2. Temporal is the load-bearing substrate. Every activity in the pipeline (encode, upload, poll, download, decode, merge) is a Temporal activity; the overall job is a Temporal workflow. This is a canonical production instance of concepts/durable-execution applied to long-running batch pipelines: "Even if exceptions occur, Temporal's fault tolerance safeguards data integrity and guarantees job completion." Instacart's specific payoff: "protects against data loss but also avoids wasting money on partially completed jobs" — the cost angle is load-bearing because LLM batch inference is paid at batch-submit time, not at batch-complete time.

  3. S3-Parquet, not a database, for intermediate state. Inputs, per-batch splits, per-batch outputs, and final merged outputs all live on S3 as Parquet files. Stated rationale: "avoiding costly database operations … not only cheaper but also allows handling large datasets." Parquet-specific wins disclosed: up to 25× size reduction vs CSV (per-column compression), non-linear (random-access) reads into the file — the columnar property is used at merge time, not just at archival. patterns/metadata-plus-chunk-storage-stack at the batch-job granularity.

  4. The AI-Gateway layer is two-tier, not one-tier. Maple proxies all its LLM calls through an Instacart AI Gateway (internal service), which in turn routes to the external LLM provider + logs usage to a Cost Tracker. This is the classical patterns/ai-gateway-provider-abstraction pattern (Cloudflare / Databricks are the text-LLM siblings), but Maple is a consumer of the AI Gateway rather than the AI Gateway itself — the batch-processing layer sits above the provider-abstraction layer, which sits above the provider. Each concern lives at the layer where it is easiest to build once: batch pipeline = Maple, provider routing + cost tracking = AI Gateway, inference = LLM provider.

  5. Failure-class-specific retry policy is the heart of the reliability story. The post enumerates four task-level failure modes, each with its own policy: (a) Expired (provider fails to return within 24 h) → retry infinitely by default (construct a new batch with failed tasks); (b) Rate-limited (provider token-limit exceeded) → retry infinitely by default; (c) Refused (bad params, filtered image/prompt) → retry max 2× default ("probably return the same result" otherwise); (d) Invalid image (image URL dead or unreachable) → retry option that checks image existence before resubmitting, but only on the second attempt (checking every URL on the first pass "can add significant overhead"). Canonical instance of patterns/infinite-retry-by-failure-class — the retry policy is a function of which failure, not one-size-fits-all.
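The four policies can be sketched as a lookup from failure class to retry behavior. This is a plausible reading of the post, not Instacart's code; in particular, the post does not state a cap for invalid-image retries, so the infinite budget there is an assumption:

```python
# Sketch of failure-class-specific retry policy: the retry budget is a
# function of WHICH failure occurred, not a single global policy.
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: float              # math.inf = retry until job cancelled
    check_image_url: bool = False    # verify image URLs before resubmitting

POLICIES = {
    "expired":       RetryPolicy(max_attempts=math.inf),  # provider missed 24h SLA
    "rate_limited":  RetryPolicy(max_attempts=math.inf),  # provider token limit hit
    "refused":       RetryPolicy(max_attempts=2),         # bad params / filtered content
    # budget not disclosed in the post; assumed configurable, inf as default
    "invalid_image": RetryPolicy(max_attempts=math.inf, check_image_url=True),
}

def should_retry(failure_class: str, attempt: int) -> bool:
    return attempt < POLICIES[failure_class].max_attempts

def should_check_image(failure_class: str, attempt: int) -> bool:
    # Checking every URL up front "can add significant overhead", so the
    # existence check only runs from the second attempt onward.
    return POLICIES[failure_class].check_image_url and attempt >= 2
```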

  6. Performance disclosures, from a ~580-batch / 40–50K-tasks-each sample — real production data, not vendor-quoted: mean throughput ~2.6 prompts/sec per batch (histogram clustered 1–4 prompts/sec), most batches complete in under 12 hours (some take nearly the full 24 h SLA), completion time increases with job size (positive slope on a log-y-axis scatter plot). "Processing time can vary based on the prompt, especially when including images with the prompt, which is a common case for us." This is the wiki's first ingested benchmark of production LLM batch-API latency at scale.

  7. Scale optimizations were not theoretical; they were forced. Three specific upgrades named: (a) moved task data from DB to S3 Parquet as input sizes grew; (b) adopted stream-based processing to bound memory consumption (classic concepts/stream-based-file-processing move — don't load a 1M-prompt CSV into RAM); (c) replaced the Python json stdlib with orjson, a "faster and more memory-efficient alternative." Each swap is small, but the canonical lesson is: "as our internal clients sent larger and larger input files, we hit storage, memory, and processing limitations." These optimizations "allowed Maple to scale efficiently to 10M+ prompt jobs."
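The stream-based move in (b) amounts to holding one row in memory at a time. A minimal sketch, with stdlib json standing in for orjson (which exposes the same loads()/dumps() surface); the function and template names are illustrative:

```python
# Sketch of stream-based file processing: encode a multi-gigabyte input CSV
# row by row instead of loading it all into RAM.
import csv
import json  # Maple swapped this for orjson; stdlib keeps the sketch dependency-free

def stream_encode(csv_path, out_path, prompt_template):
    """Encode each CSV row into one JSON-lines request, holding only a
    single row in memory at a time."""
    with open(csv_path, newline="") as src, open(out_path, "w") as dst:
        for row in csv.DictReader(src):
            request = {
                "task_id": row["task_id"],
                "prompt": prompt_template.format(**row),
            }
            dst.write(json.dumps(request) + "\n")
```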

  8. Batch-then-real-time-fallback unifies the interface across provider capabilities. Not all LLM providers offer batch; some are real-time-only. Rather than force teams to pick a provider based on batch availability, Maple wraps real-time-only providers behind the same CSV/Parquet interface with automatic parallelisation, exponential backoff on rate limits, intelligent retry policies, and failure tracking. "If a provider starts offering a batch interface, we can switch it over seamlessly without our users needing to do anything." Canonical patterns/batch-then-real-time-fallback — the caller interface is batch-shaped regardless of what the underlying provider supports, and small batches complete faster under real-time routing (useful for iterative ops tasks). The platform hides provider-capability heterogeneity behind a stable CSV contract — the batch-layer sibling of concepts/unified-parameter-protocol (PIXEL's image-gen version at parameter level) and patterns/unified-inference-binding (Cloudflare Workers AI text-LLM version at SDK level).
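A sketch of what wrapping a real-time-only provider behind the batch interface could look like; `call_provider`, the concurrency level, and the backoff schedule are all hypothetical:

```python
# Sketch of the real-time-fallback path: a provider with no batch API is
# wrapped behind the same batch-shaped interface by fanning requests out
# over a thread pool, with exponential backoff on rate limits.
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimited(Exception):
    pass

def call_with_backoff(call_provider, prompt, max_attempts=5, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return call_provider(prompt)
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, 8s, ...

def run_batch_over_realtime(call_provider, prompts, concurrency=8):
    """Batch-shaped interface over a real-time-only provider."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda p: call_with_backoff(call_provider, p),
                             prompts))
```

Because the caller only ever sees the CSV-in / merged-output-out contract, a provider that later gains a batch endpoint can be switched over without any client change, exactly as the post describes.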

  9. Platform-level investments compound across teams. Post names the Catalog, Fulfillment, and Search teams as distinct Maple consumers with different workloads (catalog data-cleaning + attribute enrichment; perishable-item routing; ranking-model training). "Many processes have been reduced from hundreds of thousands of dollars per year to just thousands of dollars per year." Canonical concepts/model-agnostic-ml-platform consolidation claim: "Maple democratises access to bulk LLM prompt processing at Instacart. Teams can now explore new ideas, automate repetitive work, and ship faster — without becoming LLM infrastructure experts."

Architectural shape

┌─────────────────────────────────────────┐
│  Internal client (Catalog / Fulfillment │
│  / Search / Ads team)                   │
└──────────────────────┬──────────────────┘
          │ CSV or Parquet file + prompt template
          │ (RPC API)
  ┌──────────────────────────────────┐
  │  Maple service (Python, PyArrow) │
  │  — Temporal workflow             │
  │  — S3-parquet intermediates      │
  │  — per-batch 50K-prompt / 200 MB │
  │    split                         │
  │  — stream-based processing       │
  │  — orjson for JSON parsing       │
  │  — failure-class retry policy    │
  └───────────────┬──────────────────┘
          │ encoded batch file (LLM provider batch format)
  ┌──────────────────────────────────┐
  │  Instacart AI Gateway            │
  │  — provider routing              │
  │  — Cost Tracker integration      │
  │  — per-team spend attribution    │
  └───────────────┬──────────────────┘
          │ provider-specific batch or real-time API
  ┌──────────────────────────────────┐
  │  External LLM provider           │
  │  — 50K-prompt / 200 MB batch cap │
  │  — 24h SLA                       │
  │  — ~2.6 tasks/sec mean throughput│
  └──────────────────────────────────┘

Intermediate storage shape:

Input CSV (client-provided)
  ↓ split + encode
Batch 1 Parquet (≤ 50K prompts)  →  provider batch upload
Batch 2 Parquet                  →  provider batch upload
...                              →  ...
Batch N Parquet                  →  provider batch upload
                                    ↓ poll for completion
                                  Download results
                                    ↓ decode + join by task ID
Per-batch result Parquet 1
Per-batch result Parquet 2
...
Per-batch result Parquet N
  ↓ merge
Final output file (CSV or Parquet, mirrors input format)
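The decode + join step above hinges on matching out-of-order provider results back to input rows by task ID. A minimal sketch with illustrative row shapes:

```python
# Sketch of the join-by-task-ID step: provider results can come back in any
# order, so each output row is matched to its input row by task ID before
# the per-batch results are merged into the final output.
def join_results(input_rows, result_rows):
    """input_rows / result_rows: dicts keyed by 'task_id'. Returns input
    rows enriched with the model output, preserving input order; tasks with
    no result (failed / expired) get output=None for the retry pass."""
    by_id = {r["task_id"]: r["output"] for r in result_rows}
    return [{**row, "output": by_id.get(row["task_id"])} for row in input_rows]

def merge_batches(per_batch_joined):
    """Concatenate per-batch joined outputs into the final result set."""
    return [row for batch in per_batch_joined for row in batch]
```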

Numbers disclosed

  • Batch ceiling: 50,000 prompts OR 200 MB per batch (LLM provider constraint).
  • Scale: 10M+ prompt jobs handled by Maple; at least 20 batches needed for a 1M-prompt job.
  • Cost: "up to 50% on LLM costs compared to standard real-time calls" — savings from batch vs real-time, not from Maple per se.
  • Production sample: ~580 batches, 40–50K tasks per batch (most at 50K).
  • Throughput: mean 2.6 prompts/sec per batch, distribution clustered 1–4 prompts/sec.
  • Batch completion time: most batches complete in < 12 h; some approach the 24 h SLA.
  • Provider SLA: 24 hours per batch.
  • Size reduction: Parquet claims up to 25× smaller than CSV (Instacart's attribution; intrinsic Parquet property).
  • Cost savings per workload: "hundreds of thousands of dollars per year to just thousands of dollars per year" on specific processes.

Numbers not disclosed

  • Maple's own service latency / throughput / QPS.
  • Temporal workflow / activity counts per job.
  • Concurrency across batches (serial vs parallel batch submission).
  • Real-time fallback path concurrency limits + per-provider rate-limit ceilings.
  • Per-team Cost Tracker numbers (absolute or relative).
  • AI-Gateway fan-out factor / provider count / which providers are integrated.
  • Specific LLM model(s) used (no OpenAI / Anthropic / Google / Mistral vendor name).
  • Error-rate breakdown across the four failure classes.
  • Image-check cost — the "significant overhead" rationale for deferring to retry #2 is qualitative.
  • orjson switchover impact (memory or latency delta).
  • DB-to-Parquet migration impact (costs before/after).
  • Temporal-specific ops datapoints (worker count, activity retry budget, durable-timer usage for 24h polling).

Caveats

  • Announcement-voice post — architecture section is solid but many implementation details gestured at rather than specified (Temporal activity decomposition, error-handling state machine, polling cadence, parallel-batch submission strategy).
  • LLM provider is unnamed — the 50K-prompt / 200 MB / 24 h SLA fits OpenAI's Batch API closely, but Anthropic's Message Batches API has similar constraints; post avoids vendor commitment. Matters because some of Maple's design choices (e.g. 50% savings, 24 h SLA, batch-vs-real-time capability split) are inherited from a specific provider's API surface, not general to all LLM providers.
  • "AI Gateway" is underspecified — the post names it, describes it as proxying requests and integrating with Cost Tracker, but does not disclose its architecture (is it a single worker? a fleet? per-region? does it do caching / semantic caching / model fallback like systems/cloudflare-ai-gateway?). The PIXEL post (sources/2025-07-17-instacart-introducing-pixel-instacarts-unified-image-generation-platform) also references "existing Instacart infra" without disclosing the AI Gateway level.
  • No failure mode analysis on Temporal itself — if Temporal is partitioned / unavailable, Maple jobs stall. Post doesn't address this (may have been deemed out of scope for a feature-overview post).
  • Sample size caveats — the ~580-batch sample is real but the prompt mix isn't characterised (how many include images? what modalities? what model?); throughput numbers don't generalise beyond Instacart's prompt-image mix.
  • Batch-API vs real-time choice is presented as either-or, but the real-time-fallback path is only for providers without a batch API — not a cost-optimisation fallback for small batches on providers with both. The "small batches complete more quickly" framing is about the real-time-only-providers path, not a Maple design knob.
  • Python-specific scale ceiling not named — stream-processing + orjson + PyArrow handle most of the memory problem, but 10M+ prompt jobs at a single Maple Temporal worker's GIL boundary is plausibly a bottleneck. Post doesn't say how Maple scales horizontally.
