
CONCEPT

Stream-based file processing

Definition

Stream-based file processing is the discipline of processing files one record at a time (or one chunk at a time) rather than loading the full file into memory. It bounds per-job memory consumption to the record size, not the file size — enabling one worker process to handle files orders of magnitude larger than its RAM.

The classical anti-pattern it displaces: "load the CSV into a dataframe, iterate, write the result." That works fine for MB-scale files, falls over at GB scale, and crashes outright at tens of GB.
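The contrast can be sketched in a few lines of stdlib Python (a minimal example, not Maple's code — the column names and the derived `total` field are invented for illustration):

```python
import csv
import io

def process_streaming(src, dst):
    """Process a CSV one row at a time: memory stays O(row), not O(file)."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["id", "total"])
    writer.writeheader()
    for row in reader:  # exactly one record in memory at a time
        writer.writerow({"id": row["id"],
                         "total": int(row["qty"]) * int(row["price"])})

# in-memory files stand in for large on-disk CSVs
src = io.StringIO("id,qty,price\n1,2,5\n2,3,4\n")
dst = io.StringIO()
process_streaming(src, dst)
```

The dataframe version materialises every row before the first one is processed; the iterator version emits output as input arrives, so peak memory never tracks file size.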

Why it shows up at LLM-batch scale

The Instacart Maple post (sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple|2025-08-27) explicitly names stream-based processing as a scale-forced fix:

"As our internal clients sent larger and larger input files, we hit storage, memory, and processing limitations. … We adopted stream-based processing to minimize memory consumption when handling large files."

With LLM batch APIs capped at 50K prompts / 200 MB per batch, a 10M-prompt job is hundreds of GB of intermediate state — orders of magnitude beyond any reasonable per-process memory budget.

Three specific Maple optimisations compose into stream processing:

  • Data on S3, not in a database — avoids loading through a query boundary that forces full materialisation.
  • Parquet instead of CSV — columnar format with per-column compression (~25×) + random-access support means reads can be selective (only the columns needed) and incremental (one row-group at a time).
  • orjson instead of Python stdlib json — faster + more memory-efficient JSON parser. Per-record cost matters when the per-record count is in the tens of millions.
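The orjson swap can be sketched as a drop-in parser behind a streaming record iterator (a hypothetical JSONL results file; the fallback to stdlib `json` keeps the snippet runnable when orjson isn't installed):

```python
import io

try:
    import orjson            # faster, more memory-efficient C parser
    loads = orjson.loads
except ImportError:          # graceful fallback to the stdlib
    import json
    loads = json.loads

def iter_records(lines):
    """Parse one JSONL record at a time — never the whole file."""
    for line in lines:
        if line.strip():
            yield loads(line)

# hypothetical batch-output file: one JSON object per line
raw = io.StringIO('{"id": 1, "ok": true}\n{"id": 2, "ok": false}\n')
ids = [rec["id"] for rec in iter_records(raw)]
```

At tens of millions of records, shaving microseconds off each `loads` call compounds into meaningful wall-clock and allocation savings.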

Generalisation

Stream processing applies anywhere a pipeline's input grows beyond what a single process can fit in RAM:

  • ETL pipelines — process records one at a time; emit to output sink as you go.
  • Log processing — scan sequentially, aggregate on the fly.
  • Large-file uploads — multipart/chunked bodies, each chunk acknowledged before the next is read.
  • Database migrations — cursor-based row-by-row read, not SELECT *.
  • CSV/Parquet ETL — use library primitives that return iterators / chunked readers, not full-file loaders.

The generalisation is: memory consumption should be O(record size), not O(file size). Once a pipeline crosses into O(file size) memory territory, it carries a latent ceiling — eventual input growth will crash it.
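The O(record size) invariant falls out naturally from generator composition (a generic sketch — the 4-byte chunk size is absurdly small, chosen only to make the chunking visible):

```python
import io

def read_chunks(fileobj, chunk_size):
    """Yield fixed-size chunks; peak memory is O(chunk_size), not O(file)."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def etl(source, sink, transform, chunk_size=4):
    """Stream source -> transform -> sink, emitting output as you go."""
    for chunk in read_chunks(source, chunk_size):
        sink.write(transform(chunk))

src = io.StringIO("hello world")
dst = io.StringIO()
etl(src, dst, str.upper)
```

The same shape covers the list above: swap `read_chunks` for a CSV row iterator, a log-line reader, or a database cursor, and the memory bound holds.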

Implementation knobs

  • Chunk size: larger chunks amortise per-chunk overhead but raise peak RSS. Typical sweet spot is 1-10 MB at the columnar-file layer.
  • Backpressure: when the downstream is slower than the reader, throttle the reader — otherwise the in-flight chunk queue grows without bound. See concepts/backpressure.
  • Parallelism: chunks can usually be processed concurrently; need to preserve order only at merge time.
  • Fault tolerance: chunk-granular state lets jobs resume from the last successfully-processed chunk on crash — composes with concepts/durable-execution substrates like Temporal.
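The backpressure knob in particular is often just a bounded queue between reader and processor (a minimal sketch; the thread layout and `max_in_flight` default are illustrative assumptions, not from the source):

```python
import queue
import threading

def run_pipeline(chunks, process, max_in_flight=4):
    """Reader thread feeds a bounded queue; q.put() blocks when the
    consumer falls behind, which throttles the reader (backpressure)."""
    q = queue.Queue(maxsize=max_in_flight)  # bound = the backpressure knob
    SENTINEL = object()

    def reader():
        for chunk in chunks:
            q.put(chunk)          # blocks once max_in_flight chunks queued
        q.put(SENTINEL)

    threading.Thread(target=reader, daemon=True).start()
    results = []
    while (item := q.get()) is not SENTINEL:
        results.append(process(item))
    return results

out = run_pipeline(range(10), lambda c: c * 2, max_in_flight=2)
```

Raising `max_in_flight` trades peak memory for reader/processor overlap — the same trade-off as the chunk-size knob, one level up.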

Seen in
