CONCEPT
Stream-based file processing¶
Definition¶
Stream-based file processing is the discipline of processing files one record at a time (or one chunk at a time) rather than loading the full file into memory. It bounds per-job memory consumption to the record size, not the file size — enabling one worker process to handle files orders of magnitude larger than its RAM.
The classical anti-pattern it displaces: "load the CSV into a dataframe, iterate, write the result." Works fine for MB-scale files, falls over at GB-scale, crashes at 10×+ GB.
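The streaming alternative to the full-load pattern can be sketched with the stdlib csv module alone; a generator keeps exactly one row resident at a time. The file contents and column names here are hypothetical stand-ins for a multi-GB input:

```python
import csv
import io

def stream_rows(fileobj):
    """Yield one parsed record at a time: memory stays O(row), not O(file)."""
    for row in csv.DictReader(fileobj):
        yield row

# Hypothetical two-row input standing in for a multi-GB CSV.
src = io.StringIO("prompt,model\nhello,m1\nworld,m1\n")
count = sum(1 for _ in stream_rows(src))
```

Because `stream_rows` is a generator, downstream code can consume, transform, and emit each record before the next is even parsed.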
Why it shows up at LLM-batch scale¶
The Instacart Maple post (sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple|2025-08-27) explicitly names stream-based processing as a scale-forced fix:
"As our internal clients sent larger and larger input files, we hit storage, memory, and processing limitations. … We adopted stream-based processing to minimize memory consumption when handling large files."
With LLM batch APIs capped at 50K prompts / 200 MB per batch, a 10M-prompt job is hundreds of GB of intermediate state — orders of magnitude beyond any reasonable per-process memory budget.
Three specific Maple optimisations compose into stream processing:
- Data on S3 not in a database — avoids loading through a query boundary that forces full materialisation.
- Parquet instead of CSV — columnar format with per-column compression (~25×) + random-access support means reads can be selective (only the columns needed) and incremental (one row-group at a time).
- orjson instead of Python stdlib json — a faster and more memory-efficient JSON parser. Per-record cost matters when the record count is in the tens of millions.
Generalisation¶
Stream processing applies anywhere a pipeline's input grows beyond what a single process can fit in RAM:
- ETL pipelines — process records one at a time; emit to output sink as you go.
- Log processing — scan sequentially, aggregate on the fly.
- Large-file uploads — multipart/chunked bodies, each chunk acknowledged before the next is read.
- Database migrations — cursor-based row-by-row reads, not SELECT *.
- CSV/Parquet ETL — use library primitives that return iterators / chunked readers, not full-file loaders.
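The database-migration bullet can be made concrete with sqlite3's `fetchmany`, which bounds memory to one batch of rows regardless of table size (the table schema and batch size below are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, v TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)", [(i, "x") for i in range(10)])

def migrate(conn, batch_size=4):
    """Cursor-based read: fetch a bounded batch, process it, repeat."""
    cur = conn.execute("SELECT id, v FROM src ORDER BY id")
    moved = 0
    while True:
        rows = cur.fetchmany(batch_size)  # never more than batch_size rows in memory
        if not rows:
            break
        moved += len(rows)  # stand-in for writing rows to the new schema
    return moved

migrated = migrate(conn)
```

The contrast with `cur.fetchall()` is exactly the O(record) vs O(file) distinction: the batched loop works identically whether the table has ten rows or ten billion.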
The generalisation is: memory consumption should be O(record size), not O(file size). Once a pipeline crosses into O(file size) memory territory, it becomes a ticking ceiling — eventual input growth crashes it.
Implementation knobs¶
- Chunk size: larger chunks amortise per-chunk overhead but raise peak RSS. Typical sweet spot is 1-10 MB at the columnar-file layer.
- Backpressure: when the downstream is slower than the reader, throttle the reader — otherwise the in-flight chunk queue grows without bound. See concepts/backpressure.
- Parallelism: chunks can usually be processed concurrently; need to preserve order only at merge time.
- Fault tolerance: chunk-granular state lets jobs resume from the last successfully-processed chunk on crash — composes with concepts/durable-execution substrates like Temporal.
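The fault-tolerance knob reduces to persisting a chunk index alongside the work; a toy sketch of the resume logic (the checkpoint dict and chunk contents are invented — a durable-execution substrate like Temporal would persist this state for you):

```python
def process_chunks(chunks, checkpoint, handle):
    """Skip chunks already recorded as done, then advance the checkpoint per chunk."""
    for i, chunk in enumerate(chunks):
        if i < checkpoint.get("done", 0):
            continue  # finished before the crash -- do not reprocess
        handle(chunk)
        checkpoint["done"] = i + 1  # a real system persists this durably
    return checkpoint["done"]

processed = []
ckpt = {"done": 1}  # pretend chunk 0 completed before a restart
done = process_chunks(["c0", "c1", "c2"], ckpt, processed.append)
```

Note the checkpoint advances only after `handle` returns, so a crash mid-chunk replays at most one chunk — at-least-once semantics at chunk granularity.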
Seen in¶
- sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple — Instacart Maple cites stream-based processing as load-bearing for scaling to 10M+ prompt jobs, composing S3 + Parquet storage, PyArrow, and orjson into a bounded-memory batch pipeline.
Related¶
- concepts/llm-batch-api — one producer of stream-processing pressure at 10M+ prompt scale.
- concepts/backpressure — the flow-control discipline that keeps stream pipelines from accumulating unbounded buffer.
- concepts/durable-execution — the substrate that composes cleanly with chunk-granular state.
- patterns/llm-batch-processing-service — Maple's canonical pattern.
- systems/maple-instacart — canonical production instance at 10M+ prompt scale.
- systems/apache-parquet — the columnar format that makes stream processing cheap.