PATTERN
CSV-in / Parquet-intermediate / output-merge¶
Intent¶
Accept user-friendly file formats (CSV or Parquet) at the service boundary, but immediately convert to a columnar intermediate format (Parquet) for all internal processing — batch splits, per-batch outputs, merge staging. The output file mirrors the input format so callers see a clean CSV-in / CSV-out abstraction and never observe the Parquet intermediates.
Why¶
Three distinct shape-of-data concerns get three different formats:
| Stage | Format | Why |
|---|---|---|
| Caller I/O | CSV or Parquet | What users have + what they want back |
| Intermediate batches | Parquet | ~25× size reduction vs CSV + random-access reads + per-column compression |
| Downstream merge | Parquet | Non-linear access to per-batch result files for efficient join-by-task-ID |
The pattern composes with stream-based processing — Parquet row-groups are the natural chunk boundary, and columnar reads let the service pull only the columns needed at each stage.
Mechanics (from Maple)¶
Maple's realisation (sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple|2025-08-27):
"Implemented in Python, Maple uses PyArrow to efficiently process input files. Large CSV files are split into smaller Parquet batch files, and stored on S3 to avoid costly database usage. Parquet is an efficient file format for data table storage, where out-of-the-box compression reduces file sizes up to 25x compared to CSV. It also allows non-linear access into the file, making data access extremely fast."
The storage shape:
Input (CSV or Parquet on S3)
│ split into batches respecting provider's 50K/200MB cap
▼
Per-batch Parquet files on S3
│ encode into provider-specific batch format
▼
Uploaded to provider
│ poll for completion
▼
Per-batch result Parquet files on S3
│ merge across all batches, join on task ID
▼
Final output file (CSV or Parquet, matching input format)
Inputs and outputs live on S3 throughout — Maple never loads the full job into memory, which is a precondition for 10M+ prompt jobs.
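The split step has to respect both provider caps at once — 50K rows and 200MB per batch. A pure-Python sketch of the planning arithmetic, assuming roughly uniform row sizes (a real splitter would measure actual row-group sizes rather than average them):

```python
def plan_batches(total_rows: int, total_bytes: int,
                 max_rows: int = 50_000,
                 max_bytes: int = 200 * 1024 * 1024):
    """Yield (start_row, n_rows) slices so each batch fits under both caps.

    Assumes uniform row size; the binding constraint is whichever cap
    yields the smaller batch.
    """
    bytes_per_row = total_bytes / max(total_rows, 1)
    rows_by_size = max(int(max_bytes // bytes_per_row), 1)
    step = min(max_rows, rows_by_size)
    for start in range(0, total_rows, step):
        yield start, min(step, total_rows - start)
```

With small rows the 50K-row cap binds; with wide rows (large prompts or completions) the 200MB cap takes over and batches shrink accordingly.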
Why not all-CSV or all-Parquet¶
- All CSV internally: the row-oriented format wastes space (files up to 25× larger), forces a full-file scan for column-selective reads, and puts slow per-record parsing and decompression on the hot path.
- All Parquet externally: many caller teams have CSV tooling already; forcing Parquet at the boundary is gratuitous friction. Pay the CSV→Parquet conversion cost once, at the boundary.
The asymmetry (CSV-accepted, Parquet-internal, output-format-mirrors-input) is the practical optimum.
Storage choice: S3, not a database¶
Maple's explicit design argument:
"Inputs and outputs are stored in S3, avoiding costly database operations. This approach is not only cheaper but also allows handling large datasets."
A database-backed intermediate store would:
- Cost more per byte at 10M+ prompt scale.
- Force full materialisation at query boundaries.
- Complicate the memory-discipline story.
S3 is cheap, highly durable, and Parquet-friendly (S3 Select / Parquet-aware clients can push column projections and predicates down).
Generalisation¶
The pattern is reusable anywhere a batch pipeline needs to:
- Accept a user-friendly input format at the boundary.
- Process in a columnar format internally for memory + compression wins.
- Return the result in the input's format.
Sibling pattern: patterns/metadata-plus-chunk-storage-stack (Sprites / Fly.io) uses the same split-storage stance at the block level; this pattern is its row-store-ETL analog.
Caveats¶
- Parquet write-amplification — every intermediate step emits a new Parquet file; lots of small Parquet files can hurt read performance (the "small-file problem" familiar from Hive / Spark workloads). Mitigate by sizing batches to produce reasonable-size files.
- Parquet columnar reads help read-heavy workloads; a merge step that joins N result files by task ID and reads every column doesn't benefit from columnar layout as much as a column-selective query would.
- CSV parsing at the ingest boundary is the one place per-record overhead bites; using an efficient CSV parser (PyArrow's CSV reader, not Python's stdlib `csv` module) matters.
Seen in¶
- sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple — canonical wiki instance. Maple split-then-Parquet architecture with S3 for all intermediate + final storage; 25× compression-ratio datapoint attributed to Parquet.
Related¶
- patterns/llm-batch-processing-service — parent service pattern.
- patterns/metadata-plus-chunk-storage-stack — sibling storage-shape pattern at a different abstraction level.
- concepts/stream-based-file-processing — the memory-discipline precondition.
- concepts/columnar-storage-format — Parquet's underlying property.
- systems/maple-instacart — canonical system.
- systems/aws-s3 — storage substrate.
- systems/apache-parquet — intermediate format.
- companies/instacart — operator.