CONCEPT Cited by 1 source
Data file analysis¶
Definition¶
Data file analysis is the agent primitive of scanning every file in a working directory — structured and unstructured — and emitting a rich textual description of each file's structure and contents before planning begins. The descriptions become grounding context for the downstream Planner agent.
The primitive is named in Google Research's DS-STAR architecture, where a dedicated Data File Analyzer agent writes and executes a Python file-summarisation script as the first stage of the agent loop (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).
Why it matters¶
Prior data-science agents relied on schema inspection of well-structured inputs (CSV headers, relational-database catalogues). That approach "ignores the valuable information contained in the diverse and heterogeneous data formats, such as JSON, unstructured text, and markdown files, that are common in real-world applications" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).
Data file analysis extends the grounding substrate to heterogeneous formats by having the agent write its own inspection code instead of relying on hard-coded format parsers. As a result, the Planner receives a rich natural-language description of each file (columns, key ranges, null patterns, free-text samples, markdown structure) that it can reason against.
Ablation-critical¶
DS-STAR's Variant 1 ablation removes the Data File Analyzer's descriptions and measures DABStep hard-task accuracy collapsing from 45.2 % to 26.98 % — "underscoring the importance of rich data context for effective planning and implementation" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).
This is one of the clearer numeric answers on the wiki to "how much does pre-loop context matter for an LLM planning agent?": for hard heterogeneous-data tasks, removing the descriptions costs about 18 accuracy points (45.2 % down to 26.98 %), a relative drop of roughly 40 %.
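The two headline figures follow directly from the ablation numbers quoted above. A minimal check of the arithmetic (variable names are illustrative only):

```python
# Accuracy figures quoted above for DABStep hard tasks.
full_accuracy = 45.2      # percent, with Data File Analyzer descriptions
ablated_accuracy = 26.98  # percent, Variant 1 (descriptions removed)

absolute_drop = full_accuracy - ablated_accuracy     # percentage points
relative_drop = absolute_drop / full_accuracy * 100  # percent of the full score

print(f"absolute drop: {absolute_drop:.2f} points")  # absolute drop: 18.22 points
print(f"relative drop: {relative_drop:.1f} %")       # relative drop: 40.3 %
```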
Mechanism¶
- Directory scan. Agent enumerates all files under the working directory.
- Per-file inspection. Agent writes a small Python script per file to extract format-appropriate summaries: column names, dtypes, and value ranges for CSV; key paths and sample values for JSON; heading structure, length, and sample content for markdown; the first kilobytes and detected encoding for arbitrary text.
- Textual emission. Summaries are emitted as natural-language descriptions, readable by the Planner.
- Inject into context. Descriptions are concatenated into the Planner's system prompt or initial message as grounding material.
The key aspect of the primitive is that the agent writes the inspection code itself. There is no pre-supplied list of format parsers: the LLM generates whatever each file type needs, the code is executed via the Coder substrate, and the agent reads back the output.
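The four steps above can be sketched as follows. In DS-STAR the inspection script is generated by the LLM, so this hand-written version only illustrates the kind of script an analyzer might produce and the shape of its output. The function names, the 4 KB sample limit, the extension-based dispatch, and the summary wording are all assumptions for illustration, not details from the source.

```python
import csv
import io
import json
from pathlib import Path

SAMPLE_BYTES = 4096  # assumption: read at most this much of each file


def describe_csv(text: str) -> str:
    # The last sampled row may be cut off if the file exceeds SAMPLE_BYTES.
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return "empty CSV"
    header, body = rows[0], rows[1:]
    return f"CSV with columns {header}; {len(body)} sampled data rows"


def describe_json(text: str) -> str:
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return "JSON (sample could not be parsed, possibly truncated)"
    if isinstance(data, dict):
        return f"JSON object with top-level keys {sorted(data)}"
    if isinstance(data, list):
        return f"JSON array with {len(data)} items"
    return f"JSON scalar of type {type(data).__name__}"


def describe_markdown(text: str) -> str:
    headings = [
        line.lstrip("#").strip()
        for line in text.splitlines()
        if line.startswith("#")
    ]
    return f"Markdown with headings {headings}"


def describe_text(text: str) -> str:
    first_line = text.splitlines()[0] if text.strip() else ""
    return f"text file, {len(text)} sampled characters, first line: {first_line!r}"


# Hypothetical dispatch table; a generated script would pick its own checks.
DESCRIBERS = {".csv": describe_csv, ".json": describe_json, ".md": describe_markdown}


def describe_file(path: Path) -> str:
    """Per-file inspection: summarise a bounded sample of one file."""
    with path.open("rb") as handle:
        raw = handle.read(SAMPLE_BYTES)
    text = raw.decode("utf-8", errors="replace")
    describer = DESCRIBERS.get(path.suffix.lower(), describe_text)
    return describer(text)


def analyze_directory(root: Path) -> str:
    """Directory scan plus textual emission: one description line per file."""
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return "\n".join(
        f"{p.relative_to(root).as_posix()}: {describe_file(p)}" for p in files
    )


def build_planner_context(root: Path) -> str:
    """Inject into context: text to prepend to the Planner's first message."""
    return "Files available in the working directory:\n" + analyze_directory(root)
```

Calling `build_planner_context(Path("workdir"))` on a directory of CSV, JSON, and markdown files returns a header line followed by one description per file. A fixed dispatch table like `DESCRIBERS` is exactly the kind of hard-coded parser list the primitive is meant to avoid; it stands in here for whatever the LLM would generate per file.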
Tradeoffs / gotchas¶
- Cost. Pre-loop file analysis pays inference + execution time before the Planner even starts. For large working directories or big files, the summary-generation step can itself be expensive.
- Truncation. Very large files (GB-scale) can't be fully summarised; the analyzer must decide what to sample. The DS-STAR post does not specify a sampling strategy.
- Secrets / sensitive data. A scan-and-summarise primitive will happily read whatever is in the working directory, including files the user didn't mean the agent to inspect. Scope control belongs outside this primitive.
- Summary drift. The file contents may change during the loop (Coder writes intermediate files). Whether the analyzer is re-run when the directory changes is a design decision the DS-STAR post doesn't specify.
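Two of the gotchas above, scope control and summary drift, can be handled by a thin layer outside the analyzer. The sketch below is one possible approach and is not described in the DS-STAR post: `in_scope` checks paths against a deny-list of glob patterns, and `directory_fingerprint` hashes each in-scope file's relative path, size, and modification time so a caller can tell when existing descriptions are stale. The function names and the default patterns are hypothetical.

```python
import fnmatch
import hashlib
from pathlib import Path

# Hypothetical deny-list; a real deployment would define its own policy.
DENY_PATTERNS = (".env", "*.pem", "*.key", ".git/*")


def in_scope(relative_path: str, deny_patterns=DENY_PATTERNS) -> bool:
    """Return False if the POSIX-style path or its file name matches a deny pattern."""
    name = relative_path.rsplit("/", 1)[-1]
    return not any(
        fnmatch.fnmatchcase(relative_path, pattern)
        or fnmatch.fnmatchcase(name, pattern)
        for pattern in deny_patterns
    )


def directory_fingerprint(root: Path) -> str:
    """Hash (path, size, mtime) of every in-scope file under root."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        if not in_scope(rel):
            continue
        stat = path.stat()
        record = f"{rel}\0{stat.st_size}\0{stat.st_mtime_ns}\n"
        digest.update(record.encode("utf-8"))
    return digest.hexdigest()
```

A caller could store the fingerprint when descriptions are generated and compare it before each planning round, re-running the analyzer only when the value changes. Size and modification time are cheap to read but can miss a same-size edit within the filesystem's timestamp resolution; hashing file contents would be more robust at higher cost.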
Seen in¶
- sources/2025-11-06-google-ds-star-versatile-data-science-agent — canonical wiki instance. Data File Analyzer as agent 1 of 5 in DS-STAR; ablation-quantified as load-bearing (45.2 % → 26.98 % without it on DABStep hard tasks).
Related¶
- concepts/heterogeneous-data-formats — the problem class data file analysis is the answer to.
- concepts/iterative-plan-refinement — pre-loop context that the downstream inner-loop refinement builds on.
- systems/ds-star — the agent system this primitive appears in.
- patterns/planner-coder-verifier-router-loop — the pattern the Data File Analyzer feeds into (pre-loop stage).