CONCEPT Cited by 1 source

Data file analysis

Definition

Data file analysis is the agent primitive of scanning every file in a working directory — structured and unstructured — and emitting a rich textual description of each file's structure and contents before planning begins. The descriptions become grounding context for the downstream Planner agent.

The primitive is named in Google Research's DS-STAR architecture, where a dedicated Data File Analyzer agent writes and executes a Python file-summarisation script as the first stage of the agent loop (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).

Why it matters

Prior data-science agents relied on schema inspection of well-structured inputs (CSV headers, relational-database catalogues). That approach "ignores the valuable information contained in the diverse and heterogeneous data formats, such as JSON, unstructured text, and markdown files, that are common in real-world applications" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).

Data file analysis extends the grounding substrate to heterogeneous formats by having the agent write its own inspection code instead of relying on hard-coded format parsers. Result: the Planner receives rich natural-language descriptions of each file (columns, key ranges, null patterns, free-text samples, markdown structure) it can reason against.

Ablation-critical

DS-STAR's Variant 1 ablation removes the Data File Analyzer's descriptions; DABStep hard-task accuracy collapses from 45.2% to 26.98%, "underscoring the importance of rich data context for effective planning and implementation" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).

This is one of the clearer numeric answers on the wiki to "how much does pre-loop context matter for an LLM planning agent?" — for hard heterogeneous-data tasks, the answer is about 18 accuracy points, a roughly 40% relative drop.

Mechanism

  1. Directory scan. Agent enumerates all files under the working directory.
  2. Per-file inspection. Agent writes a small Python script per file to extract format-appropriate summaries — column names + dtypes + value ranges for CSV, key paths + sample values for JSON, heading structure + length + sample content for markdown, first kilobytes + detected encoding for arbitrary text.
  3. Textual emission. Summaries are emitted as natural-language descriptions, readable by the Planner.
  4. Inject into context. Descriptions are concatenated into the Planner's system prompt or initial message as grounding material.

The "agent writes the inspection code" aspect is the key primitive: there is no pre-supplied list of format parsers. The LLM generates whatever each file type needs, executes it via the Coder substrate, and reads back the output.
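A minimal sketch of steps 1–4. In DS-STAR the per-file inspection code is LLM-generated; here the per-format dispatch is hard-coded purely for illustration, and all names (`describe_file`, `analyze_directory`, the sample cap) are hypothetical:

```python
import csv
import json
from pathlib import Path

SAMPLE_BYTES = 2048  # hypothetical sample cap for arbitrary text files


def describe_file(path: Path) -> str:
    """Step 2-3: emit a short natural-language description of one file."""
    if path.suffix == ".csv":
        with path.open(newline="") as f:
            rows = list(csv.reader(f))
        header, body = rows[0], rows[1:]
        return (f"{path.name}: CSV with columns {header}, {len(body)} data rows, "
                f"sample row {body[0] if body else 'none'}")
    if path.suffix == ".json":
        data = json.loads(path.read_text())
        shape = list(data) if isinstance(data, dict) else f"array of {len(data)}"
        return f"{path.name}: JSON, top-level keys/shape: {shape}"
    if path.suffix == ".md":
        headings = [l for l in path.read_text().splitlines() if l.startswith("#")]
        return f"{path.name}: markdown with headings {headings}"
    # Fallback for arbitrary text: raw sample of the first bytes.
    sample = path.read_bytes()[:SAMPLE_BYTES].decode("utf-8", errors="replace")
    return f"{path.name}: text, opening sample {sample[:80]!r}"


def analyze_directory(workdir: Path) -> str:
    """Steps 1 and 4: scan the directory, then concatenate the per-file
    descriptions into one grounding-context string for the Planner."""
    descriptions = [describe_file(p)
                    for p in sorted(workdir.rglob("*")) if p.is_file()]
    return "Data files available:\n" + "\n".join(f"- {d}" for d in descriptions)
```

The returned string would then be prepended to the Planner's system prompt or first message as grounding material.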

Tradeoffs / gotchas

  • Cost. Pre-loop file analysis pays inference + execution time before the Planner even starts. For large working directories or big files, the summary-generation step can itself be expensive.
  • Truncation. Very large files (GB-scale) can't be fully summarised; the analyzer must decide what to sample. Not specified in the DS-STAR post.
  • Secrets / sensitive data. A scan-and-summarise primitive will happily read whatever is in the working directory, including files the user didn't mean the agent to inspect. Scope control belongs outside this primitive.
  • Summary drift. The file contents may change during the loop (Coder writes intermediate files). Whether the analyzer is re-run when the directory changes is a design decision the DS-STAR post doesn't specify.
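For the truncation gotcha, one plausible guard (not from the DS-STAR post; the budget value and function name are assumptions) is head-plus-tail sampling, so the analyzer reads a bounded slice of a GB-scale file rather than the whole thing:

```python
from pathlib import Path

MAX_SAMPLE = 64 * 1024  # hypothetical per-file byte budget


def sample_file(path: Path, budget: int = MAX_SAMPLE) -> str:
    """Return the file's text if small, else head and tail slices with
    an explicit marker noting how many bytes were skipped."""
    size = path.stat().st_size
    with path.open("rb") as f:
        if size <= budget:
            return f.read().decode("utf-8", errors="replace")
        head = f.read(budget // 2)
        f.seek(size - budget // 2)
        tail = f.read()
    return (head.decode("utf-8", errors="replace")
            + f"\n... [{size - budget} bytes omitted] ...\n"
            + tail.decode("utf-8", errors="replace"))
```

Sampling both ends matters for logs and append-only files, where the newest (often most informative) records sit at the tail.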

Seen in

Last updated · 200 distilled / 1,178 read