CONCEPT Cited by 1 source
Heterogeneous data formats¶
Definition¶
Heterogeneous data formats are the real-world data mix agents must handle: CSV + JSON + markdown + plain text + logs + binary — structured, semi-structured, and unstructured — often sitting together in the same working directory. The named failure mode this concept is scaffolded against:
"Prior data-science agents have a major issue — their heavy reliance on well-structured data, like CSV files in relational databases. This limited focus ignores the valuable information contained in the diverse and heterogeneous data formats, such as JSON, unstructured text, and markdown files, that are common in real-world applications."
— Google Research, DS-STAR post, 2025-11-06 (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).
Why it's a load-bearing concept for agents¶
Schema-inspection primitives (pandas describe(), SQL INFORMATION_
SCHEMA, JSON schema validators) assume a target format and emit
typed summaries only for that format. For directories with a mix of
formats, no single pre-existing primitive suffices; the agent has to
inspect each file's format and emit per-format summaries — the
data file analysis primitive.
The practical cost of getting this wrong is quantified by DS-STAR's ablation: removing the Data File Analyzer (which targets heterogeneous formats directly) collapses DABStep hard-task accuracy from 45.2 % to 26.98 % — a ~40 % relative degradation on tasks that, by DABStep's definition, require multiple files (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).
Format classes agents encounter in practice¶
- Structured tabular. CSV, TSV, Parquet, Excel. Columns + dtypes
- row count are the summary primitives.
- Semi-structured. JSON, YAML, XML. Key paths + value types + array shapes + sample values.
- Semi-structured text. Markdown, reStructuredText. Heading tree + section lengths + link / image counts.
- Unstructured text. Free-form .txt, logs, email dumps. Length
- language detection + representative samples.
- Binary. Images, audio, model weights. Metadata (dimensions, sample rate, file magic) rather than content summary.
A robust agent must have summary templates for each class, or (DS-STAR approach) generate per-file inspection code on demand.
Seen in¶
- sources/2025-11-06-google-ds-star-versatile-data-science-agent — named as the problem class DS-STAR's Data File Analyzer targets, with DABStep hard tasks as the evaluation surface for multi-file heterogeneous-data reasoning.
Related¶
- concepts/data-file-analysis — the agent primitive that targets heterogeneous-format handling.
- systems/ds-star — the canonical agent instance designed around this concept.