Google Research — DS-STAR: A state-of-the-art versatile data science agent¶
Summary¶
Google Research introduces DS-STAR — a data-science agent built from four specialised LLM sub-agents (Data File Analyzer, Planner, Coder, Verifier) plus a Router, arranged in an iterative plan-refinement loop that emulates how an expert analyst works in a notebook: write a step, inspect the intermediate result, decide whether to add a new step or fix an existing one, repeat. DS-STAR achieves top-rank performance on the DABStep public leaderboard (as of 2025-09-18) and sets a new state-of-the-art on DABStep, KramaBench, and DA-Code against AutoGen and DA-Agent baselines. The two load-bearing architectural ideas are (1) an up-front Data File Analyzer agent that emits textual descriptions of every file in the working directory — structured and unstructured — so the Planner has rich context for heterogeneous data formats (CSV, JSON, markdown, unstructured text), and (2) an LLM-based judge (the Verifier) that scores the current plan's sufficiency at each step, with a Router agent deciding whether the next action is add a step or fix an existing step — the ablation shows the add-vs-fix decision, not just adding, is what makes the loop work.
Key takeaways¶
- Four-agent decomposition plus a Router is the architectural shape. "The Planner agent first creates a high-level plan, which the Coder agent then transforms into a code script. Subsequently, the Verifier agent evaluates the code's effectiveness in solving the problem. The Verifier agent is an LLM-based judge prompted to determine if the current plan is adequate. If the judge finds the plan insufficient, DS-STAR refines it by altering or adding steps (determined by the Router agent) and then repeats the cycle." (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). This is a canonical instance of patterns/planner-coder-verifier-router-loop — sibling of patterns/specialized-agent-decomposition from the Storex / Dash / AWS Strands lineage, but with a verification gate in the inner loop rather than routing across independent domain specialists.
- The Data File Analyzer is essential, not cosmetic. Ablation Variant 1 (no analyzer descriptions) collapses DABStep hard-task accuracy from 45.2 % → 26.98 % — "underscoring the importance of rich data context for effective planning and implementation" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). The analyzer is implemented as a Python script the agent itself writes and runs, extracting "key information" from each file — not a hard-coded schema inspector. Canonical wiki instance of concepts/data-file-analysis as an agent primitive.
- The Verifier is an LLM judge operating on plans, not on outputs. "The Verifier agent is an LLM-based judge prompted to determine if the current plan is adequate." Extends concepts/llm-as-judge to a new axis — plan-sufficiency scoring inside the generation loop (rather than post-hoc trajectory / answer scoring inside an eval harness). The judge is load-bearing because open-ended data-science problems "lack ground-truth labels, making it difficult to verify if an agent's reasoning is correct" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).
- Correcting wrong steps beats piling on more steps. Ablation Variant 2 removes the Router (so the system only adds new steps sequentially). "This demonstrated that it is more effective to correct mistakes in a plan than to keep adding potentially flawed steps" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). The add-or-fix branch is the distinguishing primitive over a naive plan-extend loop — see patterns/planner-coder-verifier-router-loop.
- Hard tasks need more rounds; the budget is bounded. The loop terminates at satisfaction or 10 refinement rounds. Empirically, hard tasks on DABStep average 5.6 rounds; easy tasks average 3.0; "over half of the easy tasks were completed in just a single round" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). Canonical numeric instance of a refinement-round budget with task-difficulty-conditioned iteration counts.
- State-of-the-art across three data-science benchmarks, not just one. Versus the best prior baseline (AutoGen / DA-Agent): DABStep 41.0 % → 45.2 % (+4.2), KramaBench 39.8 % → 44.7 % (+4.9), DA-Code 37.0 % → 38.5 % (+1.5); top rank on DABStep public leaderboard as of 2025-09-18 (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). The "strong advantage in hard tasks that require processing multiple, heterogeneous data files" is the empirical payoff of the Data File Analyzer + iterative refinement combination.
- The framework is LLM-swappable. Tested with both Gemini-2.5-Pro (default) and GPT-5 as the base model. GPT-5 slightly better on easy tasks; Gemini-2.5-Pro better on hard tasks — but both work, demonstrating the architecture is not a Gemini-specific trick (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). Echoes the cross-model-portability result Dropbox Dash obtained for its LLM judge via DSPy retargeting — the architectural primitive is the portable thing, not the underlying model.
- Expert-analyst-in-notebook is the named mental model. "DS-STAR uses a method that mimics how an expert analyst uses tools like Google colab to build a plan sequentially, reviewing intermediate results before proceeding" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). The architecture is explicitly derived from human analyst workflow — incremental, verification-gated, with backtracking.
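The refinement loop described in the takeaways above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the callable interfaces (`planner`, `coder`, `execute`, `verifier`, `router`) and their signatures are assumptions, standing in for the actual LLM-backed sub-agents.

```python
MAX_ROUNDS = 10  # refinement-round budget stated in the post


def ds_star_loop(task, planner, coder, execute, verifier, router,
                 max_rounds=MAX_ROUNDS):
    """Sketch of DS-STAR's iterative plan-refinement loop.

    Assumed interfaces (illustrative only):
      planner(task, plan, slot) -> step text (slot=None: draft a new step;
                                              slot=i: rewrite step i)
      coder(task, plan)         -> executable script for the current plan
      execute(script)           -> intermediate result
      verifier(task, plan, result) -> bool: is the plan sufficient?
      router(task, plan, result)   -> ("add", None) or ("fix", i)
    """
    plan = [planner(task, [], None)]      # Planner drafts the first step
    result = None
    for _ in range(max_rounds):
        script = coder(task, plan)        # Coder turns the plan into code
        result = execute(script)          # run it, inspect intermediate output
        if verifier(task, plan, result):  # Verifier: LLM judge on the plan
            break                         # plan deemed sufficient -> stop
        action, slot = router(task, plan, result)  # Router: add or fix?
        if action == "fix":
            plan[slot] = planner(task, plan, slot)   # correct a wrong step
        else:
            plan.append(planner(task, plan, None))   # extend the plan
    return plan, result
```

The `fix` branch is the piece Ablation Variant 2 removes; without it the loop degenerates into the extend-only variant the post shows is weaker.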
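The Verifier's plan-sufficiency judgment can likewise be sketched as an LLM judge over the plan rather than the output. The prompt wording and the `llm(prompt) -> str` interface below are assumptions — the post does not disclose the actual Verifier prompt.

```python
# Hypothetical Verifier prompt; the real one is not published in the post.
VERIFIER_PROMPT = """\
You are reviewing a data-analysis plan.
Question: {task}
Data context: {context}
Plan so far:
{plan}
Latest execution output:
{result}
Is this plan sufficient to fully answer the question?
Answer with exactly one word: SUFFICIENT or INSUFFICIENT."""


def plan_sufficient(llm, task, context, plan, result):
    """LLM-as-judge over the *plan*, not a post-hoc answer grader."""
    prompt = VERIFIER_PROMPT.format(
        task=task,
        context=context,
        plan="\n".join(f"{i + 1}. {s}" for i, s in enumerate(plan)),
        result=result)
    verdict = llm(prompt).strip().upper()
    return verdict.startswith("SUFFICIENT")
```

This is the load-bearing move for open-ended problems without ground-truth labels: the judge gates each refinement round instead of scoring a finished trajectory.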
Systems extracted¶
- systems/ds-star — the agent framework itself.
- systems/dabstep — the primary benchmark; DS-STAR ranked #1 on its public leaderboard as of 2025-09-18.
- systems/autogen — one of the two named baselines (Microsoft's multi-agent conversation framework).
- DA-Agent — second baseline, referenced as the DA-Code benchmark's accompanying agent; no dedicated wiki page created (no independent architectural content in the post beyond the comparison).
- KramaBench, DA-Code — additional benchmarks; no dedicated wiki pages (benchmark references only).
- Gemini-2.5-Pro, GPT-5 — base LLMs used; neither gets a dedicated wiki page (the post treats them as swappable substrate, not the subject).
Concepts extracted¶
- concepts/iterative-plan-refinement — plan-then-verify-then-fix-or-extend, repeated until sufficient or budget exhausted. DS-STAR is the canonical wiki instance.
- concepts/data-file-analysis — agent primitive of scanning the working directory and emitting rich textual file-format descriptions before planning.
- concepts/heterogeneous-data-formats — the problem class DS-STAR's analyzer is pitched against (CSV, JSON, markdown, unstructured text).
- concepts/refinement-round-budget — the bounded-iteration discipline of a judge-gated agent loop.
- concepts/llm-as-judge — extended with the plan-judgment axis (judge scores a plan's sufficiency inside the generation loop, not post-hoc on outputs).
Patterns extracted¶
- patterns/planner-coder-verifier-router-loop — the four-agent-plus-router architectural shape, with the add-or-fix branch as the distinguishing primitive over a naive extend-only loop.
- patterns/specialized-agent-decomposition — DS-STAR as an inner-loop verification-gated instance of this pattern, sibling of the Storex / Dash / AWS Strands per-domain instances.
Operational numbers¶
| Metric | Value | Scope |
|---|---|---|
| DABStep hard-task accuracy, full system | 45.2 % | vs AutoGen/DA-Agent |
| DABStep hard-task accuracy, no analyzer (Variant 1) | 26.98 % | ablation |
| DABStep improvement over best baseline | 41.0 % → 45.2 % (+4.2) | benchmark |
| KramaBench improvement over best baseline | 39.8 % → 44.7 % (+4.9) | benchmark |
| DA-Code improvement over best baseline | 37.0 % → 38.5 % (+1.5) | benchmark |
| Max refinement rounds | 10 | loop budget |
| Avg rounds on hard tasks | 5.6 | DABStep |
| Avg rounds on easy tasks | 3.0 | DABStep |
| Easy tasks completed in 1 round | >50 % | DABStep |
| DABStep public leaderboard rank | #1 (as of 2025-09-18) | leaderboard |
Caveats¶
- Raw capture is extremely thin. The locally-saved raw markdown contains only the "In-depth analysis" ablation paragraphs; the motivation, architecture walk-through, evaluation table, and round-count figure live in the original blog-post body, which was retrieved in-session and is quoted here verbatim with the URL cited. Wiki pages reflect what the full post verifiably contains.
- Backing paper not ingested. The paper at arXiv 2509.21825 likely contains the Verifier prompt, Router decision logic, per-agent latency + token cost, production-deployment disclosure, and per-benchmark decomposition of the improvement. None are in the blog post.
- Ablation table numbers are only partially available in the prose. The post narrates Variant 1 and Variant 2 qualitatively but publishes the full ablation table only as an image ("DS-STAR - table"). Exact Variant 2 accuracies and the Gemini-2.5-Pro vs GPT-5 per-task-difficulty numbers are in the image, not the prose.
- No production-deployment disclosure. The post positions DS-STAR as research; no Google Cloud productisation status, no customer reference, no cost / latency / throughput numbers at scale. DABStep ranking is on a public leaderboard, not a production workload.
- "Hard task" is DABStep's definition, not a universal one. Hard = requires multiple data files; easy = single file. The refinement-round budget (5.6 vs 3.0 rounds) is specific to that decomposition and shouldn't be transplanted to other task-difficulty framings without recalibration.
- The Router is itself an LLM, not a hard-coded classifier. The post describes it as an "agent" with the add-or-fix decision responsibility, but doesn't disclose its prompt, its failure modes, or its cost vs. the Planner / Coder / Verifier agents.
Source¶
- Original: https://research.google/blog/ds-star-a-state-of-the-art-versatile-data-science-agent/
- Raw markdown: raw/google/2025-11-06-ds-star-a-state-of-the-art-versatile-data-science-agent-0f91b035.md
- Backing paper: arXiv 2509.21825
- DABStep leaderboard: huggingface.co/spaces/adyen/DABstep
Related¶
- systems/ds-star — the agent framework.
- systems/dabstep — the primary benchmark DS-STAR ranks #1 on.
- systems/autogen — baseline comparator.
- concepts/iterative-plan-refinement — the inner-loop discipline.
- concepts/llm-as-judge — the Verifier agent's primitive; extended here with the plan-sufficiency axis.
- concepts/data-file-analysis — the ablation-critical preprocessing primitive.
- concepts/heterogeneous-data-formats — the problem class.
- concepts/refinement-round-budget — the bounded-iteration discipline.
- patterns/planner-coder-verifier-router-loop — the four-agent architectural shape.
- patterns/specialized-agent-decomposition — the parent pattern DS-STAR instantiates with inner-loop verification.
- companies/google — the author.