GOOGLE 2025-11-06 Tier 1

Google Research — DS-STAR: A state-of-the-art versatile data science agent

Summary

Google Research introduces DS-STAR — a data-science agent built from four specialised LLM sub-agents (Data File Analyzer, Planner, Coder, Verifier) plus a Router, arranged in an iterative plan-refinement loop that emulates how an expert analyst works in a notebook: write a step, inspect the intermediate result, decide whether to add a new step or fix an existing one, repeat. DS-STAR achieves top-rank performance on the DABStep public leaderboard (as of 2025-09-18) and sets a new state-of-the-art on DABStep, KramaBench, and DA-Code against AutoGen and DA-Agent baselines. The two load-bearing architectural ideas are (1) an up-front Data File Analyzer agent that emits textual descriptions of every file in the working directory — structured and unstructured — so the Planner has rich context for heterogeneous data formats (CSV, JSON, markdown, unstructured text), and (2) an LLM-based judge (the Verifier) that scores the current plan's sufficiency at each step, with a Router agent deciding whether the next action is add a step or fix an existing step — the ablation shows the add-vs-fix decision, not just adding, is what makes the loop work.
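The control flow described above can be sketched as a single loop over the four sub-agents plus the Router. This is a minimal sketch, not Google's implementation: the post publishes no prompts or APIs, so every name here (`agents`, `execute`, the role keys, the `State` fields) is a hypothetical stand-in for an LLM call.

```python
from dataclasses import dataclass, field

MAX_ROUNDS = 10  # refinement-round budget reported in the post


@dataclass
class State:
    """Hypothetical shared state passed between the sub-agents."""
    descriptions: dict                            # Data File Analyzer output, per file
    plan: list = field(default_factory=list)      # ordered natural-language plan steps
    results: list = field(default_factory=list)   # intermediate execution outputs


def ds_star(task, files, agents, execute):
    """Sketch of the DS-STAR refinement loop. `agents` maps role names to
    callables standing in for the LLM sub-agents (all assumptions)."""
    # 1. Data File Analyzer: textual description of every file, up front.
    state = State(descriptions={f: agents["analyze"](f) for f in files})
    # 2. Planner seeds the plan with an initial step.
    state.plan.append(agents["plan"](task, state))
    for rounds in range(1, MAX_ROUNDS + 1):
        # 3. Coder turns the current plan into a script; run it, keep the output.
        state.results.append(execute(agents["code"](task, state)))
        # 4. Verifier: an LLM judge scoring *plan sufficiency*, not the final answer.
        if agents["verify"](task, state) == "sufficient":
            break
        # 5. Router: add a new step or fix an existing one. Per the ablation,
        #    this add-vs-fix branch is what makes the loop work.
        action = agents["route"](task, state)  # ("add",) or ("fix", step_index)
        if action[0] == "add":
            state.plan.append(agents["plan"](task, state))
        else:
            state.plan[action[1]] = agents["fix"](task, state, action[1])
    return state, rounds
```

Termination mirrors the post's numbers: the loop exits at Verifier satisfaction or after 10 rounds, whichever comes first.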

Key takeaways

  1. Four-agent decomposition plus a Router is the architectural shape. "The Planner agent first creates a high-level plan, which the Coder agent then transforms into a code script. Subsequently, the Verifier agent evaluates the code's effectiveness in solving the problem. The Verifier agent is an LLM-based judge prompted to determine if the current plan is adequate. If the judge finds the plan insufficient, DS-STAR refines it by altering or adding steps (determined by the Router agent) and then repeats the cycle." (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). This is a canonical instance of patterns/planner-coder-verifier-router-loop — sibling of patterns/specialized-agent-decomposition from the Storex / Dash / AWS Strands lineage, but with a verification gate in the inner loop rather than routing across independent domain specialists.
  2. The Data File Analyzer is essential, not cosmetic. Ablation Variant 1 (no analyzer descriptions) collapses DABStep hard-task accuracy from 45.2 % → 26.98 %, "underscoring the importance of rich data context for effective planning and implementation" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). The analyzer is implemented as a Python script the agent itself writes and runs, extracting "key information" from each file — not a hard-coded schema inspector. Canonical wiki instance of concepts/data-file-analysis as an agent primitive.
  3. The Verifier is an LLM judge operating on plans, not on outputs. "The Verifier agent is an LLM-based judge prompted to determine if the current plan is adequate." Extends concepts/llm-as-judge to a new axis — plan-sufficiency scoring inside the generation loop (rather than post-hoc trajectory / answer scoring inside an eval harness). The judge is load-bearing because open-ended data-science problems "lack ground-truth labels, making it difficult to verify if an agent's reasoning is correct" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).
  4. Correcting wrong steps beats piling on more steps. Ablation Variant 2 removes the Router (so the system only adds new steps sequentially). "This demonstrated that it is more effective to correct mistakes in a plan than to keep adding potentially flawed steps" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). The add-or-fix branch is the distinguishing primitive over a naive plan-extend loop — see patterns/planner-coder-verifier-router-loop.
  5. Hard tasks need more rounds; the budget is bounded. The loop terminates at satisfaction or 10 refinement rounds. Empirically, hard tasks on DABStep average 5.6 rounds; easy tasks average 3.0; "over half of the easy tasks were completed in just a single round" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). Canonical numeric instance of a refinement-round budget with task-difficulty-conditioned iteration counts.
  6. State-of-the-art across three data-science benchmarks, not just one. Versus the best prior baseline (AutoGen / DA-Agent): DABStep 41.0 % → 45.2 % (+4.2), KramaBench 39.8 % → 44.7 % (+4.9), DA-Code 37.0 % → 38.5 % (+1.5); top rank on DABStep public leaderboard as of 2025-09-18 (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). The "strong advantage in hard tasks that require processing multiple, heterogeneous data files" is the empirical payoff of the Data File Analyzer + iterative refinement combination.
  7. The framework is LLM-swappable. Tested with both Gemini-2.5-Pro (default) and GPT-5 as the base model. GPT-5 slightly better on easy tasks; Gemini-2.5-Pro better on hard tasks — but both work, demonstrating the architecture is not a Gemini-specific trick (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). Echoes the cross-model-portability result Dropbox Dash obtained for its LLM judge via DSPy retargeting — the architectural primitive is the portable thing, not the underlying model.
  8. Expert-analyst-in-notebook is the named mental model. "DS-STAR uses a method that mimics how an expert analyst uses tools like Google colab to build a plan sequentially, reviewing intermediate results before proceeding" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). The architecture is explicitly derived from human analyst workflow — incremental, verification-gated, with backtracking.
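Takeaway 2's analyzer primitive can be illustrated with a hand-written stand-in. In DS-STAR the analyzer script is generated and run by the agent itself; the heuristics below (suffix dispatch, sample rows, preview length) are illustrative assumptions, not the published behaviour.

```python
import csv
import json
from pathlib import Path


def describe_file(path, sample_rows=3):
    """Stand-in for one Data File Analyzer output: a short textual
    description of a single file, structured or unstructured."""
    p = Path(path)
    text = p.read_text(errors="replace")
    if p.suffix == ".json":
        obj = json.loads(text)
        keys = sorted(obj) if isinstance(obj, dict) else f"list[{len(obj)}]"
        return f"{p.name}: JSON, top-level {keys}"
    if p.suffix == ".csv":
        rows = list(csv.reader(text.splitlines()))
        head = rows[0] if rows else []
        return (f"{p.name}: CSV, {max(len(rows) - 1, 0)} data rows, "
                f"columns {head}, sample {rows[1:1 + sample_rows]}")
    # Fall back to a plain-text preview for markdown / unstructured files.
    preview = " ".join(text.split())[:120]
    return f"{p.name}: text ({len(text)} chars), starts: {preview!r}"
```

Running this over every file in the working directory and feeding the strings to the Planner is the "rich data context" whose removal drives the Variant 1 accuracy collapse.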

Systems extracted

  • systems/ds-star — the agent framework itself.
  • systems/dabstep — the primary benchmark; DS-STAR ranked #1 on its public leaderboard at 2025-09-18.
  • systems/autogen — one of the two named baselines (Microsoft's multi-agent conversation framework).
  • DA-Agent — second baseline, referenced as the DA-Code benchmark's accompanying agent; no dedicated wiki page created (no independent architectural content in the post beyond the comparison).
  • KramaBench, DA-Code — additional benchmarks; no dedicated wiki pages (benchmark references only).
  • Gemini-2.5-Pro, GPT-5 — base LLMs used; neither gets a dedicated wiki page (the post treats them as swappable substrate, not the subject).

Concepts extracted

  • concepts/data-file-analysis — up-front, agent-generated textual description of every file in the working directory (see takeaway 2).
  • concepts/llm-as-judge — extended to plan-sufficiency scoring inside the generation loop (see takeaway 3).

Patterns extracted

  • patterns/planner-coder-verifier-router-loop — iterative plan refinement with a verification gate and an add-vs-fix branch in the inner loop (see takeaways 1 and 4).
  • patterns/specialized-agent-decomposition — the sibling pattern from the Storex / Dash / AWS Strands lineage, referenced for contrast (see takeaway 1).

Operational numbers

| Metric | Value | Scope |
| --- | --- | --- |
| DABStep hard-task accuracy, full system | 45.2 % | vs AutoGen / DA-Agent |
| DABStep hard-task accuracy, no analyzer (Variant 1) | 26.98 % | ablation |
| DABStep improvement over best baseline | 41.0 % → 45.2 % (+4.2) | benchmark |
| KramaBench improvement over best baseline | 39.8 % → 44.7 % (+4.9) | benchmark |
| DA-Code improvement over best baseline | 37.0 % → 38.5 % (+1.5) | benchmark |
| Max refinement rounds | 10 | loop budget |
| Avg rounds on hard tasks | 5.6 | DABStep |
| Avg rounds on easy tasks | 3.0 | DABStep |
| Easy tasks completed in 1 round | >50 % | DABStep |
| DABStep public leaderboard rank | #1 (as of 2025-09-18) | leaderboard |

Caveats

  • Raw capture is extremely thin. The locally-saved raw markdown contains only the "In-depth analysis" ablation paragraphs; the motivation, architecture walk-through, evaluation table, and round-count figure live in the original blog post body, retrieved in-session and quoted here with the URL cited verbatim. Wiki pages reflect what the full post verifiably contains.
  • Backing paper not ingested. The paper at arXiv 2509.21825 likely contains the Verifier prompt, Router decision logic, per-agent latency + token cost, production-deployment disclosure, and per-benchmark decomposition of the improvement. None are in the blog post.
  • Ablation table numbers are partially redacted in the prose. The post narrates Variant 1 and Variant 2 qualitatively but publishes the full ablation table only as an image ("DS-STAR - table"). Exact Variant 2 accuracies and the Gemini-2.5-Pro vs GPT-5 per-task-difficulty numbers are in the image, not the prose.
  • No production-deployment disclosure. The post positions DS-STAR as research; no Google Cloud productisation status, no customer reference, no cost / latency / throughput numbers at scale. DABStep ranking is on a public leaderboard, not a production workload.
  • "Hard task" is DABStep's definition, not a universal one. Hard = requires multiple data files; easy = single file. The refinement-round budget (5.6 vs 3.0 rounds) is specific to that decomposition and shouldn't be transplanted to other task-difficulty framings without recalibration.
  • The Router is itself an LLM, not a hard-coded classifier. The post describes it as an "agent" with the add-or-fix decision responsibility, but doesn't disclose its prompt, its failure modes, or its cost vs. the Planner / Coder / Verifier agents.

Source

  • sources/2025-11-06-google-ds-star-versatile-data-science-agent