GOOGLE 2025-11-06 Tier 1

Google Research — DS-STAR: A state-of-the-art versatile data science agent

Summary

Google Research introduces DS-STAR — a data-science agent built from four specialised LLM sub-agents (Data File Analyzer, Planner, Coder, Verifier) plus a Router, arranged in an iterative plan-refinement loop that emulates how an expert analyst works in a notebook: write a step, inspect the intermediate result, decide whether to add a new step or fix an existing one, repeat. DS-STAR achieves top-rank performance on the DABStep public leaderboard (as of 2025-09-18) and sets a new state-of-the-art on DABStep, KramaBench, and DA-Code against AutoGen and DA-Agent baselines. The two load-bearing architectural ideas are (1) an up-front Data File Analyzer agent that emits textual descriptions of every file in the working directory — structured and unstructured — so the Planner has rich context for heterogeneous data formats (CSV, JSON, markdown, unstructured text), and (2) an LLM-based judge (the Verifier) that scores the current plan's sufficiency at each step, with a Router agent deciding whether the next action is add a step or fix an existing step — the ablation shows the add-vs-fix decision, not just adding, is what makes the loop work.
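The control flow described above can be sketched as a single loop over the four sub-agents plus the Router. This is a minimal sketch, not Google's implementation: the post publishes no prompts or APIs, so every name here (`agents`, `execute`, the role keys, the `State` fields) is a hypothetical stand-in for an LLM call.

```python
from dataclasses import dataclass, field

MAX_ROUNDS = 10  # refinement-round budget reported in the post


@dataclass
class State:
    """Hypothetical shared state passed between the sub-agents."""
    descriptions: dict                            # Data File Analyzer output, per file
    plan: list = field(default_factory=list)      # ordered natural-language plan steps
    results: list = field(default_factory=list)   # intermediate execution outputs


def ds_star(task, files, agents, execute):
    """Sketch of the DS-STAR refinement loop. `agents` maps role names to
    callables standing in for the LLM sub-agents (all assumptions)."""
    # 1. Data File Analyzer: textual description of every file, up front.
    state = State(descriptions={f: agents["analyze"](f) for f in files})
    # 2. Planner seeds the plan with an initial step.
    state.plan.append(agents["plan"](task, state))
    for rounds in range(1, MAX_ROUNDS + 1):
        # 3. Coder turns the current plan into a script; run it, keep the output.
        state.results.append(execute(agents["code"](task, state)))
        # 4. Verifier: an LLM judge scoring *plan sufficiency*, not the final answer.
        if agents["verify"](task, state) == "sufficient":
            break
        # 5. Router: add a new step or fix an existing one. Per the ablation,
        #    this add-vs-fix branch is what makes the loop work.
        action = agents["route"](task, state)  # ("add",) or ("fix", step_index)
        if action[0] == "add":
            state.plan.append(agents["plan"](task, state))
        else:
            state.plan[action[1]] = agents["fix"](task, state, action[1])
    return state, rounds
```

Termination mirrors the post's numbers: the loop exits at Verifier satisfaction or after 10 rounds, whichever comes first.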

Key takeaways

  1. Four-agent decomposition plus a Router is the architectural shape. "The Planner agent first creates a high-level plan, which the Coder agent then transforms into a code script. Subsequently, the Verifier agent evaluates the code's effectiveness in solving the problem. The Verifier agent is an LLM-based judge prompted to determine if the current plan is adequate. If the judge finds the plan insufficient, DS-STAR refines it by altering or adding steps (determined by the Router agent) and then repeats the cycle." (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). This is a canonical instance of patterns/planner-coder-verifier-router-loop — sibling of patterns/specialized-agent-decomposition from the Storex / Dash / AWS Strands lineage, but with a verification gate in the inner loop rather than routing across independent domain specialists.
  2. The Data File Analyzer is essential, not cosmetic. Ablation Variant 1 (no analyzer descriptions) collapses DABStep hard-task accuracy from 45.2 % → 26.98 %, "underscoring the importance of rich data context for effective planning and implementation" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). The analyzer is implemented as a Python script the agent itself writes and runs, extracting "key information" from each file — not a hard-coded schema inspector. Canonical wiki instance of concepts/data-file-analysis as an agent primitive.
  3. The Verifier is an LLM judge operating on plans, not on outputs. "The Verifier agent is an LLM-based judge prompted to determine if the current plan is adequate." Extends concepts/llm-as-judge to a new axis — plan-sufficiency scoring inside the generation loop (rather than post-hoc trajectory / answer scoring inside an eval harness). The judge is load-bearing because open-ended data-science problems "lack ground-truth labels, making it difficult to verify if an agent's reasoning is correct" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).
  4. Correcting wrong steps beats piling on more steps. Ablation Variant 2 removes the Router (so the system only adds new steps sequentially). "This demonstrated that it is more effective to correct mistakes in a plan than to keep adding potentially flawed steps" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). The add-or-fix branch is the distinguishing primitive over a naive plan-extend loop — see patterns/planner-coder-verifier-router-loop.
  5. Hard tasks need more rounds; the budget is bounded. The loop terminates at satisfaction or 10 refinement rounds. Empirically, hard tasks on DABStep average 5.6 rounds; easy tasks average 3.0; "over half of the easy tasks were completed in just a single round" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). Canonical numeric instance of a refinement-round budget with task-difficulty-conditioned iteration counts.
  6. State-of-the-art across three data-science benchmarks, not just one. Versus the best prior baseline (AutoGen / DA-Agent): DABStep 41.0 % → 45.2 % (+4.2), KramaBench 39.8 % → 44.7 % (+4.9), DA-Code 37.0 % → 38.5 % (+1.5); top rank on DABStep public leaderboard as of 2025-09-18 (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). The "strong advantage in hard tasks that require processing multiple, heterogeneous data files" is the empirical payoff of the Data File Analyzer + iterative refinement combination.
  7. The framework is LLM-swappable. Tested with both Gemini-2.5-Pro (default) and GPT-5 as the base model. GPT-5 slightly better on easy tasks; Gemini-2.5-Pro better on hard tasks — but both work, demonstrating the architecture is not a Gemini-specific trick (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). Echoes the cross-model-portability result Dropbox Dash obtained for its LLM judge via DSPy retargeting — the architectural primitive is the portable thing, not the underlying model.
  8. Expert-analyst-in-notebook is the named mental model. "DS-STAR uses a method that mimics how an expert analyst uses tools like Google colab to build a plan sequentially, reviewing intermediate results before proceeding" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent). The architecture is explicitly derived from human analyst workflow — incremental, verification-gated, with backtracking.
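Takeaway 2's analyzer primitive can be illustrated with a hand-written stand-in. In DS-STAR the analyzer script is generated and run by the agent itself; the heuristics below (suffix dispatch, sample rows, preview length) are illustrative assumptions, not the published behaviour.

```python
import csv
import json
from pathlib import Path


def describe_file(path, sample_rows=3):
    """Stand-in for one Data File Analyzer output: a short textual
    description of a single file, structured or unstructured."""
    p = Path(path)
    text = p.read_text(errors="replace")
    if p.suffix == ".json":
        obj = json.loads(text)
        keys = sorted(obj) if isinstance(obj, dict) else f"list[{len(obj)}]"
        return f"{p.name}: JSON, top-level {keys}"
    if p.suffix == ".csv":
        rows = list(csv.reader(text.splitlines()))
        head = rows[0] if rows else []
        return (f"{p.name}: CSV, {max(len(rows) - 1, 0)} data rows, "
                f"columns {head}, sample {rows[1:1 + sample_rows]}")
    # Fall back to a plain-text preview for markdown / unstructured files.
    preview = " ".join(text.split())[:120]
    return f"{p.name}: text ({len(text)} chars), starts: {preview!r}"
```

Running this over every file in the working directory and feeding the strings to the Planner is the "rich data context" whose removal drives the Variant 1 accuracy collapse.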

Systems extracted

  • systems/ds-star — the agent framework itself.
  • systems/dabstep — the primary benchmark; DS-STAR ranked #1 on its public leaderboard at 2025-09-18.
  • systems/autogen — one of the two named baselines (Microsoft's multi-agent conversation framework).
  • DA-Agent — second baseline, referenced as the DA-Code benchmark's accompanying agent; no dedicated wiki page created (no independent architectural content in the post beyond the comparison).
  • KramaBench, DA-Code — additional benchmarks; no dedicated wiki pages (benchmark references only).
  • Gemini-2.5-Pro, GPT-5 — base LLMs used; neither gets a dedicated wiki page (the post treats them as swappable substrate, not the subject).

Concepts extracted

  • concepts/data-file-analysis — up-front, agent-generated textual description of every file in the working directory (see takeaway 2).
  • concepts/llm-as-judge — extended to plan-sufficiency scoring inside the generation loop (see takeaway 3).

Patterns extracted

  • patterns/planner-coder-verifier-router-loop — iterative plan refinement with a verification gate and an add-vs-fix branch in the inner loop (see takeaways 1 and 4).
  • patterns/specialized-agent-decomposition — the sibling pattern from the Storex / Dash / AWS Strands lineage, referenced for contrast (see takeaway 1).

Operational numbers

| Metric | Value | Scope |
| --- | --- | --- |
| DABStep hard-task accuracy, full system | 45.2 % | vs AutoGen / DA-Agent |
| DABStep hard-task accuracy, no analyzer (Variant 1) | 26.98 % | ablation |
| DABStep improvement over best baseline | 41.0 % → 45.2 % (+4.2) | benchmark |
| KramaBench improvement over best baseline | 39.8 % → 44.7 % (+4.9) | benchmark |
| DA-Code improvement over best baseline | 37.0 % → 38.5 % (+1.5) | benchmark |
| Max refinement rounds | 10 | loop budget |
| Avg rounds on hard tasks | 5.6 | DABStep |
| Avg rounds on easy tasks | 3.0 | DABStep |
| Easy tasks completed in 1 round | >50 % | DABStep |
| DABStep public leaderboard rank | #1 (as of 2025-09-18) | leaderboard |

Caveats

  • Raw capture is extremely thin. The locally-saved raw markdown contains only the "In-depth analysis" ablation paragraphs; the motivation, architecture walk-through, evaluation table, and round-count figure live in the original blog post body, retrieved in-session and quoted here with the URL cited verbatim. Wiki pages reflect what the full post verifiably contains.
  • Backing paper not ingested. The paper at arXiv 2509.21825 likely contains the Verifier prompt, Router decision logic, per-agent latency + token cost, production-deployment disclosure, and per-benchmark decomposition of the improvement. None are in the blog post.
  • Ablation table numbers are partially redacted in the prose. The post narrates Variant 1 and Variant 2 qualitatively but publishes the full ablation table only as an image ("DS-STAR - table"). Exact Variant 2 accuracies and the Gemini-2.5-Pro vs GPT-5 per-task-difficulty numbers are in the image, not the prose.
  • No production-deployment disclosure. The post positions DS-STAR as research; no Google Cloud productisation status, no customer reference, no cost / latency / throughput numbers at scale. DABStep ranking is on a public leaderboard, not a production workload.
  • "Hard task" is DABStep's definition, not a universal one. Hard = requires multiple data files; easy = single file. The refinement-round budget (5.6 vs 3.0 rounds) is specific to that decomposition and shouldn't be transplanted to other task-difficulty framings without recalibration.
  • The Router is itself an LLM, not a hard-coded classifier. The post describes it as an "agent" with the add-or-fix decision responsibility, but doesn't disclose its prompt, its failure modes, or its cost vs. the Planner / Coder / Verifier agents.

Source

  • sources/2025-11-06-google-ds-star-versatile-data-science-agent