
DS-STAR

DS-STAR (introduced by Google Research, 2025-11-06) is a versatile data-science agent built from four specialised LLM sub-agents (Data File Analyzer, Planner, Coder, Verifier) plus a Router, arranged in an iterative plan-refinement loop. It achieves state-of-the-art results on the DABStep, KramaBench, and DA-Code benchmarks, ranking #1 on the DABStep public leaderboard as of 2025-09-18 (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).

Architecture

Two-stage operation:

Stage 1 — Data context extraction. The Data File Analyzer agent writes and executes a Python script that scans every file in the working directory (CSV, JSON, markdown, unstructured text) and emits a textual summary of each file's structure and contents. This summary is the grounding context for the rest of the loop. Ablation: remove the analyzer and DABStep hard-task accuracy collapses 45.2 % → 26.98 % — the analyzer is load-bearing, not cosmetic (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).
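The analyzer's job can be sketched as a directory scan that emits one textual summary per file. A minimal illustrative version (the actual script is LLM-generated per task and not published; `summarize_file` / `extract_data_context` are hypothetical names):

```python
import csv
import json
from pathlib import Path

def summarize_file(path: Path, max_chars: int = 300) -> str:
    """Return a short textual description of one data file's structure."""
    if path.suffix == ".csv":
        with path.open(newline="") as f:
            rows = list(csv.reader(f))
        header = rows[0] if rows else []
        return f"CSV: {max(len(rows) - 1, 0)} data rows, columns {header}"
    if path.suffix == ".json":
        data = json.loads(path.read_text())
        shape = list(data)[:10] if isinstance(data, dict) else f"array of {len(data)} items"
        return f"JSON: top level {shape}"
    # Markdown and unstructured text: fall back to a truncated raw preview.
    return f"Text preview: {path.read_text()[:max_chars]!r}"

def extract_data_context(workdir: str) -> str:
    """Concatenate per-file summaries into one grounding-context string."""
    parts = []
    for path in sorted(Path(workdir).iterdir()):
        if path.is_file():
            parts.append(f"## {path.name}\n{summarize_file(path)}")
    return "\n\n".join(parts)
```

The resulting string is what grounds the Planner, Coder, and Verifier in Stage 2.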

Stage 2 — Plan / implement / verify loop.

  1. Planner produces a high-level plan.
  2. Coder transforms the plan into a Python script and executes it.
  3. Verifier — an LLM-based judge — inspects the code + intermediate outputs and decides whether the plan is adequate.
  4. If adequate: return the final code as the solution.
  5. If not: the Router decides whether to add a new step or fix an existing step, and the loop repeats.
  6. Loop terminates at satisfaction or 10 refinement rounds.
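The six steps above reduce to a small control loop. A sketch with the four roles injected as callables (hypothetical interfaces; the post does not publish the agents' prompts or signatures):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Plan = List[str]

@dataclass
class DsStarLoop:
    # Each role is a swappable callable, mirroring the LLM-swappable substrate.
    planner: Callable[[str], Plan]                        # data context -> initial plan
    coder: Callable[[Plan], str]                          # plan -> execution output
    verifier: Callable[[Plan, str], bool]                 # is the plan adequate?
    router: Callable[[Plan, str], Tuple[str, int, str]]   # ("add"|"fix", index, new step)
    max_rounds: int = 10                                  # hard refinement ceiling

    def run(self, context: str) -> Plan:
        plan = self.planner(context)
        for _ in range(self.max_rounds):
            output = self.coder(plan)
            if self.verifier(plan, output):
                break                      # adequate: the current plan/code is the answer
            action, idx, step = self.router(plan, output)
            if action == "add":
                plan.append(step)          # extend the plan with a new step
            else:
                plan[idx] = step           # repair an existing step in place
        return plan
```

The add-vs-fix branch in the `else` arm is the Router primitive the ablations identify as load-bearing.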

The add-or-fix branch is the architecturally distinguishing primitive. Ablation Variant 2 removes the Router (forcing extend-only); hard-task and easy-task performance both degrade — "it is more effective to correct mistakes in a plan than to keep adding potentially flawed steps" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).

Design stance

  • Mimic the expert analyst in a notebook. "DS-STAR uses a method that mimics how an expert analyst uses tools like Google colab to build a plan sequentially, reviewing intermediate results before proceeding." Incremental, verification-gated, with backtracking — not a one-shot plan-and-generate.
  • Verify plans, not just outputs. The Verifier judges plan sufficiency at each step, not only the final answer. This is the only viable check for open-ended data-science problems that "lack ground-truth labels, making it difficult to verify if an agent's reasoning is correct" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).
  • Support heterogeneous data. The Data File Analyzer is the deliberate answer to the named limitation of prior data-science agents: "heavy reliance on well-structured data, like CSV files in relational databases. This limited focus ignores the valuable information contained in the diverse and heterogeneous data formats, such as JSON, unstructured text, and markdown files, that are common in real-world applications."
  • LLM-swappable substrate. Tested with Gemini-2.5-Pro (default) and GPT-5; both work. GPT-5 slightly better on easy tasks, Gemini-2.5-Pro better on hard tasks — "indicating the framework's generalizability."

Benchmark results

Benchmark                    Best prior baseline   DS-STAR               Gain (pp)
DABStep (overall)            41.0 %                45.2 %                +4.2
KramaBench                   39.8 %                44.7 %                +4.9
DA-Code                      37.0 %                38.5 %                +1.5
DABStep public leaderboard   —                     #1 as of 2025-09-18   —

Baselines: AutoGen and DA-Agent.

Refinement-round behaviour (DABStep)

Task difficulty      Avg rounds   1-round completions
Easy (single-file)   3.0          >50 %
Hard (multi-file)    5.6          —
Budget ceiling       10           —

The round-count distribution directly measures refinement-round budget behaviour: iteration depth scales with task difficulty, under a hard ceiling.

Agents

  • Data File Analyzer — writes + executes a Python file-summarisation script; emits rich textual descriptions of every file in the working directory. Ablation-critical (concepts/data-file-analysis).
  • Planner — emits high-level plans (step sequences).
  • Coder — transforms plan into executable Python, runs it, captures intermediate results.
  • Verifier — LLM-based judge; scores the current plan's sufficiency after each Coder execution.
  • Router — decides add-step vs. fix-step on Verifier rejection. The add-or-fix primitive, not just the router label, is load-bearing.

Wiki framing

  • Instantiates patterns/planner-coder-verifier-router-loop — the canonical architectural shape: plan → implement → verify → add-or-fix → repeat, with the Verifier as an LLM judge on plans and the Router as the explicit add-vs-fix decision point.
  • Instance of patterns/specialized-agent-decomposition — but with inner-loop verification rather than cross-domain routing. Sibling shape of Storex (per-domain specialists), Dash (classifier-routed sub-agents), and AWS Strands (three-agent K8s split); DS-STAR is the verification-gated inner-loop variant, not the cross-domain variant.
  • Extends concepts/llm-as-judge along a new axis: plan-sufficiency scoring inside the generation loop, vs. the earlier wiki instances which scored agent trajectories or outputs in eval / harness contexts post-hoc.
  • Adds a canonical numeric anchor to concepts/refinement-round-budget — 3.0 vs 5.6 average rounds with a 10-round ceiling, >50 % single-round completions on easy tasks.

Caveats

  • Paper-mediated depth. Verifier prompt, Router decision logic, per-agent latency / token cost, and production-deployment status are in the backing paper (arXiv 2509.21825), not in the blog post.
  • Research, not a product. No Google Cloud productisation announced; DABStep ranking is leaderboard, not production workload.
  • Benchmark-shaped evaluation. All three benchmarks are data-science task suites; generalisation to arbitrary analytics workflows is asserted but not demonstrated in the post.