
DABStep

DABStep is a public benchmark for data-science agents, hosted as a Hugging Face Space by Adyen with a public leaderboard. It appears on the sysdesign-wiki as the primary evaluation surface for Google Research's DS-STAR agent, which ranked #1 on the DABStep leaderboard as of 2025-09-18 (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).

What it evaluates

DABStep scores agents on data-science tasks that require processing multiple, heterogeneous data files — CSV, JSON, markdown, unstructured text — rather than only well-structured tabular data. Tasks are split into:

  • Easy tasks — the answer is contained in a single file.
  • Hard tasks — the answer requires joining or reasoning across multiple files.

This split is the canonical difficulty axis that DS-STAR's round-count analysis is conditioned on (3.0 average rounds on easy tasks; 5.6 on hard).
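The easy/hard split reduces to whether a task's answer depends on one file or several. A minimal sketch of that classification rule, with illustrative file names (the function and its inputs are hypothetical, not part of the DABStep harness):

```python
def difficulty(task_files: list[str]) -> str:
    """Classify a DABStep-style task by how many data files its answer needs.

    Easy: the answer is contained in a single file.
    Hard: the answer requires joining or reasoning across multiple files.
    """
    return "easy" if len(task_files) <= 1 else "hard"

# Illustrative usage with made-up file names:
print(difficulty(["payments.csv"]))               # easy
print(difficulty(["payments.csv", "fees.json"]))  # hard
```

The actual benchmark fixes this label per task; the sketch only makes the single-file vs. multi-file criterion concrete.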

Reference numbers from DS-STAR (2025-11-06)

Metric                                          Value
Best prior baseline (AutoGen / DA-Agent)        41.0 %
DS-STAR, full system                            45.2 %
DS-STAR, no Data File Analyzer (Variant 1)      26.98 % (hard tasks)
DS-STAR public leaderboard rank (2025-09-18)    #1

The 26.98 % number is informative beyond DS-STAR itself: it sets a floor for hard-task DABStep performance when an agent lacks rich data context up front, and therefore serves as a rough benchmark anchor for any competitor that chooses to plan without a file-inspection pre-step.

The DS-STAR post names two sibling benchmarks; both are benchmark references only on this wiki (no dedicated pages):

  • KramaBench — data-wrangling benchmark; DS-STAR: 39.8 % → 44.7 %.
  • DA-Code — multi-source data-science tasks; DS-STAR: 37.0 % → 38.5 %.

Caveats

  • The full DABStep task taxonomy, scoring rubric, and per-category weights are not documented in the DS-STAR blog post; consult the Hugging Face Space and its paper (arXiv 2506.23719) for the specification.
  • Leaderboard rank is a public-leaderboard snapshot, not a production or in-situ performance metric.
