DABStep
DABStep is a public benchmark for data-science agents, hosted as a Hugging Face Space by Adyen with a public leaderboard. It appears on the sysdesign-wiki as the primary evaluation surface for Google Research's DS-STAR agent, which ranked #1 on the DABStep leaderboard as of 2025-09-18 (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).
What it evaluates
DABStep scores agents on data-science tasks that require processing multiple, heterogeneous data files — CSV, JSON, markdown, unstructured text — rather than only well-structured tabular data. Tasks are split into:
- Easy tasks — the answer is contained in a single file.
- Hard tasks — the answer requires joining or reasoning across multiple files.
This easy/hard split is the difficulty axis on which DS-STAR's round-count analysis is conditioned: DS-STAR averages 3.0 rounds on easy tasks and 5.6 rounds on hard tasks.
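A per-difficulty breakdown like the one above can be sketched as follows. This is a minimal illustration, not DABStep's scorer or DS-STAR's analysis code: the record keys (`level`, `correct`, `rounds`) and the toy records are hypothetical and chosen only to show the grouping.

```python
from collections import defaultdict


def summarize_by_difficulty(results):
    """Group per-task results by difficulty; report accuracy and mean rounds.

    `results` is a list of dicts with hypothetical keys:
    'level' ('easy' or 'hard'), 'correct' (bool), 'rounds' (int).
    """
    buckets = defaultdict(list)
    for record in results:
        buckets[record["level"]].append(record)

    summary = {}
    for level, rows in buckets.items():
        n = len(rows)
        summary[level] = {
            "n": n,
            "accuracy": sum(r["correct"] for r in rows) / n,
            "avg_rounds": sum(r["rounds"] for r in rows) / n,
        }
    return summary


# Made-up records purely for illustration (not DABStep data).
toy_results = [
    {"level": "easy", "correct": True, "rounds": 2},
    {"level": "easy", "correct": True, "rounds": 4},
    {"level": "hard", "correct": False, "rounds": 6},
    {"level": "hard", "correct": True, "rounds": 5},
]

summary = summarize_by_difficulty(toy_results)
for level, stats in sorted(summary.items()):
    print(level, stats)
```

Reporting the two levels separately matters because an aggregate score can hide a weak hard-task result behind a strong easy-task one.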
Reference numbers from DS-STAR (2025-11-06)
| Metric | Value |
|---|---|
| Best prior baseline (AutoGen / DA-Agent), accuracy | 41.0 % |
| DS-STAR, full system, accuracy | 45.2 % |
| DS-STAR without Data File Analyzer (Variant 1), hard-task accuracy | 26.98 % |
| DS-STAR public leaderboard rank (2025-09-18) | #1 |
The 26.98 % figure is informative beyond DS-STAR itself: it indicates roughly what hard-task DABStep accuracy looks like when an agent lacks rich data context up front. It is a single ablation result rather than a strict lower bound, but it serves as a rough anchor for any competitor that plans without a file-inspection pre-step.
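To make "file-inspection pre-step" concrete, here is a minimal sketch of one possible shape for it. It is not DS-STAR's Data File Analyzer, whose implementation is not described on this page. The function names, the summary fields, and the toy files are all assumptions for illustration, and only the Python standard library is used.

```python
import csv
import json
import tempfile
from pathlib import Path


def describe_file(path: Path, max_chars: int = 200) -> dict:
    """Return a small, format-aware summary of one data file (hypothetical schema)."""
    suffix = path.suffix.lower()
    info = {"name": path.name, "kind": suffix.lstrip(".") or "unknown"}
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        info["columns"] = rows[0] if rows else []
        info["n_rows"] = max(len(rows) - 1, 0)  # exclude the header row
    elif suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8"))
        if isinstance(data, dict):
            info["top_level_keys"] = sorted(data)
        elif isinstance(data, list):
            info["n_items"] = len(data)
    else:
        # Markdown and other unstructured text: keep only a short preview.
        text = path.read_text(encoding="utf-8", errors="replace")
        info["preview"] = text[:max_chars]
    return info


def describe_directory(root: Path) -> list:
    """Summarize every regular file directly under `root`, in name order."""
    return [describe_file(p) for p in sorted(root.iterdir()) if p.is_file()]


# Toy files invented for this example; they are not DABStep files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "orders.csv").write_text("id,amount\n1,10\n2,20\n", encoding="utf-8")
    (root / "rates.json").write_text('[{"id": 1}, {"id": 2}, {"id": 3}]', encoding="utf-8")
    (root / "notes.md").write_text("# Notes\nRate rules go here.\n", encoding="utf-8")
    summaries = describe_directory(root)

for entry in summaries:
    print(entry)
```

An agent could place summaries of this kind in its context before planning, so that the plan is written against real column names and keys rather than guessed ones.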
Related benchmarks
The DS-STAR post names two sibling benchmarks. Both appear on this wiki only as references, with no dedicated pages:
- KramaBench: data-wrangling benchmark; best prior baseline 39.8 % → DS-STAR 44.7 %.
- DA-Code: multi-source data-science tasks; best prior baseline 37.0 % → DS-STAR 38.5 %.
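The gains behind these pairs can be worked out directly. The sketch below reads each pair as best prior baseline followed by DS-STAR, matching the DABStep table on this page, and reports both the absolute gain in percentage points and the gain relative to the baseline. The `gains` helper is a hypothetical name used only here.

```python
# (best prior baseline %, DS-STAR %) pairs as quoted on this page.
reported = {
    "DABStep": (41.0, 45.2),
    "KramaBench": (39.8, 44.7),
    "DA-Code": (37.0, 38.5),
}


def gains(baseline: float, new: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute = new - baseline
    relative = absolute / baseline * 100
    return round(absolute, 1), round(relative, 1)


gain_table = {name: gains(b, n) for name, (b, n) in reported.items()}
for name, (abs_pp, rel_pct) in gain_table.items():
    print(f"{name}: +{abs_pp} pp ({rel_pct}% relative)")
# DABStep: +4.2 pp (10.2% relative)
# KramaBench: +4.9 pp (12.3% relative)
# DA-Code: +1.5 pp (4.1% relative)
```

The two views differ in emphasis: a gain of a few percentage points over a baseline near 40 % is a relative improvement of roughly a tenth, so it helps to state which one is meant when comparing across benchmarks.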
Caveats
- The full DABStep task taxonomy, scoring rubric, and per-category weights are not documented in the DS-STAR blog post; consult the Hugging Face Space and its paper (arXiv 2506.23719) for the specification.
- Leaderboard rank is a public-leaderboard snapshot, not a production or in-situ performance metric.
Seen in
- sources/2025-11-06-google-ds-star-versatile-data-science-agent — primary source on this wiki; DS-STAR's headline result set.
Related
- systems/ds-star — #1-ranked agent as of 2025-09-18.
- systems/autogen — named baseline.
- concepts/heterogeneous-data-formats — the problem class DABStep evaluates on.
- companies/google — DS-STAR's author.