
DABStep

DABStep is a public benchmark for data-science agents, hosted as a Hugging Face Space by Adyen with a public leaderboard. It appears on the sysdesign-wiki as the primary evaluation surface for Google Research's DS-STAR agent, which ranked #1 on the DABStep leaderboard as of 2025-09-18 (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).

What it evaluates

DABStep scores agents on data-science tasks that require processing multiple, heterogeneous data files — CSV, JSON, markdown, unstructured text — rather than only well-structured tabular data. Tasks are split into:

  • Easy tasks — the answer is contained in a single file.
  • Hard tasks — the answer requires joining or reasoning across multiple files.

This split is the canonical difficulty axis that DS-STAR's round-count analysis is conditioned on (3.0 average rounds on easy tasks; 5.6 on hard).
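The easy/hard split reduces to whether a task's answer depends on one file or several. A minimal sketch of that classification rule, with illustrative file names (the function and its inputs are hypothetical, not part of the DABStep harness):

```python
def difficulty(task_files: list[str]) -> str:
    """Classify a DABStep-style task by how many data files its answer needs.

    Easy: the answer is contained in a single file.
    Hard: the answer requires joining or reasoning across multiple files.
    """
    return "easy" if len(task_files) <= 1 else "hard"

# Illustrative usage with made-up file names:
print(difficulty(["payments.csv"]))               # easy
print(difficulty(["payments.csv", "fees.json"]))  # hard
```

The actual benchmark fixes this label per task; the sketch only makes the single-file vs. multi-file criterion concrete.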

Reference numbers from DS-STAR (2025-11-06)

Metric                                          Value
Best prior baseline (AutoGen / DA-Agent)        41.0 %
DS-STAR, full system                            45.2 %
DS-STAR, no Data File Analyzer (Variant 1)      26.98 % (hard tasks)
DS-STAR public leaderboard rank (2025-09-18)    #1

The 26.98 % number is informative beyond DS-STAR itself: it sets a floor for hard-task DABStep performance when an agent lacks rich data context up front, and therefore serves as a rough benchmark anchor for any competitor that chooses to plan without a file-inspection pre-step.

The DS-STAR post names two sibling benchmarks; both are benchmark references only on this wiki (no dedicated pages):

  • KramaBench — data-wrangling benchmark; DS-STAR: 39.8 % → 44.7 %.
  • DA-Code — multi-source data-science tasks; DS-STAR: 37.0 % → 38.5 %.

Caveats

  • The full DABStep task taxonomy, scoring rubric, and per-category weights are not documented in the DS-STAR blog post; consult the Hugging Face Space and its paper (arXiv 2506.23719) for the specification.
  • Leaderboard rank is a public-leaderboard snapshot, not a production or in-situ performance metric.
