Abstract: This paper introduces the human-curated Pandas-PlotBench dataset, designed to evaluate language models’ effectiveness as assistants in visual data exploration. Our benchmark focuses on ...