
VisPlotBench

Overview

VisPlotBench is a standardized benchmark designed to evaluate visualization coding agents across multiple programming languages. It covers eight visualization languages and includes 888 diverse visualization tasks. Each task pairs a natural language instruction with its corresponding rendered visual and is annotated with both a Visual Category and a Subtype, spanning a total of 13 categories. This design enables fine-grained analysis of model capabilities in understanding, generating, and correcting visualization code across symbolic, declarative, and procedural paradigms.

[Figure: Overview of VisPlotBench]

Overview of VisPlotBench. The benchmark covers eight visualization languages and contains 888 diverse visualization tasks, each combining a natural language instruction and a rendered visual. Tasks are annotated with a Visual Category and a Subtype, spanning 13 categories in total.
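For concreteness, a single task can be pictured as a small record like the sketch below. The field names are illustrative assumptions for this page, not the benchmark's actual schema.

from dataclasses import dataclass

@dataclass
class VisPlotTask:
    # Illustrative sketch of one VisPlotBench task; field names are assumptions.
    task_id: str
    language: str           # one of the eight visualization languages
    instruction: str        # natural language description of the target visual
    reference_code: str     # executable code producing the ground-truth render
    reference_image: bytes  # the rendered visual paired with the instruction
    visual_category: str    # one of the 13 Visual categories, e.g. "Bars"
    subtype: str            # finer-grained annotation within the category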

Existing visualization benchmarks are narrow in scope: most cover a single language and a handful of chart families, and none support iterative debugging. VisPlotBench fills these gaps with 888 tasks across eight languages and 13 Visual categories. The taxonomy spans common families such as Bars, Lines, and Scatter, while adding rarely represented ones such as Hierarchies, Music, and Networks & Flows. Each task combines a natural language instruction, executable code, and a rendered output, enabling execution-grounded evaluation. With its execute–render–score protocol and multi-round self-debug loop, VisPlotBench provides the first systematic benchmark for assessing visualization coding agents across languages and task types.
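The execute–render–score protocol with its self-debug loop can be sketched as follows. All callables here (generate, execute_and_render, score_visual) are assumed stand-ins for illustration, not the benchmark's actual API.

def evaluate_task(task, generate, execute_and_render, score_visual,
                  max_debug_rounds=3):
    """Sketch of an execute-render-score loop with multi-round self-debugging.

    Assumed callables (stand-ins, not the benchmark's API):
      generate(instruction, prior_code=None, error=None) -> code string
      execute_and_render(code, language) -> (ok, image, error_message)
      score_visual(image, reference_image) -> score in [0, 100]
    """
    code = generate(task.instruction)
    for round_idx in range(max_debug_rounds + 1):
        ok, image, error = execute_and_render(code, task.language)
        if ok:
            # Only successfully rendered outputs are scored.
            return score_visual(image, task.reference_image)
        if round_idx < max_debug_rounds:
            # Self-debug round: feed the execution error back to the model.
            code = generate(task.instruction, prior_code=code, error=error)
    return 0.0  # no round produced a render; the task counts as a failure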

Comparison with Existing Benchmarks

The table above positions VisPlotBench among representative benchmarks along four dimensions: language coverage, visual categories, self-debug support, and dataset size. Earlier resources remain narrow, focusing on Python or Vega-Lite with limited chart types and no iterative debugging. VisCoder introduced self-debugging for PandasPlotBench; VisPlotBench generalizes this to eight languages, expands coverage to 13 categories (including Hierarchies, Music, and Networks & Flows), and standardizes evaluation for systematic cross-language assessment.

Experiment Results

Main Results

We evaluate both proprietary and open-source models on VisPlotBench to compare execution reliability across parameter scales, programming languages, and evaluation modes. Proprietary references include GPT-4.1 and its lighter variant GPT-4.1-mini, while open-source baselines include DeepSeek-Coder, DeepSeek-CoderV2, Qwen2.5-Coder, and VisCoder. Our VisCoder2 models are trained on VisCode-Multi-679K using Qwen2.5-Coder backbones at 3B, 7B, 14B, and 32B scales.

[Table: Main results]

Overall execution pass rate (%) of selected models on the VisPlotBench benchmark. The best-performing model in each scale is shown in bold, and the second best is underlined.
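For reference, the headline metric reduces to a simple ratio. The list-of-booleans input below is an assumed representation, one flag per task indicating whether the generated code executed and rendered without error.

def exec_pass_rate(executed_ok):
    # executed_ok: one boolean per benchmark task (assumed representation).
    # Returns the overall execution pass rate as a percentage.
    return 100.0 * sum(executed_ok) / len(executed_ok)

# e.g. exec_pass_rate([True, True, False, True]) -> 75.0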

Task and Visual Score Analysis

We analyze Task Score and Visual Score on three representative languages that highlight different behaviors: LaTeX illustrates execution–semantics mismatch, LilyPond shows the largest gains on symbolic grammars, and SVG exposes model–library sensitivity where semantic and perceptual signals diverge. Results for all languages and scales are provided in the appendix.

[Table: Task and Visual Score analysis]

Performance of selected languages on the VisPlotBench benchmark. For each model, we report (1) execution pass rate (Exec Pass), (2) mean visual and task scores (Mean), and (3) the proportion of samples scoring at least 75 (Good). The best-performing model in each scale is shown in bold, and the second best is underlined.
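The Mean and Good columns can be reproduced from per-task scores as in this minimal sketch; the plain list of 0–100 scores is an assumed input format.

def summarize_scores(scores, good_threshold=75):
    # scores: per-task Task or Visual scores on a 0-100 scale (assumed input).
    mean = sum(scores) / len(scores)
    # Share of samples scoring at least the threshold (the "Good" column).
    good_pct = 100.0 * sum(s >= good_threshold for s in scores) / len(scores)
    return mean, good_pct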

Error Analysis

To examine the error recovery behavior of VisCoder-7B, we analyze how execution error counts transition before and after self-debugging. The table below summarizes four representative error types, grouped by plotting library. Each entry shows the count before and after debugging (e.g., 15 → 2).

[Table: Execution error transitions]

Execution error transitions for VisCoder-7B across four representative error types. Values show changes from the initial to post-debugging state. Structural issues (e.g., AttributeError) are often resolved, while semantic failures (e.g., KeyError) persist.
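The before → after counts in the table can be tallied as below. The lists of (library, error_type) pairs are an assumed representation of failing samples before and after the self-debug rounds.

from collections import Counter

def error_transitions(errors_before, errors_after):
    # Inputs: lists of (library, error_type) pairs, one per failing sample
    # (assumed representation). Output mirrors the table's "15 → 2" entries.
    pre, post = Counter(errors_before), Counter(errors_after)
    return {key: f"{pre[key]} → {post[key]}" for key in pre}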

Case Study Examples

BibTeX


@article{ni2025viscoder2,
  title={VisCoder2: Building Multi-Language Visualization Coding Agents},
  author={Ni, Yuansheng and Cai, Songcheng and Chen, Xiangchao and Liang, Jiarong and Lyu, Zhiheng and Deng, Jiaqi and Zou, Kai and Nie, Ping and Yuan, Fei and Yue, Xiang and others},
  journal={arXiv preprint arXiv:2510.23642},
  year={2025}
}

@article{ni2025viscoder,
  title={VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation},
  author={Ni, Yuansheng and Nie, Ping and Zou, Kai and Yue, Xiang and Chen, Wenhu},
  journal={arXiv preprint arXiv:2506.03930},
  year={2025}
}