VisPlotBench
Overview
VisPlotBench is a standardized benchmark designed to evaluate visualization coding agents across multiple programming languages. It covers eight visualization languages and includes 888 diverse visualization tasks. Each task pairs a natural language instruction with its corresponding rendered visual and is annotated with both a Visual Category and a Subtype, spanning a total of 13 categories. This design enables fine-grained analysis of model capabilities in understanding, generating, and correcting visualization code across symbolic, declarative, and procedural paradigms.
Overview of VisPlotBench. The benchmark covers eight visualization languages and contains 888 diverse visualization tasks, each combining a natural language instruction and a rendered visual. Tasks are annotated with a Visual category and a Subtype, spanning 13 categories in total.
Existing visualization benchmarks are narrow in scope: most cover a single language, few chart families, and no iterative debugging. VisPlotBench fills these gaps with 888 tasks across eight languages and 13 Visual categories. The taxonomy spans common families such as Bars, Lines, and Scatter, while adding rarely represented ones like Hierarchies, Music, and Networks & Flows. Each task combines a natural language instruction, executable code, and a rendered output, enabling execution-grounded evaluation. With its execute–render–score protocol and multi-round self-debug loop, VisPlotBench provides the first systematic benchmark for assessing visualization coding agents across languages and task types.
The table above positions VisPlotBench among representative benchmarks across four dimensions: language coverage, visual categories, self-debug support, and dataset size. Earlier resources remain narrow, focusing on Python or Vega-Lite with limited chart types and no iterative debugging. VisCoder introduced self-debugging for PandasPlotBench, while VisPlotBench generalizes this to eight languages, expands coverage to 13 categories (including Hierarchies, Music, and Networks & Flows), and standardizes evaluation for systematic cross-language assessment.
Data Collection and Curation
We curate a high-quality pool of visualization tasks from multiple open datasets and repositories, ensuring broad coverage across both general-purpose and domain-specific visualization frameworks. Each example is verified for executability, correct rendering, and consistent natural language pairing.
Sources include the-stack-v2, svg-diagrams, and CoSyn-400K, spanning languages such as Python, JavaScript, LaTeX, Asymptote, Vega-Lite, and LilyPond.
During curation, invalid, monochrome, or non-visual outputs are filtered, and missing inputs are synthetically reconstructed to ensure each sample executes independently.
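To make this filtering step concrete, the sketch below shows one way such a check could look in Python, assuming Pillow is available; the helper keep_render and the MIN_SIDE and MIN_COLORS thresholds are hypothetical choices for illustration, not the exact criteria used in curation.

```python
from PIL import Image  # Pillow, assumed available for this sketch

MIN_SIDE = 64     # hypothetical threshold: very small renders are treated as non-visual
MIN_COLORS = 3    # hypothetical threshold: too few distinct colors suggests a blank or monochrome render

def keep_render(image_path):
    """Return True if a rendered output looks like a valid, non-trivial visual."""
    try:
        img = Image.open(image_path).convert("RGB")
    except Exception:
        return False                  # unreadable or corrupt render
    width, height = img.size
    if min(width, height) < MIN_SIDE:
        return False                  # too small to be a meaningful figure
    # getcolors() returns None when the image has more than `maxcolors` distinct
    # colors, which is exactly the visually rich case we want to keep.
    colors = img.getcolors(maxcolors=MIN_COLORS)
    return colors is None or len(colors) >= MIN_COLORS
```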
Task Construction
Each task in VisPlotBench combines a runnable visualization script with a natural language prompt that describes its intended output. Tasks are categorized into 13 major visual types, such as statistical plots, geometric diagrams, music scores, network graphs, and typographic layouts, each subdivided into finer-grained subtypes.
The benchmark emphasizes diversity and interpretability: prompts include sufficient context to test an agent’s understanding of syntax, semantics, and rendering logic, rather than mere text-to-code matching. By covering both declarative and procedural paradigms, VisPlotBench evaluates how well models generalize across visualization styles and language conventions.
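For illustration, a task record might be represented as follows; the field names in this sketch are assumptions rather than the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class VisPlotTask:
    """Illustrative schema for one VisPlotBench task (field names are assumptions)."""
    task_id: str
    language: str            # one of the eight visualization languages, e.g. "python" or "vega-lite"
    instruction: str         # natural language prompt describing the intended output
    reference_code: str      # runnable visualization script producing the ground truth
    reference_image: str     # path to the rendered ground-truth visual
    visual_category: str     # one of the 13 Visual categories, e.g. "Bars" or "Music"
    subtype: str             # finer-grained subtype within the category
    inputs: dict = field(default_factory=dict)  # any data files needed for independent execution
```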
Evaluation Protocol
Evaluation follows a unified execute–render–score protocol. Each model-generated code snippet is executed in a sandboxed environment specific to its target language, rendered into an image, and compared against the reference visualization.
Quantitative metrics include execution success rate, structural similarity between generated and reference visuals, and semantic alignment scores derived from visual encoders. This consistent evaluation pipeline ensures comparability across languages, promoting fair benchmarking of multi-language visualization agents.
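A minimal sketch of the execute and render stages is shown below, assuming a per-language runner_cmd that writes its output to output.png in the working directory; the scoring step (structural similarity and encoder-based alignment against the reference image) would operate on the returned render and is omitted here.

```python
import pathlib
import subprocess
import tempfile

def execute_and_render(code, runner_cmd, timeout=120):
    """Execute one generated snippet in an isolated working directory and collect its render.

    `runner_cmd` is a hypothetical per-language command (a Python interpreter, a LaTeX or
    LilyPond compiler wrapper, a headless renderer for HTML/SVG, ...) that is expected to
    write its output to output.png.
    """
    with tempfile.TemporaryDirectory() as workdir:
        src = pathlib.Path(workdir) / "snippet.src"
        src.write_text(code)
        try:
            proc = subprocess.run(
                runner_cmd + [str(src)],
                cwd=workdir, capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, None, "timeout"
        image = pathlib.Path(workdir) / "output.png"
        rendered = image.read_bytes() if image.exists() else None
        # Execution passes only if the process exits cleanly and a render is produced;
        # scoring against the reference visual happens on `rendered` afterwards.
        return proc.returncode == 0 and rendered is not None, rendered, proc.stderr
```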
Experiment Results
Main Results
We evaluate both proprietary and open-source models on VisPlotBench to compare execution reliability across parameter scales, programming languages, and evaluation modes. Proprietary references include GPT-4.1 and its lighter variant GPT-4.1-mini, while open-source baselines include DeepSeek-Coder, DeepSeek-CoderV2, Qwen2.5-Coder, and VisCoder. Our VisCoder2 models are trained on VisCode-Multi-679K using Qwen2.5-Coder backbones at 3B, 7B, 14B, and 32B scales.
Overall execution pass rate (%) of selected models on the VisPlotBench benchmark. The best-performing model in each scale is shown in bold, and the second best is underlined.
Proprietary Models Remain Stronger
GPT-4.1 achieves 63.4% overall, the highest among reference models, and GPT-4.1-mini follows closely. Both perform strongly on standardized declarative or markup languages such as Vega-Lite, SVG, and HTML, all above 84%. In contrast, instruction-tuned open-source models remain far behind. At the 7B scale, Qwen2.5-Coder reaches only 51.2% overall, with fewer than 30% on LaTeX and just 5.5% on LilyPond. Previous VisCoder variants improve Python performance but fail to generalize across languages. These results underline the substantial gap between proprietary and open-source models.
Cross-Language Variation
Performance differs sharply across visualization languages. Vega-Lite and HTML are close to saturation for most models, while Python shows steady gains with scale. By contrast, symbolic and compiler-dependent languages remain the most difficult. Even GPT-4.1 achieves less than 45% on LilyPond and under 25% on Asymptote, and open-source baselines fall much lower. This uneven landscape highlights that progress on symbolic grammars is the key bottleneck for reliable multi-language visualization.
VisCoder2 Advantage
Across all scales, VisCoder2 consistently outperforms size-matched open-source baselines. At 32B, it improves overall execution pass rate by approximately 15 points compared with Qwen2.5-Coder and reaches parity with GPT-4.1. The only consistent shortfall is on SVG, where VisCoder2 trails the strongest baseline by over 10 points. Overall, VisCoder2 is the first open-source model to match proprietary reliability on executable visualization tasks.
Effect of Self-Debug
Iterative correction consistently improves execution reliability across model families and scales. Proprietary models benefit strongly, and VisCoder2 follows the same trend: at larger scales, overall execution rises by nearly ten points when self-debugging is enabled. The effect is especially pronounced for symbolic and compiler-dependent languages such as LilyPond, LaTeX, and Asymptote, where fragile syntax or compilation errors dominate. Self-debugging repairs these shallow but frequent failures, turning previously intractable cases into valid outputs. This demonstrates that feedback-driven refinement is not a marginal improvement but a critical mechanism for tackling the hardest visualization languages.
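The loop itself can be sketched as follows; model.generate, model.repair, and the reuse of the execute_and_render helper from the evaluation sketch are assumptions about the interface, not the benchmark's exact prompting format.

```python
def generate_with_self_debug(model, task, runner_cmd, max_rounds=3):
    """Multi-round self-debug loop: regenerate from the error trace until the code runs."""
    code = model.generate(task.instruction)
    for round_idx in range(max_rounds + 1):
        ok, image, stderr = execute_and_render(code, runner_cmd)
        if ok:
            return code, image, round_idx            # executed and rendered successfully
        if round_idx == max_rounds:
            break                                    # give up after the final debug round
        # Feed the failing code and its error trace back to the model for repair.
        code = model.repair(task.instruction, code, stderr)
    return code, None, max_rounds                    # still failing after the final round
```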
Task and Visual Score Analysis
We analyze Task Score and Visual Score on three representative languages that highlight different behaviors: LaTeX illustrates execution–semantics mismatch, LilyPond shows the largest gains on symbolic grammars, and SVG exposes model–library sensitivity where semantic and perceptual signals diverge. Results for all languages and scales are provided in the appendix.
Performance of selected languages on the VisPlotBench benchmark. For each model, we report (1) execution pass rate (Exec Pass), (2) mean visual and task scores (Mean), and (3) the proportion of samples scoring at least 75 (Good). The best-performing model in each scale is shown in bold, and the second best is underlined.
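For reference, the three statistics in this table can be computed from per-sample results roughly as follows; treating failed samples as contributing zero to the mean and excluding them from Good is an assumption of this sketch.

```python
def summarize(scores):
    """Aggregate per-sample scores into the three statistics reported in the table.

    `scores` has one entry per benchmark sample: a 0-100 score when the code executed,
    or None when execution failed.
    """
    n = len(scores)
    executed = [s for s in scores if s is not None]
    return {
        "exec_pass": 100.0 * len(executed) / n,               # execution pass rate (%)
        "mean": sum(executed) / n,                            # mean score over all samples
        "good": 100.0 * sum(s >= 75 for s in executed) / n,   # share scoring at least 75 (%)
    }
```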
LaTeX: Execution–Semantics Mismatch
Models often capture the intended structure of a figure but fail to compile reliably. For example, GPT-4.1 improves from 31.3% to 66.1% execution pass rate with Self-Debug, while task scores remain around 50 even when execution fails. VisCoder2 raises execution and task scores compared with baselines, but compilation errors remain frequent. This pattern indicates that semantic alignment does not always translate into successful rendering.
LilyPond: Symbolic Grammar Gains
VisCoder2 delivers the clearest advantage on symbolic languages. At 7B, Qwen2.5-Coder executes only 5.5% of tasks, while VisCoder2 reaches 69.1% and further improves with Self-Debug. The proportion of examples with task scores above 75 also increases by more than tenfold. These results show that targeted coverage of symbolic grammars in VisCode-Multi-679K translates directly into reliable generation and semantic adherence.
SVG: Sensitivity to Rendering Libraries
Execution success is high across most models, yet visual scores lag behind task scores. For instance, GPT-4.1 with Self-Debug achieves 95.4% execution and a task score near 90, but the average visual score is below 50. VisCoder2 performs competitively but trails Qwen2.5 on execution at larger scales (81.5% versus 93.9% at 32B). These discrepancies suggest that evaluation on SVG is strongly influenced by library-specific rendering details rather than semantic understanding alone.
Error Analysis
To examine the error recovery behavior of VisCoder-7B, we analyze how execution error counts transition before and after self-debugging. The table below summarizes four representative error types, grouped by plotting library. Each entry shows the count before and after debugging (e.g., 15 → 2).
Execution error transitions for VisCoder-7B across four representative error types. Values show changes from the initial to post-debugging state. Structural issues (e.g., AttributeError) are often resolved, while semantic failures (e.g., KeyError) persist.
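A tabulation of this kind can be produced with a small helper such as the one below; the (initial_error, final_error) pair format is an assumption about how per-sample outcomes are stored, not the analysis code used for the paper.

```python
from collections import Counter

def error_transitions(results):
    """Count execution errors by type before and after self-debug.

    `results` is assumed to be an iterable of (initial_error, final_error) pairs,
    where each element is an exception name such as "AttributeError", or None if
    that stage executed successfully.
    """
    before, after = Counter(), Counter()
    for initial_error, final_error in results:
        if initial_error is not None:
            before[initial_error] += 1
        if final_error is not None:
            after[final_error] += 1
    # An entry such as {"AttributeError": (15, 2)} reads as "15 -> 2" in the table above.
    return {name: (before[name], after[name]) for name in before}
```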
Error Analysis Across Eight Languages
To better understand failure modes across languages, we analyze execution errors before and after self-debug. Many language-specific exceptions, such as FunctionSignatureError in Asymptote or MarkupError in LilyPond, are merged into four broader categories for clarity: Structural Errors (syntax or parsing), Type & Interface Errors (invalid calls or arguments), Semantic / Data Errors (mismatched variables or values), and Runtime / Environment Errors (renderer or package issues). Representative results for VisCoder2-32B are shown below, tracing error transitions from the initial failure to the final self-debug round.
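A possible shape for this merging is a simple lookup table like the sketch below; the specific assignments and the default category are illustrative assumptions rather than the exact mapping used in the analysis.

```python
# Illustrative mapping from raw, language-specific exception names to the four broader
# categories; the assignments below are assumptions for illustration only.
ERROR_CATEGORY = {
    "SyntaxError": "Structural",
    "MarkupError": "Structural",                    # LilyPond
    "FunctionSignatureError": "Type & Interface",   # Asymptote
    "TypeError": "Type & Interface",
    "KeyError": "Semantic / Data",
    "NameError": "Semantic / Data",
    "ImportError": "Runtime / Environment",
    "RenderError": "Runtime / Environment",
}

def categorize(error_name):
    """Fold a raw exception name into one of the four analysis categories.

    Unknown names default to Runtime / Environment, which is itself an assumption.
    """
    return ERROR_CATEGORY.get(error_name, "Runtime / Environment")
```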
Effective Recovery on Structural and Interface Errors
Self-debug effectively reduces shallow errors such as missing tokens or invalid arguments across multiple languages. For example, Python interface errors fall from 13 to 3, and structural errors in LilyPond decrease from 14 to 10. Mermaid and Asymptote show the same trend, with syntax and function signature errors shrinking after correction (Asymptote structural errors drop from 9 to 3). These cases benefit from explicit diagnostic traces, making them relatively easy to fix through iterative feedback.
Persistent Failures in Semantic and Runtime Errors
Errors involving semantics or execution environments remain difficult to resolve. In LaTeX, undefined variables decrease only slightly (28 to 23), and Asymptote variable mismatches improve only marginally (15 to 11). Renderer failures such as Vega-Lite rendering errors (2 to 2) and HTML request failures (3 to 2) often persist across all rounds. These errors require deeper reasoning over symbolic grammars and runtime contexts, which current self-debug protocols cannot fully capture. Symbolic languages and renderer-sensitive environments therefore remain the dominant bottlenecks, pointing to the need for grammar-aware training objectives and more robust runtime integration.
Case Study Examples
Python – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
Python – Self-Debug Recovery: The initial code raises a ValueError and is resolved in the first round of self-debug, resulting in a corrected plot that matches the intended semantics.
Python – Self-Debug Failed: The initial code raises an AttributeError and still fails after three rounds of self-debug.
Vega-Lite – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
Vega-Lite – Self-Debug Recovery: The initial code raises a TypeError and is resolved in the second round of self-debug, resulting in a corrected plot that matches the intended semantics.
Vega-Lite – Self-Debug Failed: The initial code raises a TypeError and still fails after three rounds of self-debug.
LilyPond – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
LilyPond – Self-Debug Recovery: The initial code raises a SyntaxError and is resolved in the first round of self-debug, resulting in a corrected plot that matches the intended semantics.
LilyPond – Self-Debug Failed: The initial code raises a TypeError and still fails after three rounds of self-debug.
Mermaid – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
Mermaid – Self-Debug Recovery: The initial code raises a SyntaxError and is resolved in the second round of self-debug, resulting in a corrected plot that matches the intended semantics.
Mermaid – Self-Debug Failed: The initial code raises an AttributeError and still fails after three rounds of self-debug.
SVG – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
SVG – Self-Debug Recovery: The initial code raises an ExpatError and is resolved in the first round of self-debug, resulting in a corrected plot that matches the intended semantics.
SVG – Self-Debug Failed: The initial code raises a ParseError and still fails after three rounds of self-debug.
LaTeX – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
LaTeX – Self-Debug Recovery: The initial code raises a NameError and is resolved in the second round of self-debug, resulting in a corrected plot that matches the intended semantics.
LaTeX – Self-Debug Failed: The initial code raises a NameError and still fails after three rounds of self-debug.
Asymptote – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
Asymptote – Self-Debug Recovery: The initial code raises a NameError and is resolved in the third round of self-debug, resulting in a corrected plot that matches the intended semantics.
Asymptote – Self-Debug Failed: The initial code raises a TypeError and still fails after three rounds of self-debug.
HTML – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
HTML – Self-Debug Recovery: The initial code raises a ImportError and is resolved in the first round of self-debug, resulting in a corrected plot that matches the intended semantics.
HTML – Self-Debug Failed: The initial code raises a TypeError and still fails after three rounds of self-debug.
BibTeX
@article{ni2025viscoder2,
title={VisCoder2: Building Multi-Language Visualization Coding Agents},
author={Ni, Yuansheng and Cai, Songcheng and Chen, Xiangchao and Liang, Jiarong and Lyu, Zhiheng and Deng, Jiaqi and Zou, Kai and Nie, Ping and Yuan, Fei and Yue, Xiang and others},
journal={arXiv preprint arXiv:2510.23642},
year={2025}
}
@article{ni2025viscoder,
title={VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation},
author={Ni, Yuansheng and Nie, Ping and Zou, Kai and Yue, Xiang and Chen, Wenhu},
journal={arXiv preprint arXiv:2506.03930},
year={2025}
}