Experiment Results
Main Results
We present the main experimental results on PandasPlotBench, including overall model comparisons, performance under the self-debug evaluation protocol, error type analysis, and a training data ablation study.
We evaluate VisCoder models against both proprietary and open-source language models to assess executable visualization performance across scales and libraries. The proprietary group includes GPT-4o and GPT-4o-mini. Among open-source baselines, we compare LLaMA-3.2-3B and LLaMA-3.1-8B, as well as Qwen2.5-Instruct and Qwen2.5-Coder-Instruct at both the 3B and 7B scales. VisCoder models are obtained by fine-tuning Qwen2.5-Coder-Instruct on VisCode-200K, using the same instruction tuning setup at both scales.
Performance of selected models on the PandasPlotBench benchmark. For each model, we report (1) the execution pass rate (Exec Pass), (2) the mean visual and task scores (Mean), and (3) the proportion of samples scoring at least 75 (Good). The best-performing model at each scale is shown in bold, and the second best is underlined.
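For concreteness, the sketch below shows how these three metrics could be computed from per-sample results. The record format and the 0–100 judge score scale are assumptions for illustration, not the benchmark's actual implementation.

```python
# Hypothetical per-sample records: an execution flag plus judge-assigned
# task and visual scores on an assumed 0-100 scale.
results = [
    {"executed": True,  "task_score": 88, "visual_score": 92},
    {"executed": True,  "task_score": 61, "visual_score": 70},
    {"executed": False, "task_score": 0,  "visual_score": 0},
]

def benchmark_metrics(results, good_threshold=75):
    n = len(results)
    return {
        # Exec Pass: fraction of samples whose code ran without error.
        "exec_pass": 100 * sum(r["executed"] for r in results) / n,
        # Mean: average judge scores over all samples.
        "mean_task": sum(r["task_score"] for r in results) / n,
        "mean_visual": sum(r["visual_score"] for r in results) / n,
        # Good: fraction of samples scoring at least the threshold (75).
        "good_task": 100 * sum(r["task_score"] >= good_threshold for r in results) / n,
        "good_visual": 100 * sum(r["visual_score"] >= good_threshold for r in results) / n,
    }

print(benchmark_metrics(results))
```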
Proprietary Models Remain Stronger
Proprietary models outperform open-source models by a wide margin across all plotting libraries. GPT-4o achieves the highest execution pass rates and the strongest judge-based scores, followed by its lightweight variant GPT-4o-mini. These results indicate more reliable execution and better semantic alignment with task instructions, especially in complex visualization settings. In contrast, open-source models like LLaMA and Qwen2.5-Instruct underperform consistently across all metrics.
Plotly Presents a Harder Challenge
Performance varies across plotting libraries. While most models perform reliably on matplotlib and seaborn, results on plotly are significantly lower, especially for open-source models: execution pass rates often drop below 35%, and task and visual scores degrade accordingly. Generated plots frequently fail to reflect the intended semantics or completeness, revealing the challenge posed by plotly's verbose syntax and its sparser API exposure in training corpora.
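To make the contrast concrete, the snippet below draws the same bar chart with matplotlib's stateful interface and with plotly's lower-level graph_objects API. The data and styling are invented for illustration; static export via write_image additionally requires the kaleido package.

```python
import matplotlib.pyplot as plt
import plotly.graph_objects as go

categories = ["A", "B", "C"]
values = [3, 7, 5]

# matplotlib: the stateful pyplot interface keeps simple plots short.
plt.bar(categories, values, color="steelblue")
plt.title("Sales by Category")
plt.xlabel("Category")
plt.ylabel("Sales")
plt.savefig("bar_mpl.png")

# plotly (graph_objects): the figure is assembled from explicit trace and
# layout objects, so even a basic chart requires more structured code.
fig = go.Figure(data=[go.Bar(x=categories, y=values, marker_color="steelblue")])
fig.update_layout(title="Sales by Category",
                  xaxis_title="Category", yaxis_title="Sales")
fig.write_image("bar_plotly.png")  # requires the kaleido package
```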
VisCoder Closes the Open-Source Gap
VisCoder significantly outperforms its Qwen2.5-Coder-Instruct baselines across all libraries. At the 3B scale, it improves both execution success and semantic alignment, especially on plotly and seaborn. At 7B, VisCoder even surpasses GPT-4o-mini on these two libraries, while trailing it slightly on matplotlib. These gains highlight the impact of domain-specific instruction tuning for visualization code generation.
Self-Debug Further Boosts Performance
GPT-4o demonstrates strong self-debugging capabilities, reaching near-perfect execution success within a few attempts. VisCoder also benefits substantially under this evaluation protocol: VisCoder-7B surpasses a 90% execution pass rate on matplotlib and seaborn, with large gains in task and visual scores across correction rounds. These results show VisCoder's ability to generalize debugging behaviors learned during training, even without plot-specific correction examples.
Self-Debug Evaluation
To analyze the dynamics of self-debugging, we track execution pass rates over multiple correction rounds by evaluating GPT-4o and GPT-4o-mini as proprietary baselines, alongside VisCoder models at 3B and 7B scales. To isolate the effects of instruction tuning, we also include untuned Qwen2.5-Coder models at matching sizes. The chart below shows execution pass rates from the initial generation (Attempt 0) through three rounds of self-debugging (Attempts 1–3), presented separately for each plotting library.
Execution pass rate across self-debug rounds (Attempts 0–3), shown separately for the three plotting libraries. Attempt 0 corresponds to the default output, while Attempts 1–3 represent subsequent correction rounds. Model groups are color-coded, with solid and dashed lines distinguishing paired models. VisCoder models improve consistently across rounds, with VisCoder-7B gradually closing the gap to GPT-4o on seaborn. Y-axis ranges are scaled per subplot to match library-specific pass-rate ranges.
Self-Debug Is Broadly Effective
Execution pass rates increase steadily over self-debug rounds for most models and libraries, indicating the overall effectiveness of the protocol. The first attempt typically yields the largest improvement, with smaller gains in subsequent rounds. This pattern suggests that a simple retry mechanism informed by execution feedback can recover a substantial portion of initial failures.
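A minimal sketch of such a retry loop appears below. Here, generate is a hypothetical model-call interface, and details like sandboxing, prompt wording, and timeouts are simplifying assumptions rather than the exact protocol.

```python
import subprocess
import sys
import tempfile

def run_code(code: str):
    """Execute generated code in a subprocess; return None on success,
    or the captured stderr (traceback) on failure."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True)
    return None if proc.returncode == 0 else proc.stderr

def self_debug(generate, instruction: str, max_rounds: int = 3):
    """generate(messages) -> code string is a hypothetical model interface."""
    messages = [{"role": "user", "content": instruction}]
    code = generate(messages)                  # Attempt 0: default generation
    for _ in range(max_rounds):                # Attempts 1-3: correction rounds
        error = run_code(code)
        if error is None:
            return code, True                  # executed successfully
        # Feed the traceback back to the model and ask for a fix.
        messages.append({"role": "assistant", "content": code})
        messages.append({"role": "user",
                         "content": f"The code failed with:\n{error}\nPlease fix it."})
        code = generate(messages)
    return code, run_code(code) is None
```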
VisCoder Yields Stable Behavior
Compared to their Qwen2.5-Coder baselines, VisCoder models show smaller per-round gains but consistently achieve higher final performance. This indicates that VisCoder tends to generate stronger initial outputs and to apply more stable corrections. VisCoder-7B is particularly strong on seaborn, approaching GPT-4o by the final round.
Failures Remain Across Models
Even the strongest model, GPT-4o, does not reach perfect execution after self-debugging. Its performance on seaborn plateaus after three rounds, leaving non-trivial failure cases. In contrast, VisCoder-3B stands out among smaller models, outperforming GPT-4o-mini on seaborn and performing competitively elsewhere. Smaller models generally plateau earlier and with fewer gains.
Error Analysis
To examine the error recovery behavior of VisCoder-7B, we analyze how execution error counts transition before and after self-debugging. The table below summarizes four representative error types, grouped by plotting library. Each entry shows the count before and after debugging (e.g., 15 → 2).
Execution error transitions for VisCoder-7B across four representative error types. Values show changes from the initial to the post-debugging state. Structural issues (e.g., AttributeError) are often resolved, while semantic failures (e.g., KeyError) persist.
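One simple way to build such a table is to bucket failure tracebacks by exception class name, before and after the debug rounds. The regex and placeholder traceback strings below are illustrative assumptions, not the paper's logging code.

```python
import re
from collections import Counter

# The final line of a CPython traceback typically reads "SomeError: message".
ERROR_RE = re.compile(r"^(\w+(?:Error|Exception))\b", re.MULTILINE)

def error_counts(tracebacks):
    """Count the last exception class named in each failure traceback."""
    counts = Counter()
    for tb in tracebacks:
        names = ERROR_RE.findall(tb)
        if names:
            counts[names[-1]] += 1
    return counts

# Placeholder failure logs for one library (not the actual benchmark data).
before = ["AttributeError: 'Axes' object has no attribute 'set_xlabels'"] * 15
after  = ["AttributeError: 'Axes' object has no attribute 'set_xlabels'"] * 2

pre, post = error_counts(before), error_counts(after)
for err in sorted(set(pre) | set(post)):
    print(f"{err}: {pre[err]} -> {post[err]}")  # e.g., AttributeError: 15 -> 2
```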
Effective Recovery from Structural Errors
VisCoder-7B demonstrates strong self-correction ability on shallow, structural errors. AttributeErrors in seaborn are reduced from 15 to 2, and TypeErrors in plotly from 3 to 1. These errors usually stem from invalid method calls or argument mismatches and are easily identified from diagnostic output, so VisCoder learns to correct them consistently through retry-based feedback.
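A fabricated but representative instance of this error class, together with the kind of one-token fix a debug round applies:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 1]})
ax = sns.lineplot(data=df, x="x", y="y")

# Initial generation: Axes has no method `set_xlabels`, so this raises
# AttributeError -- a shallow, structural failure the traceback names directly.
try:
    ax.set_xlabels("x value")
except AttributeError as e:
    print(e)  # 'Axes' object has no attribute 'set_xlabels'

# Debugged version: the corrected call uses the real method name.
ax.set_xlabel("x value")
plt.savefig("line.png")
```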
Persistent Failures in Semantic Execution Errors
Semantic failures such as KeyError and ValueError are harder to resolve. On plotly, ValueErrors drop only slightly (29 → 23), while KeyErrors remain unchanged. These errors require dynamic reasoning about the underlying data structures, yet VisCoder's retry attempts often rely on the same faulty assumptions. Symbolic corrections alone are insufficient for resolving such semantically grounded failures.
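A representative (invented) semantic failure: the model guesses a column name that the dataframe does not contain. Depending on the code path, the bad lookup surfaces as a KeyError (pandas indexing) or a ValueError (plotly express argument validation), and a retry that never inspects df.columns tends to repeat the same wrong guess.

```python
import pandas as pd
import plotly.express as px

df = pd.DataFrame({"Category": ["A", "B"], "Total Sales": [10, 20]})

# The model assumes a column named "sales"; plotly express rejects the
# argument with a ValueError because no such column exists in df.
try:
    fig = px.bar(df, x="Category", y="sales")
except ValueError as e:
    print(e)

# The fix requires knowledge of the actual schema (df.columns), which
# traceback-driven retries alone may never recover.
fig = px.bar(df, x="Category", y="Total Sales")
```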
Case Study
To illustrate model behavior across plotting libraries and demonstrate the effectiveness of self-debugging, we present representative examples from VisCoder-7B. For each library (matplotlib, seaborn, and plotly), we show both successful generations and failure cases recovered through multi-round correction. These cases reflect the model's ability to correct common execution errors such as AttributeError and ValueError, while also highlighting persistent challenges with more semantically grounded failures.
Matplotlib – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
Matplotlib – Self-Debug Recovery: An AttributeError raised during initial generation is corrected in the first debug round, resulting in a valid plot.
Seaborn – Successful Generation: Code executes correctly on the first attempt and produces a semantically aligned plot.
Seaborn – Self-Debug Recovery: An AttributeError is fixed after three rounds of debugging, yielding a corrected and faithful plot.
Plotly – Successful Generation: The model correctly generates and executes a visualization that aligns with expected output.
Plotly – Self-Debug Recovery: A ValueError is corrected in the second debug round, producing a valid final result.
Conclusion
In conclusion, VisCode-200K provides a large-scale instruction-tuning dataset for Python visualization code generation, combining executable plotting examples with multi-turn correction dialogues grounded in runtime feedback. To validate its effectiveness, we evaluate VisCoder models on PandasPlotBench under the default setting, and we additionally propose a self-debug protocol that simulates realistic correction workflows in an extended evaluation mode. Experiments show that VisCoder substantially outperforms strong open-source baselines across execution and alignment metrics, and narrows the gap to proprietary models such as GPT-4o-mini. Gains are particularly pronounced in settings that involve complex visualization structures, such as plotly, and iterative correction through self-debugging.
Ablation studies further demonstrate that structurally diverse, executable training data and feedback-driven supervision contribute to more robust performance across plotting libraries.
Looking forward, this work reinforces the importance of domain-specific instruction tuning and multi-turn correction supervision for building robust and semantically grounded visualization-capable models. Future extensions may explore broader plotting libraries, richer correction supervision, and evaluation methods that measure models' abilities to recover from execution errors.
BibTeX
@article{ni2025viscoder,
  title={VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation},
  author={Ni, Yuansheng and Nie, Ping and Zou, Kai and Yue, Xiang and Chen, Wenhu},
  journal={arXiv preprint arXiv:2506.03930},
  year={2025}
}