
Experiment Results

Main Results

We present the main experimental results on PandasPlotBench, including overall model comparisons, performance under the self-debug evaluation protocol, error type analysis, and a training data ablation study.

We evaluate VisCoder models against both proprietary and open-source language models to assess executable visualization performance across scales and libraries. The proprietary group includes GPT-4o and GPT-4o-mini. Among open-source baselines, we compare LLaMA-3.2-3B, LLaMA-3.1-8B, Qwen2.5-Instruct, and Qwen2.5-Coder-Instruct at both 3B and 7B scales. VisCoder models are fine-tuned on VisCode-200K using the same instruction-tuning setup.

Main Results Table

Performance of selected models on the PandasPlotBench benchmark. For each model, we report (1) execution pass rate (Exec Pass), (2) mean visual and task scores (Mean), and (3) the proportion of samples scoring at least 75 (Good). The best-performing model in each scale is shown in bold, and the second best is underlined.
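
For reference, the three reported quantities can be computed from per-sample results as in the sketch below. The dataframe column names are illustrative assumptions, not the benchmark's actual schema; visual and task scores are judge ratings on a 0-100 scale.

import pandas as pd

def summarize_results(df: pd.DataFrame) -> dict:
    """Compute the three reported metrics from per-sample records.

    Assumed columns (illustrative): 'executed' (bool, code ran without error),
    'visual_score' and 'task_score' (judge scores on a 0-100 scale).
    """
    return {
        "Exec Pass (%)": 100 * df["executed"].mean(),
        "Mean Visual": df["visual_score"].mean(),
        "Mean Task": df["task_score"].mean(),
        "Good Visual (%)": 100 * (df["visual_score"] >= 75).mean(),  # share scoring at least 75
        "Good Task (%)": 100 * (df["task_score"] >= 75).mean(),
    }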

Self-Debug Evaluation

To analyze the dynamics of self-debugging, we track execution pass rates over multiple correction rounds by evaluating GPT-4o and GPT-4o-mini as proprietary baselines, alongside VisCoder models at 3B and 7B scales. To isolate the effects of instruction tuning, we also include untuned Qwen2.5-Coder models at matching sizes. The chart below shows execution pass rates from the initial generation (Attempt 0) through three rounds of self-debugging (Attempts 1–3), presented separately for each plotting library.

Self-Debug Line Plot

Execution pass rate across self-debug rounds (Attempt 0–3), shown separately for three plotting libraries. Attempt 0 corresponds to the default output, while Attempts 1–3 represent subsequent correction rounds. Model groups are color-coded, with solid and dashed lines used to distinguish paired models. VisCoder models improve consistently across rounds, with VisCoder-7B gradually closing the gap to GPT-4o on seaborn. Y-axis ranges are scaled per subplot to match library-specific score distributions.
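
Concretely, the self-debug protocol is an execute-and-retry loop: each failed attempt's traceback is fed back to the model, which is asked to return a corrected script. The sketch below illustrates the idea; the model.generate interface and the feedback template are hypothetical placeholders, not the exact harness used in our evaluation.

import subprocess
import sys
import tempfile

MAX_ROUNDS = 3  # Attempts 1-3 after the initial generation (Attempt 0)

def run_plot_code(code: str) -> str | None:
    """Execute generated plotting code in a subprocess; return the traceback on failure."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=60)
    except subprocess.TimeoutExpired:
        return "TimeoutExpired: execution exceeded 60 seconds"
    return None if result.returncode == 0 else result.stderr

def self_debug(model, task_prompt: str):
    """Hypothetical loop: regenerate with the runtime traceback until the code executes."""
    code = model.generate(task_prompt)        # Attempt 0: default generation
    for attempt in range(MAX_ROUNDS + 1):     # check Attempts 0 through 3
        error = run_plot_code(code)
        if error is None:
            return code, attempt              # executed cleanly at this attempt
        if attempt < MAX_ROUNDS:              # one more correction round available
            feedback = (
                "The previous code failed with the following error:\n"
                f"{error}\n"
                "Please return a corrected version of the full script."
            )
            code = model.generate(task_prompt, history=[code, feedback])
    return code, None                         # still failing after Attempt 3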

Error Analysis

To examine the error recovery behavior of VisCoder-7B, we analyze how execution error counts transition before and after self-debugging. The table below summarizes four representative error types, grouped by plotting library. Each entry shows the count before and after debugging (e.g., 15 → 2).

Error Table

Execution error transitions for VisCoder-7B across four representative error types. Values show changes from the initial to post-debugging state. Structural issues (e.g., AttributeError) are often resolved, while semantic failures (e.g., KeyError) persist.
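
The counts in the table can be reproduced by bucketing each failing sample's traceback by its final exception class. A minimal sketch, assuming per-sample records of the stderr text before and after the last debug round (the record format is an illustrative assumption):

import re
from collections import Counter

def classify_traceback(stderr: str) -> str:
    """Extract the final exception class (e.g., AttributeError, KeyError) from a traceback."""
    matches = re.findall(r"^(\w+(?:Error|Exception))\b", stderr, flags=re.MULTILINE)
    return matches[-1] if matches else "UnknownError"

def error_transitions(records):
    """Tally error types before and after debugging, grouped by plotting library.

    records: iterable of (library, stderr_before, stderr_after) tuples, where
    stderr_after is None if the sample passes after the final debug round.
    """
    before, after = Counter(), Counter()
    for library, err_before, err_after in records:
        before[(library, classify_traceback(err_before))] += 1
        if err_after is not None:  # sample still fails after self-debugging
            after[(library, classify_traceback(err_after))] += 1
    return before, after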

Case Study

To illustrate model behavior across different plotting libraries and demonstrate the effectiveness of self-debugging, we present representative examples from VisCoder-7B. For each library—matplotlib, seaborn, and plotly—we show both successful generations and failure cases recovered through multi-round correction. These cases reflect the model's ability to correct common structural errors such as AttributeError and ValueError, while also highlighting the persistent challenges posed by more semantic failures.
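
The snippet below gives a constructed illustration of such a structural error and its repair; it is not drawn from the benchmark itself.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"category": ["a", "b", "c"], "value": [3, 7, 5]})

# Attempt 0 (fails): seaborn returns a matplotlib Axes, which has no .show() method.
# ax = sns.barplot(data=df, x="category", y="value")
# ax.show()  # AttributeError: 'Axes' object has no attribute 'show'

# After one correction round (passes): draw the plot, then call plt.show() instead.
ax = sns.barplot(data=df, x="category", y="value")
ax.set_title("Value by category")
plt.show()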


Conclusion

VisCode-200K provides a large-scale instruction-tuning dataset for Python visualization code generation, combining executable plotting examples with multi-turn correction dialogues grounded in runtime feedback. To validate its effectiveness, we evaluate VisCoder models on PandasPlotBench under the default setting. We additionally propose a self-debug protocol that simulates realistic correction workflows and assesses model performance in this extended evaluation mode.

Experiments show that VisCoder substantially outperforms strong open-source baselines across execution and alignment metrics, and narrows the gap to proprietary models like GPT-4o-mini. Gains are particularly pronounced in settings that involve complex visualization structures, such as plotly, and iterative correction through self-debugging. Ablation studies further demonstrate that structurally diverse, executable training data and feedback-driven supervision contribute to more robust performance across plotting libraries.

Looking forward, this work reinforces the importance of domain-specific instruction tuning and multi-turn correction supervision for building robust and semantically grounded visualization-capable models. Future extensions may explore broader plotting libraries, richer correction supervision, and evaluation methods that measure models' abilities to recover from execution errors.

BibTeX


@article{ni2025viscoder,
  title={VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation},
  author={Ni, Yuansheng and Nie, Ping and Zou, Kai and Yue, Xiang and Chen, Wenhu},
  journal={arXiv preprint arXiv:2506.03930},
  year={2025}
}