VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
🎯 Overview
VisTA is a reinforcement learning framework for improving visual tool selection in multimodal AI systems. It trains an agent to choose and apply appropriate visual tools for complex reasoning tasks.
Additional visual reasoning datasets with unified tooling interface (coming soon)
🔧 Training
Tool Selection Model Training
To train the visual tool selection model on ChartQA:
cd src/r1-v
./run_grpo.sh
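The script name points to GRPO (Group Relative Policy Optimization), in which each sampled response is scored relative to the other responses drawn for the same prompt. The snippet below is a minimal sketch of that group-relative normalization only, not the code in this repository. The function name, the 0/1 reward values, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Hypothetical helper: subtracts the group mean and divides by the group
    standard deviation (population std here; implementations may differ).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed rewards: 1.0 if the answer was correct after using the selected
# tool(s), 0.0 otherwise. These values are made up for the example.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Samples that beat the group average receive a positive advantage and the rest a negative one, so the policy is pushed toward the tool choices that led to better outcomes within each group.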
🧪 Inference and Evaluation
Generate Tool Predictions
Set model_name_or_path in run_grpo_test.sh to the path of your trained model, then run:
./run_grpo_test.sh
Evaluate Tool-Based Reasoning
Run the following commands to evaluate:
python test_chartqa_gpt.py
python relax_test.py
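ChartQA results are commonly reported as relaxed accuracy, where a numeric answer counts as correct if it is within 5% of the target and a non-numeric answer must match exactly. The name relax_test.py suggests it computes a metric of this kind, but that is an assumption; the sketch below is an independent illustration and its function names, normalization rules, and tolerance default are hypothetical rather than taken from the script.

```python
def _to_float(text):
    """Parse a numeric answer, ignoring a trailing '%' and thousands commas."""
    cleaned = text.strip().rstrip("%").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def relaxed_match(prediction, target, tolerance=0.05):
    """Return True if prediction matches target under the relaxed rule."""
    p, t = _to_float(prediction), _to_float(target)
    if p is not None and t is not None:
        if t == 0:
            # Relative error is undefined for a zero target; require equality.
            return p == 0
        return abs(p - t) / abs(t) <= tolerance
    # Non-numeric answers: case-insensitive exact match (an assumed choice).
    return prediction.strip().lower() == target.strip().lower()


def relaxed_accuracy(predictions, targets, tolerance=0.05):
    """Fraction of predictions that match their targets under the relaxed rule."""
    if not targets:
        return 0.0
    hits = sum(relaxed_match(p, t, tolerance) for p, t in zip(predictions, targets))
    return hits / len(targets)


# Made-up answers for illustration only.
score = relaxed_accuracy(["10.4", "11", "Blue"], ["10", "10", "blue"])
```

In this example the first prediction is within 5% of its target and the third matches after lowercasing, while the second is off by 10%, so two of the three answers count as correct.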
📚 Citation
If you use VisTA in your research, please cite:
@misc{huang2025visualtoolagentvistareinforcementlearning,
title={VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection},
author={Zeyi Huang and Yuyang Ji and Anirudh Sundara Rajan and Zefan Cai and Wen Xiao and Junjie Hu and Yong Jae Lee},
year={2025},
eprint={2505.20289},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.20289},
}