- 2025-09-23: 🔥🔥🔥 Our paper is accepted at NeurIPS 2025!!!
- 2025-06-16: 🔥 Our paper is accepted at the ICML 2025 EXAIT workshop.
- 2025-05-23: Our paper is available on arXiv.
- 2025-05-19: We release the next version of the code, models, and data sources.
- 2025-03-28: We release the TON repo.
Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision–language models (VLMs). Group Relative Policy Optimization (GRPO) is a prominent recent method that encourages models to generate complete reasoning traces before answering, which increases token usage and computational cost. Inspired by the human-like thinking process—where people skip reasoning for easy questions but think carefully when needed—we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy:
- (i) A supervised fine-tuning (SFT) stage with a simple yet effective “thought dropout” operation, where reasoning traces are randomly replaced with empty thoughts (a minimal sketch is given after the abstract). This introduces a think-or-not format that serves as a cold start for selective reasoning.
- (ii) A GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards.
Experimental results show that TON can reduce completion length by up to 90% compared to vanilla GRPO, without sacrificing performance and in some cases even improving it. Further evaluations across diverse vision-language tasks—covering a range of reasoning difficulties with both 3B and 7B models—consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches.
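For intuition, here is a minimal sketch of the thought-dropout operation. It assumes SFT targets are formatted as `<think>...</think><answer>...</answer>` strings; the tag names, the empty-thought placeholder, and the dropout ratio are illustrative, so refer to the released SFT datasets below for the exact format used in TON.

```python
import random

def thought_dropout(target: str, p: float = 0.5) -> str:
    """Randomly replace the reasoning trace in an SFT target with an empty thought.

    Assumes targets look like "<think>...</think><answer>...</answer>"; with
    probability p the content between the think tags is dropped, so the model
    also learns from examples that answer directly without reasoning.
    """
    if random.random() >= p:
        return target  # keep the full reasoning trace
    start = target.find("<think>")
    end = target.find("</think>")
    if start == -1 or end == -1:
        return target  # no think span found; leave the target unchanged
    empty_thought = "<think>\n\n</think>"  # illustrative placeholder for "no thinking"
    return target[:start] + empty_thought + target[end + len("</think>"):]

# Example: build a mixed SFT set where roughly half the targets skip reasoning.
# sft_targets = [thought_dropout(t, p=0.5) for t in raw_targets]
```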
We release our models on both Hugging Face and ModelScope.
| Hugging Face | ModelScope |
|---|---|
| 🤗 TON-3B-AITZ | TON-3B-AITZ |
| 🤗 TON-3B-CLEVR | TON-3B-CLEVR |
| 🤗 TON-3B-Math | TON-3B-Math |
| 🤗 TON-7B-Math | TON-7B-Math |
We release our training datasets on both Hugging Face and ModelScope.
| Hugging Face | ModelScope |
|---|---|
| 🤗 AITZ-SFT | AITZ-SFT |
| 🤗 AITZ-GRPO | AITZ-GRPO |
| 🤗 Math-SFT | Math-SFT |
| 🤗 Math-GRPO | Math-GRPO |
- Supported Evaluations
- AITZ: Mobile Agent Navigation Problems
- GeoQA-Test: Geometry Reasoning
The comparison of completion length and skip-think ratio during training between our TON and vanilla GRPO is shown below:
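Here the skip-think ratio is simply the fraction of sampled completions whose think span is empty. A minimal way to compute it, assuming completions use `<think>...</think>` tags (an assumption about the output format), could look like:

```python
import re

def skip_think_ratio(completions: list[str]) -> float:
    """Fraction of completions whose <think> block is empty, i.e. the model skipped reasoning."""
    def skipped(completion: str) -> bool:
        match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
        return match is not None and match.group(1).strip() == ""
    return sum(skipped(c) for c in completions) / max(len(completions), 1)

# Example:
# skip_think_ratio(["<think>\n\n</think><answer>3</answer>",
#                   "<think>Count the red cubes...</think><answer>7</answer>"])  # -> 0.5
```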
```bash
conda create -n r1-v python=3.11
conda activate r1-v
bash setup.sh
```
- Raw datasets: download and unzip the raw data of the three training datasets into the `dataset` folder and the evaluation benchmark into the `src/eval` folder using the following links.
| Dataset | Download link |
|---|---|
| AITZ | https://drive.usercontent.google.com/download?id=12xOV2m62fBUFLhMcWIsFiC6zwV7a2RhI&export=download&authuser=0 |
| GeoQA | https://huggingface.co/datasets/leonardPKU/GEOQA_R1V_Train_8K/viewer |
| CLEVR (item counting) | https://huggingface.co/datasets/leonardPKU/clevr_cogen_a_train |
| Super-CLEVR (item counting) | The official dataset link currently returns a 404 error 😭. |
| GeoQA-test | https://huggingface.co/datasets/Luckyjhg/Geo170K |
- Download GeoQA-test:
```bash
cd src/eval
git lfs install
git clone https://huggingface.co/datasets/Luckyjhg/Geo170K
cd Geo170K
unzip images.zip
```
- Preprocessed SFT datasets: We provide the SFT-stage datasets used in TON, built with a random thought-dropout ratio of 0.5. Run the following commands to download them:
```bash
modelscope download --dataset wjqkoko/TON-AITZ-SFT --local_dir <your local path>
modelscope download --dataset wjqkoko/TON-Math-SFT --local_dir <your local path>
```
- AITZ
- Preprocess the train split (general domain) of AITZ:
```bash
python src/eval/aitz_evaluate/process_data.py
```
- Convert the processed data into the AITZ SFT and GRPO formats in the `dataset` folder:
```bash
python src/eval/aitz_evaluate/sft_grpo_data.py
```
TODO: CLEVR and GeoQA
Supported models: Qwen2.5-VL-3B/7B
```bash
modelscope download --model Qwen/Qwen2.5-VL-7B-Instruct --local_dir <your local path>
modelscope download --model Qwen/Qwen2.5-VL-3B-Instruct --local_dir <your local path>
```
We also provide the models trained with our method on AITZ, CLEVR, and GeoQA in TON:
```bash
modelscope download --model wjqkoko/TON-7B-Math --local_dir <your local path>
modelscope download --model wjqkoko/TON-3B-Math --local_dir <your local path>
modelscope download --model wjqkoko/TON-3B-AITZ --local_dir <your local path>
modelscope download --model wjqkoko/TON-3B-CLEVR --local_dir <your local path>
```
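If you just want to try one of the downloaded checkpoints, the snippet below is a minimal inference sketch following standard Qwen2.5-VL usage. It is not part of this repo; it assumes a recent transformers release with Qwen2.5-VL support plus the qwen-vl-utils package, and the checkpoint path, image, and question are placeholders.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Placeholder path: any downloaded TON checkpoint.
model_path = "<your local path>/TON-3B-CLEVR"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/demo.jpg"},  # placeholder image
        {"type": "text", "text": "How many objects are in the image?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
generated = generated[:, inputs.input_ids.shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```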
- SFT stage: We implement the SFT stage with LLaMA-Factory. Thanks for the comprehensive tools. 😊
- GRPO stage: The training commands are as follows; you need to modify them with
  - your local model path (`QWEN_PATH`),
  - your dataset path (`HF_DATASET`),
  - and your output save path (`OUTPUT_DIR`).

  We also support wandb for monitoring the training process; set the run name via `RUN_NAME`.
For Counting/GeoQA:
```bash
sh src/scripts/run_grpo_vllm_qwen25vl.sh
```
For AITZ:
```bash
sh src/scripts/run_grpo_vllm_qwen25vl_gui.sh
```
Note
- If you encounter an OOM error, you can try reducing `num_generations`.
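For intuition about the task-aware outcome rewards optimized in the GRPO stage, below is an R1-V-style accuracy + format reward sketch. It is illustrative only (the actual reward functions used by TON are defined in the training code); note that the format check also accepts an empty `<think></think>` block, so skipping reasoning is never penalized by the format reward.

```python
import re

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the content of <answer>...</answer> matches the ground truth, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    pred = (match.group(1) if match else completion).strip()
    return 1.0 if pred == ground_truth.strip() else 0.0

def format_reward(completion: str) -> float:
    """Small bonus if the completion follows <think>...</think><answer>...</answer>;
    an empty think block also counts, i.e. "not thinking" is allowed."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 0.5 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0
```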
- AITZ
We currently provide evaluation code for the training datasets in `src/scripts/llama_factory_test.sh`.
You need to first download LLaMA-Factory and configure the environment following its README.
Then convert the AITZ raw data to the SFT format following the same steps as above, and edit the dataset name and path in `dataset_format.json`.
Generate predicted results on AITZ by running:
```bash
sh src/scripts/llama_factory_test.sh
```
Score the generated outputs by running:
```bash
python src/eval/aitz_evaluate/test_qwen25vl_aitz_from_json.py
```
- GeoQA / Counting
To evaluate on GeoQA and counting, first transform the data format by running:
```bash
python src/eval/parquet_data.py
```
- Evaluate your model on GeoQA (math):
```bash
python src/eval/test_qwen25vl_geoqa.py
```
- Evaluate your model on counting (Super-CLEVR):
```bash
python src/eval/test_qwen25vl_counting_superclevr.py
```
Jiaqi Wang · Qinghong Lin · Binghui Xie · Dongchi Huang · Ming Hu · Xiaojun Guo · Qixun Wang · Qiguang Chen · James Cheng · Mike Z. Shou
We sincerely thank DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal, and R1-V (our initial codebase). We sincerely thank Dongchi Huang for his invaluable guidance on the code and for providing essential computational resources. We also appreciate Binghui Xie’s insightful discussion on topic selection and idea suggestions. Additionally, we are grateful to Qiguang Chen and Yuxuan Wan for their thoughtful and constructive feedback on this paper. Finally, we extend our gratitude to Xiaojun Guo and Qixun Wang for their valuable advice on visual reasoning and the GRPO series methods.
If you find this work useful, please cite us:
```bibtex
@misc{wang2025think,
      title={Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models},
      author={Jiaqi Wang and Kevin Qinghong Lin and James Cheng and Mike Zheng Shou},
      year={2025},
      eprint={2505.16854},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```





