VIDEO-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
This is the official implementation for Video-RTS.
Authors: Ziyang Wang*, Jaehong Yoon*, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal
We introduce Video-RTS, a new approach that improves video reasoning capability with drastically better data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy.
git clone https://github.com/Ziyang412/Video-RTS.git
cd Video-RTS
# build environment
conda create -n video-rts python=3.11
conda activate video-rts
bash setup.sh
# qwen video extraction setting, e.g., max frames, resolutions
# Use the [decord] feature to improve speed
cd src/qwen-vl-utils
pip install -e .[decord]
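
The video extraction settings mentioned in the comment above (max frames, resolution) are typically passed through the qwen-vl-utils message schema when preparing model inputs. Below is a minimal sketch of that usage; the keys (nframes, max_pixels) and the video path are illustrative, so please check the pinned qwen-vl-utils source for the exact options supported in this repo.

# Minimal sketch (assumed keys): controlling sampled frames / resolution via the qwen-vl-utils message schema.
from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",  # placeholder path
                "nframes": 32,            # cap on sampled frames (assumed key)
                "max_pixels": 360 * 420,  # cap on per-frame resolution (assumed key)
            },
            {"type": "text", "text": "Describe the video."},
        ],
    }
]

# Decodes the video with the decord-accelerated backend installed above.
image_inputs, video_inputs = process_vision_info(messages)
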
cd ..

Following Video-R1, please install the provided version of transformers:
unzip transformers-main.zip
cd ./transformers-main
pip install .

Please refer to the official GitHub page of each dataset for video downloading.
For evaluation, we provide the annotation files in ./src/r1-v/Evaluation; please refer to ./src/r1-v/Evaluation/path_coversion.py to update the video paths.
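
For reference, here is a rough, hypothetical sketch of what the path update amounts to; the annotation file name, the video_path field, and the video root are illustrative assumptions, and the actual logic lives in path_coversion.py.

import json

# Hypothetical sketch of remapping video paths in an evaluation annotation file.
# File name, field name ("video_path"), and video root are illustrative assumptions;
# follow ./src/r1-v/Evaluation/path_coversion.py for the actual conversion.
ann_file = "./src/r1-v/Evaluation/example_annotations.json"  # placeholder
new_video_root = "/data/videos"  # where the downloaded videos live

with open(ann_file) as f:
    annotations = json.load(f)

for item in annotations:
    filename = item["video_path"].split("/")[-1]  # keep the file name
    item["video_path"] = f"{new_video_root}/{filename}"  # swap in the local root

with open(ann_file, "w") as f:
    json.dump(annotations, f, indent=2)
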
For training, we provide the training data annotations in ./src/training_data; please refer to the CG-Bench repo for the video data.
We provide the model checkpoint on Hugging Face. Note that the model is trained on only about 2k samples, yet it yields performance similar to training on 6k samples.
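
Once the pinned transformers version above is installed, the checkpoint can be loaded like any Qwen2.5-VL-style model. The snippet below is a minimal sketch under that assumption; the checkpoint ID is a placeholder, so substitute the actual Hugging Face repo name.

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Placeholder checkpoint ID -- replace with the released Video-RTS checkpoint on Hugging Face.
model_id = "ORG/Video-RTS"

# Assumes the checkpoint follows the Qwen2.5-VL interface shipped with the pinned transformers.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
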
We use Open-R1-Video as the training codebase. We provide our modified files in ./src/training_files; please replace the corresponding files in the original repo with them. You could also use Video-R1 as the training codebase; we find the results are similar.
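
For convenience, the file replacement could be scripted roughly as below. This sketch assumes ./src/training_files mirrors the directory layout of your Open-R1-Video clone, which this README does not guarantee, so please verify each destination before overwriting.

import shutil

# Hypothetical helper: overlay the modified training files onto a cloned Open-R1-Video repo.
# Assumes ./src/training_files mirrors the original repo layout (an assumption, not guaranteed).
src_dir = "./src/training_files"
dst_dir = "../Open-R1-Video"  # placeholder path to your clone of the training codebase

shutil.copytree(src_dir, dst_dir, dirs_exist_ok=True)
print(f"Copied modified training files from {src_dir} into {dst_dir}")
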
Please update the input model, input file, and output file names in the provided bash script. After running the inference code, update json_path in cal_results_acc.py to compute the final video reasoning accuracy.
bash src/video_rts_eval.sh
python src/cal_results_acc.py

We thank the developers of Open-R1-Video, Video-R1, Qwen2.5-VL, and TRL for their public code releases.
Please cite our paper if you use our models in your work:
@misc{wang2025videortsrethinkingreinforcementlearning,
title={Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning},
author={Ziyang Wang and Jaehong Yoon and Shoubin Yu and Md Mohaiminul Islam and Gedas Bertasius and Mohit Bansal},
year={2025},
eprint={2507.06485},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.06485},
}
