Configure the model and data path in `./scripts/collect_data.sh`, then run:

```shell
sh ./scripts/collect_data.sh
```
We also provide the SFT and RL data in `data/train_data/sft_qwen2.5_math_7B.json` and `data/train_data/rl_data_qwen2.5.jsonl` to reproduce the Qwen2.5-Math-7B results reported in our paper.
Our datasets are now also available via Hugging Face!
3. SFT Training 🔥
Configure your data and model path, then run:

```shell
cd ./code/scripts
sh train_sft.sh
```
4. Online RL Training 🚀
Configure your data and model path, then run:

```shell
sh ./code/scripts/train_rl.sh
```
Use the following config for outcome-level training:

```shell
--use_instance_level True
--kl_coef 0.01  # for Qwen2.5-Math-7B
--rl_data_path ./data/train_data/rl_data_qwen2.5.jsonl  # for Qwen2.5-Math-7B
```
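For instance, an outcome-level run could be launched as sketched below. Note this assumes `train_rl.sh` forwards command-line flags to the training entry point; if it does not, set these options inside the script itself:

```shell
# Sketch of an outcome-level RL run for Qwen2.5-Math-7B.
# Assumption: train_rl.sh passes extra flags through to the trainer.
sh ./code/scripts/train_rl.sh \
    --use_instance_level True \
    --kl_coef 0.01 \
    --rl_data_path ./data/train_data/rl_data_qwen2.5.jsonl
```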
Use the following config for process-level training:

```shell
--use_instance_level False
--kl_coef 0.05
--rl_data_path ./data/train_data/rl_data_qwen2.5.jsonl  # for Qwen2.5-Math-7B
```
5. Offline RL Training 💼
Rejection Sampling and Prompt Filtering
For offline sampling of rollouts, run the following script, specifying the prompt dataset (containing the problem and answer), the model path, and the storage path:

```shell
sh ./sample/sample_all.sh
```
The format of the prompt dataset should follow the reference file `./data/train_data/rl_data_offline.jsonl`.
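As a hedged illustration, each line of the prompt dataset is a JSON object carrying the problem and its answer. The exact field names below are an assumption; compare against `./data/train_data/rl_data_offline.jsonl` before sampling:

```shell
# Write a hypothetical one-record prompt file and verify it parses as JSON.
# Field names ("problem", "answer") are assumptions, not the confirmed schema.
cat > /tmp/prompt_example.jsonl <<'EOF'
{"problem": "Compute 12 * 8.", "answer": "96"}
EOF
python -c "import json; r = json.loads(open('/tmp/prompt_example.jsonl').read()); print(sorted(r.keys()))"
```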
Configure your data path, then run rejection sampling and prompt filtering:

```shell
sh ./scripts/process_offline_trainset.sh
```
Training Script
Configure your data and model path, then run:

```shell
sh ./scripts/train_offline_rl.sh
```
6. Evaluation
Please refer to `./tools/qwen_eval/eval/README.md`.
🌟 Cite
```bibtex
@article{ma2025s,
  title={S$^{2}$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning},
  author={Ma, Ruotian and Wang, Peisong and Liu, Cheng and Liu, Xingyan and Chen, Jiaqi and Zhang, Bang and Zhou, Xin and Du, Nan and Li, Jia},
  journal={arXiv preprint arXiv:2502.12853},
  year={2025}
}
```