We built the training and evaluation environments on top of OpenRLHF and vLLM. We packaged our conda environments with conda-pack and uploaded them to Hugging Face. You can build the training and evaluation environments with the following bash commands.
First, use the which conda command to check the local installation path of conda (referred to below as {conda_path}).
huggingface-cli download Lux0926/ASPRM-Training-Evaluation-Environment --local-dir {your_save_path}
mkdir /{conda_path}/asprm_train
cd {your_save_path}
tar -xzvf asprm_train.tar.gz -C /{conda_path}/asprm_train

mkdir /{conda_path}/asprm_eval
cd {your_save_path}
tar -xzvf asprm_eval.tar.gz -C /{conda_path}/asprm_eval
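Note that conda-pack archives typically need a one-time conda-unpack after extraction so that the hard-coded prefix paths inside the environment point at your machine. The following is only a sketch under that assumption, not part of the original setup scripts; it uses the extraction paths from the commands above.

# Fix up the unpacked environments once (conda-unpack ships inside the archives).
source /{conda_path}/asprm_train/bin/activate
conda-unpack
source /{conda_path}/asprm_train/bin/deactivate
source /{conda_path}/asprm_eval/bin/activate
conda-unpack
source /{conda_path}/asprm_eval/bin/deactivate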
The train folder contains the scripts used for training the PRMs. To replicate our training process, run these scripts after setting up the training environment. We use the LeetCodeDataset to perform supervised fine-tuning (SFT) on deepseek-coder-6.7b-base.
cd train/
conda activate asprm_train
bash train_ASPRM-M.sh

The script launches training with the following deepspeed command:

CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --module openrlhf.cli.train_prm \
--save_path ./checkpoint/{your_prm_name} \
--save_steps 1000 \
--logging_steps 10 \
--eval_steps 1000 \
--train_batch_size 32 \
--micro_train_batch_size 4 \
--pretrain {path_to_your_basemodel} \
--bf16 \
--max_epochs 1 \
--max_len 8192 \
--zero_stage 2 \
--learning_rate 5e-6 \
--dataset {path_to_prm_train_dataset} \
--input_key query \
--label_key response \
--flash_attn \
--packing_samples \
--wandb_group prm \
--placeholder_token " ки"

We did not use the --reward_tokens parameter, as omitting it typically leads to better performance.
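Before launching, you may want to sanity-check the processed PRM training data. The snippet below is only an illustrative sketch: the train.jsonl file name is an assumption, and if {path_to_prm_train_dataset} refers to a Hugging Face dataset rather than a local directory this check does not apply.

# Hypothetical check: the number of lines containing the " ки" placeholder should
# match the total number of training samples.
wc -l {path_to_prm_train_dataset}/train.jsonl
grep -c "ки" {path_to_prm_train_dataset}/train.jsonl
head -n 1 {path_to_prm_train_dataset}/train.jsonl | python -m json.tool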
Please go to the evaluation/ folder.
To reproduce the BON evaluation results, please go to the evaluation/math/BON folder. First, use the run_all_eval_server.sh script to specify the PRM and start the PRM server.
cd evaluation/math/BON
conda activate asprm_eval
bash run_all_eval_server.sh

In run_all_eval_server.sh, you can specify the parameters as shown below.
bash {run_eval_server_llama.sh/run_eval_server_mistral.sh/run_eval_server_er_prm.sh/run_eval_server_shepherd.sh} {CUDA_VISIBLE_DEVICES} {prm_path} {prm_server_port} &

After starting the PRM server, run the run_eval_code.sh script to initiate the BON evaluation. You can specify the following parameters within it.
python eval.py --benchmark_type {llama_data/mistral_data} --bon_size {4/8/16/32/64} --input_data_path {BON_Evaluation_Data_Path} --reward_port {running_prm_server_port} --eval_type {confidence/hard/random} --dataset_type {gsm8k/math_500} --prm_model_path {prm_model_path} &

The complete BON evaluation dataset can be found here, and the evaluation results will be saved in math_result.jsonl. Note that in the BON evaluation dataset, data with the with_step suffix is used to evaluate the performance of Shepherd, while data with the remove_enter suffix is used to evaluate the performance of ER-PRM.
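For reference, a fully specified call might look like the following; the reward port and the input file name are hypothetical and should match your running PRM server and your local copy of the BON data.

# Hypothetical example: BON-64 on MATH-500 with Mistral-generated candidates,
# scored against a PRM server listening on port 8000.
python eval.py --benchmark_type mistral_data --bon_size 64 \
    --input_data_path ./BON_data/math500_mistral.jsonl \
    --reward_port 8000 --eval_type confidence --dataset_type math_500 \
    --prm_model_path {prm_model_path} &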
To reproduce the TVD evaluation results, please go to the evaluation/math/TVD folder.
First, run the run_all_server.sh script to start the task_server and reward_server.
bash run_all_server.sh

You can specify the following parameters.
# Task model
bash run_task_server.sh {CUDA_VISIBLE_DEVICES} {task_model_path} {task_model_server_port} {task_model_path} &
# PRM model
bash {run_reward_server_mistral.sh/run_reward_server_llama.sh/run_reward_server_er_prm.sh/run_reward_server_shepherd.sh} {CUDA_VISIBLE_DEVICES} {prm_model_path} {prm_model_server_port} &

Then, run the run_tvd.sh script to initiate the TVD evaluation.
python tvd.py \
--temperature {temperature} \
--bon_size {bon_size:4/8/...} \
--confidence_threshold {confidence_threshold:0-100} \
--task_model_path {task_model_path} \
--reward_model_path {prm_model_path} \
--input_data_path {GSM8k_test_idx.jsonl/GSM_symbolic_test_idx.jsonl/math500_test_with_idx.jsonl} \
--output_data_path {prm_name}_con{confidence_threshold}_{task_model_name}.jsonl \
--task_port {task_model_server_port} \
--reward_port {prm_model_server_port} &
wait

Upon completion of the run, use get_score.ipynb to retrieve the final score for the TVD evaluation.
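For reference, a filled-in run of run_tvd.sh might look like the following; the temperature, BON size, confidence threshold, output file name, and ports are illustrative values, not the settings used in the paper.

# Hypothetical example: TVD on GSM8K with BON size 8 and a confidence threshold of 85,
# assuming a task server on port 8001 and a PRM server on port 8002.
python tvd.py \
    --temperature 0.7 \
    --bon_size 8 \
    --confidence_threshold 85 \
    --task_model_path {task_model_path} \
    --reward_model_path {prm_model_path} \
    --input_data_path GSM8k_test_idx.jsonl \
    --output_data_path asprm_con85_mistral.jsonl \
    --task_port 8001 \
    --reward_port 8002 &
wait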
The evaluation of Code is similar to that of Math, but all final scores are obtained offline.
The complete BON evaluation dataset can be found here. As with the math evaluation, first use the run_all_eval_server.sh script to specify the PRM and start the PRM server.
# run_all_eval_server.sh
bash run_eval_server_ds.sh {CUDA_VISIBLE_DEVICES} {prm/orm_path} {prm_server_port} &

As before, after starting the PRM server, run the run_eval_code.sh script to initiate the BON evaluation.
# run_eval_code.sh
python {eval_lct.py/eval_lcb.py} --benchmark_type {ds_data/ds_data_orm} --bon_size {4/8/16/32/64} --input_data_path {BON_Evaluation_Data_Path} --reward_port {running_prm_server_port} --eval_type {confidence/hard/random} --dataset_type {leetCoTE/LiveCodeBench} --prm_model_path {prm_model_path} &

Upon completion, all BON evaluation results will be saved.
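As an illustration, a LiveCodeBench BON run could look like the following; the reward port and the input file name are placeholders to replace with your own.

# Hypothetical example: BON-16 on LiveCodeBench with the ds_data benchmark type,
# assuming the PRM server started above is listening on port 8000.
python eval_lcb.py --benchmark_type ds_data --bon_size 16 \
    --input_data_path ./BON_data/livecodebench_ds.jsonl \
    --reward_port 8000 --eval_type confidence --dataset_type LiveCodeBench \
    --prm_model_path {prm_model_path} &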
Similarly, run the run_all_server.sh script to start the task_server and reward_server.
# Task model
bash run_task_server.sh {CUDA_VISIBLE_DEVICES} {task_model_path} {task_model_server_port} {task_model_path} &
# PRM model
bash run_reward_server_ds.sh {CUDA_VISIBLE_DEVICES} {prm_model_path} {prm_model_server_port} &

Then, to evaluate performance on the LCB dataset, run run_tvd_lcb.sh.
# run_tvd_lcb.sh
(python tvd_lcb.py \
--temperature {temperature} \
--bon_size {bon_size:4/8/...} \
--confidence_threshold {confidence_threshold:0-100} \
--task_model_path {task_model_path} \
--reward_model_path {prm_model_path} \
--input_data_path livecodebench_test_idx.jsonl \
--output_data_path prm_con{confidence_threshold}_lcb.jsonl \
--task_port {task_model_server_port} \
--reward_port {prm_model_server_port} ) &
wait

If you want to evaluate performance on the LCT dataset, run run_tvd_lct.sh.
# run_tvd_lct.sh
(python tvd_lct.py \
--temperature {temperature} \
--bon_size {bon_size:4/8/...} \
--confidence_threshold {confidence_threshold:0-100} \
--task_model_path {task_model_path} \
--reward_model_path {prm_model_path} \
--input_data_path leetCoTE_test_idx.jsonl \
--output_data_path prm_con{confidence_threshold}_lct.jsonl \
--task_port {task_model_server_port} \
--reward_port {prm_model_server_port} ) &
wait

Upon completion, all TVD evaluation results will be saved.
Both BON and TVD evaluations will save the evaluation result files. To obtain the final scores for the evaluation on the LCT dataset, please refer to the following instructions.
git clone https://github.com/Lux0926/ASPRM_LCT_Eval.git
cd ASPRM_LCT_Eval/

To obtain the BON evaluation scores, run test_leetcode_eval.sh. To obtain the TVD evaluation scores, run test_leetcode_tvd.sh.
# test_leetcode_eval.sh
python src/main_eval.py --model_name "model" \
--task "LeetCodeTest" \
--save "sft_old_test" \
--num_gpus 4 \
--num_samples 1 \
--k 1 \
--temperature 0.0 \
--num_workers 32 \
--batch_size 200 \
--max_tokens 8192 \
--model_type "Chat" \
--prompt_type "Instruction" \
--prompt_prefix "" \
--prompt_suffix "" \
--trust_remote_code \
--input_file {your_evaluation_result_path} \
--output_file {your_evaluation_result_name}

# test_leetcode_tvd.sh
python src/main_tvd.py --model_name "model" \
--task "LeetCodeTest" \
--save "sft_old_test" \
--num_gpus 4 \
--num_samples 1 \
--k 1 \
--temperature 0.0 \
--num_workers 32 \
--batch_size 200 \
--max_tokens 8192 \
--model_type "Chat" \
--prompt_type "Instruction" \
--prompt_prefix "" \
--prompt_suffix "" \
--trust_remote_code \
--input_file {your_evaluation_result_path} \
--output_file {your_evaluation_result_name}

Before running these two scripts, simply modify the --input_file parameter to specify the file for which you want to obtain the scores. The --output_file can be the same as the --input_file. No other parameters need to be modified.
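For example (with a hypothetical result file name), a scoring run only needs the two file parameters pointed at the result you want to score:

# Hypothetical usage: score one evaluation result file in place.
bash test_leetcode_eval.sh   # --input_file/--output_file both set to, e.g., ./leetCoTE_bon16_result.jsonl
bash test_leetcode_tvd.sh    # same idea for a TVD result file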
To obtain the final scores for the evaluation on the LCB dataset, please use the official LiveCodeBench GitHub repository and refer to its "Custom Evaluation" section.
python -m lcb_runner.runner.custom_evaluator --custom_output_file {your_evaluation_result_path} --release_version release_v4 &

However, please remember that before doing so, you need to use process_lcb_result.ipynb to convert the evaluation results into the format required by LiveCodeBench.
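Putting the two steps together, a hedged sketch of the LCB scoring flow (assuming the notebook's input and output paths are already set, and that it writes a converted file such as lcb_converted_results.json, a hypothetical name):

# Run the conversion notebook non-interactively, then score the converted file.
jupyter nbconvert --to notebook --execute process_lcb_result.ipynb
python -m lcb_runner.runner.custom_evaluator --custom_output_file lcb_converted_results.json --release_version release_v4 &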
If you find our work useful for your research, please consider citing:
@misc{liu2025adaptivestepautomaticallydividingreasoning,
title={AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence},
author={Yuliang Liu and Junjie Lu and Zhaoling Chen and Chaofeng Qu and Jason Klein Liu and Chonghan Liu and Zefan Cai and Yunhui Xia and Li Zhao and Jiang Bian and Chuheng Zhang and Wei Shen and Zhouhan Lin},
year={2025},
eprint={2502.13943},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2502.13943},
}