Jaewoo Lee | Archiki Prasad | Justin Chih-Yao Chen | Zaid Khan | Elias Stengel-Eskin | Mohit Bansal
Long-horizon information-seeking tasks require agents to gather and synthesize information across multiple reasoning steps and tool interactions. While process reward models (PRMs) can guide agents by ranking candidate steps at test time, existing PRMs neither capture the richer dimensions of information-seeking steps nor handle the rapidly growing context of long-horizon tasks. We propose PRInTS (Process Reward via Information gain scoring and Trajectory Summary), a generative PRM jointly trained on two key abilities that enable fine-grained guidance despite accumulating context.
🎯 PRInTS as a scorer: evaluates an agent's candidate next trajectory steps based on the summarized context and the current tool response, producing dense scores grounded in the PRM's reasoning across multiple step-quality dimensions (e.g., interpretation of tool outputs, informativeness of tool calls).
📝 PRInTS as a summarizer: recursively updates a compact summary of the information-seeking trajectory, keeping the input length bounded while preserving the key information needed for subsequent scoring (see the sketch after this list).
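Together, these two abilities let PRInTS guide an agent at test time via best-of-N step selection. Below is a minimal sketch of that loop, assuming hypothetical `agent` and `prm` interfaces (with `score` and `summarize` methods) that stand in for the trained PRM; the concrete prompts and parsing live in this repo's training and evaluation code.

```python
# Minimal sketch of PRInTS-guided best-of-N step selection at test time.
# `agent` and `prm` are hypothetical interfaces, not this repo's actual API:
# prm.score plays "PRInTS as a scorer", prm.summarize "PRInTS as a summarizer".

def guided_rollout(agent, prm, task, num_candidates=4, max_steps=20):
    summary, tool_response = "", ""  # compact summary + latest tool output
    for _ in range(max_steps):
        # The agent proposes several candidate next steps.
        candidates = [agent.propose(task, summary, tool_response)
                      for _ in range(num_candidates)]
        # Scorer: dense score per candidate, conditioned on the summarized
        # context and the current tool response; keep the best candidate.
        best = max(candidates,
                   key=lambda c: prm.score(summary, tool_response, c))
        tool_response = agent.execute(best)
        if agent.is_final(best):
            return best
        # Summarizer: recursively fold the executed step into the summary
        # so the PRM's input stays bounded on long horizons.
        summary = prm.summarize(summary, best, tool_response)
    return None
```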
Please follow the installation instructions from verl.
Our data annotation pipeline is built on the Inspect Eval evaluation framework. Please follow the installation instructions from Inspect Eval. Download the QA corpora from the MiroVerse and webagent families, and store them in the /webagent_corpus_directory directory.
For scoring annotation, run

```bash
cd inspect_evals
inspect eval inspect_evals/webagent
```

Save the score annotation logs into /annotated_data_dir/annotation_raw_trajectory.json, then run

```bash
python preprocess_trajectory.py
```

For summary annotation, run

```bash
inspect eval inspect_evals/summary_generator
```

Save the summary annotation logs into /annotated_data_dir/annotation_raw_trajectory_summary.json, then run

```bash
python preprocess_trajectory_summary.py
```

Now construct the datasets for both GRPO and SFT:

```bash
cd ..
python examples/data_preprocess/prints_grpo_dataset.py \
    --data_path /annotated_data_dir/annotated_sample_summary.json \
    --local_dir benchmarks/PRInTS_infogain_annotation \
    --tokenizer_path Qwen/Qwen3-4B \
    --max_prompt_length 6144 \
    --use_scoring --use_comparison
python examples/data_preprocess/prints_sftdataset.py \
    --data_path /annotated_data_dir/annotated_sample_summary.json \
    --local_dir benchmarks/PRInTS_summary_annotation \
    --tokenizer_path Qwen/Qwen3-4B \
    --max_prompt_length 8192
```

Download our PRInTS from Hugging Face:
| Model | Download Link |
|---|---|
| PRInTS | |
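Once downloaded, the checkpoint should load like any Qwen3-based causal LM with 🤗 Transformers. A minimal sketch; `CHECKPOINT` below is a placeholder for the released repo id or local path:

```python
# Load PRInTS as a standard causal LM (sketch; CHECKPOINT is a placeholder).
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "path/to/PRInTS"  # replace with the released checkpoint

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(
    CHECKPOINT, torch_dtype="auto", device_map="auto"
)
```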
We train PRInTS on Qwen3-4B with our alternating SFT-GRPO training schedule.
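Conceptually, the schedule alternates between an SFT phase on the summary-annotation dataset and a GRPO phase on the information-gain scoring dataset built above. A high-level sketch, with hypothetical `run_sft` / `run_grpo` callables standing in for the actual verl entry points:

```python
# High-level sketch of the alternating SFT-GRPO schedule.
# run_sft / run_grpo are hypothetical phase runners; the real pipeline is
# the verl script below.

def alternating_training(model, run_sft, run_grpo, sft_data, grpo_data,
                         rounds=3):
    for _ in range(rounds):
        # SFT phase: supervised fine-tuning on trajectory-summary targets.
        model = run_sft(model, sft_data)
        # GRPO phase: group-relative policy optimization on step-scoring
        # rewards derived from the information-gain annotations.
        model = run_grpo(model, grpo_data)
    return model
```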
To launch training, run

```bash
bash examples/grpo_trainer/run_qwen3-4b_PRInTS_iterative_lr1e6.sh
```

For evaluation, we use the Inspect Eval evaluation pipeline and implement FRAMES, GAIA, and WebWalkerQA on top of the framework.
```bibtex
@article{lee2025prints,
  title={PRInTS: Reward Modeling for Long-Horizon Information Seeking},
  author={Jaewoo Lee and Archiki Prasad and Justin Chih-Yao Chen and Zaid Khan and Elias Stengel-Eskin and Mohit Bansal},
  year={2025},
  journal={arXiv preprint arXiv:2511.19314},
  url={https://arxiv.org/abs/2511.19314},
}
```
