Cross-Domain Reasoning Transfer
To understand how reasoning capabilities generalize under RL, we ran controlled experiments with Guru, comparing RL on a single reasoning domain against RL on a mixed-domain corpus. These experiments used Guru-18K, an experimental dataset with 3K samples from each of the six domains.
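A balanced subset like Guru-18K can be drawn with a few lines of code. The sketch below is illustrative only: the record schema (a `domain` key per sample), the lowercase domain names, and the function name `build_balanced_subset` are assumptions, not part of the Guru release.

```python
import random
from collections import defaultdict

# Assumed lowercase names for the six training domains.
DOMAINS = ["math", "code", "science", "logic", "simulation", "tabular"]


def build_balanced_subset(records, per_domain=3000, seed=0):
    """Sample `per_domain` records from each domain, then shuffle them together."""
    rng = random.Random(seed)

    # Group records by their (hypothetical) "domain" field.
    by_domain = defaultdict(list)
    for rec in records:
        by_domain[rec["domain"]].append(rec)

    subset = []
    for domain in DOMAINS:
        pool = by_domain[domain]
        if len(pool) < per_domain:
            raise ValueError(f"{domain}: only {len(pool)} samples available")
        # Sample without replacement so no example is duplicated.
        subset.extend(rng.sample(pool, per_domain))

    # Uniformly mix the domains for mixed-domain training.
    rng.shuffle(subset)
    return subset
```

With `per_domain=3000` and six domains this yields 18,000 samples. A single-domain run would instead keep only the 3K records of one domain.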
Differential Transferability
Math, Code, and Science benchmarks showed consistent, substantial gains from training on other domains, possibly because the base model saw extensive math, code, and science text during pretraining. Other domains showed limited cross-domain gains. Within Math and Code, easier tasks benefited from positive transfer more readily than harder benchmarks. Training on a uniformly mixed dataset often matched or exceeded single-domain performance.
Reward and Response-Length Dynamics
In single-domain training, outputs for Code, Logic, and Tabular tasks became shorter, while outputs for Science and Math became longer. Joint training produced a steep initial climb in reward and could reshape these length dynamics.
Effects of Training Data Difficulty
| | Math (in-domain) | | | Code & Tabular (cross-domain) | | | |
|---|---|---|---|---|---|---|---|
| | MATH500 | AMC | AIME24 | HumanEval | LiveCodeBench | HiTab | MultiHiertt |
| | 75.8 | 52.1 | 15.8 | 82.3 | 11.1 | 56.5 | 32.0 |
| | 78.6 | 58.4 | 21.7 | 73.1 | 10.7 | 53.5 | 35.5 |
| Δ (+/-) | +2.8 | +6.3 | +5.9 | -9.2 | -0.4 | -3.0 | +3.5 |
Training on harder math data improved in-domain math performance but could degrade performance on easier cross-domain tasks. For beneficial cross-domain transfer, a balanced distribution of difficulties or explicit inclusion of cross-domain data may be more effective.
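The delta row of the table can be recomputed directly from the two score rows. The sketch below does so, taking the second score row as the run trained on harder math data (an assumption based on the accompanying text); the variable names are illustrative.

```python
benchmarks = ["MATH500", "AMC", "AIME24", "HumanEval", "LiveCodeBench", "HiTab", "MultiHiertt"]
first_row = [75.8, 52.1, 15.8, 82.3, 11.1, 56.5, 32.0]
second_row = [78.6, 58.4, 21.7, 73.1, 10.7, 53.5, 35.5]  # assumed: harder math data

# Per-benchmark change, rounded to one decimal like the table.
deltas = {
    name: round(after - before, 1)
    for name, before, after in zip(benchmarks, first_row, second_row)
}

in_domain = ["MATH500", "AMC", "AIME24"]
for name, delta in deltas.items():
    group = "in-domain" if name in in_domain else "cross-domain"
    print(f"{name:14s} {group:12s} {delta:+.1f}")
```

All three math benchmarks gain, while three of the four cross-domain benchmarks drop, with HumanEval losing the most (-9.2).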
Data Construction
Experiment Results
We trained 7B and 32B models on the full Guru dataset to demonstrate the practical impact of multi-domain data. We used verl as the RL training framework and GRPO as the algorithm. The 7B model was trained for 2 epochs on 4 nodes (8 Hopper GPUs each) and the 32B model on 16 nodes for 2 epochs.
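For readers unfamiliar with GRPO, its core idea is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows that group-relative advantage in its commonly described form. It is not verl's API, and details such as the standard-deviation estimator and the `eps` constant are assumptions that may differ from the training setup used here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (assumed)
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses to one prompt, rewarded 1 if verified correct, else 0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct responses receive positive advantages and incorrect ones negative. When every response in a group gets the same reward, all advantages are zero and the group contributes no gradient signal.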
| Domain | Benchmark | Guru 7B | General Reasoner 7B | ORZ 7B | SimpleRL 7B | Guru 32B | ORZ 32B | SimpleRL 32B |
|---|---|---|---|---|---|---|---|---|
| Math | AIME24(avg@32) | 17.50 | 17.08 | 16.25 | 15.60 | 34.89 | 47.50 | 27.20 |
| | MATH500 | 77.25 | 70.40 | 80.80 | 87.00 | 86.00 | 89.80 | 89.60 |
| Code | LiveCodeBench(avg@4) | 16.49 | 8.51 | 5.47 | 6.72 | 29.30 | 22.04 | 19.80 |
| | HumanEval(avg@4) | 82.62 | 61.12 | 67.38 | 58.08 | 90.85 | 84.30 | 81.25 |
| | MBPP | 70.00 | 39.80 | 48.40 | 49.60 | 78.80 | 74.20 | 76.75 |
| Science | GPQA-diamond(avg@4) | 40.78 | 38.64 | 37.63 | 35.98 | 50.63 | 55.67 | 46.46 |
| | SuperGPQA | 31.80 | 30.64 | 29.75 | 27.29 | 43.60 | 46.05 | 37.73 |
| Logic | ARC-AGI(avg@4) | 3.31 | 0.75 | 0.00 | 0.50 | 7.63 | 2.31 | 5.25 |
| | Zebra Puzzle(avg@4) | 39.40 | 0.07 | 1.00 | 0.62 | 45.21 | 0.54 | 1.16 |
| Simulation | CodeI/O(avg@4) | 15.63 | 7.13 | 5.13 | 6.63 | 12.63 | 3.75 | 9.75 |
| | CruxEval-I | 61.72 | 63.63 | 69.38 | 56.25 | 80.63 | 71.13 | 72.63 |
| | CruxEval-O | 71.28 | 56.50 | 65.88 | 58.31 | 88.75 | 82.38 | 67.75 |
| Tabular | FinQA | 34.70 | 34.33 | 37.60 | 35.10 | 46.14 | 45.20 | 45.41 |
| | HiTab | 74.20 | 54.40 | 54.10 | 50.40 | 82.00 | 63.30 | 69.00 |
| | MultiHiertt(avg@4) | 44.94 | 31.62 | 38.10 | 37.57 | 55.28 | 52.83 | 52.83 |
| Others | IFEval | 35.81 | 39.56 | 32.72 | 36.69 | 55.45 | 38.26 | 55.27 |
| | LiveBench | 18.57 | 19.76 | 12.64 | 15.20 | 34.30 | 28.78 | 28.33 |
| Average Score | | 43.29 | 33.76 | 35.42 | 33.97 | 54.24 | 47.53 | 46.25 |
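
The Average Score row is consistent with an unweighted mean over the 17 benchmarks. The short check below reproduces the reported averages for the two Guru columns from the per-benchmark scores in the table; the variable names are illustrative.

```python
from statistics import fmean

# Per-benchmark scores copied from the table, in row order (17 benchmarks).
guru_7b = [17.50, 77.25, 16.49, 82.62, 70.00, 40.78, 31.80, 3.31, 39.40,
           15.63, 61.72, 71.28, 34.70, 74.20, 44.94, 35.81, 18.57]
guru_32b = [34.89, 86.00, 29.30, 90.85, 78.80, 50.63, 43.60, 7.63, 45.21,
            12.63, 80.63, 88.75, 46.14, 82.00, 55.28, 55.45, 34.30]

avg_7b = round(fmean(guru_7b), 2)
avg_32b = round(fmean(guru_32b), 2)
print(avg_7b, avg_32b)  # 43.29 54.24
```

Because the mean is unweighted, domains with more benchmarks (Code, Simulation, and Tabular have three each) carry more weight in the average than domains with two.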
Pass@k Curves
Pass@k behavior is highly task-dependent: improvements on math tasks (e.g., AIME) may largely draw on capabilities the base model already has, whereas tasks like Zebra Puzzle show genuine expansion of reasoning ability. Model scale also matters: larger models (32B) show more consistent gains than smaller ones (7B). Decoding hyperparameters also significantly affect Pass@k, with higher temperature and top-p increasing exploration and improving performance at larger k. Together, these findings suggest that Pass@k reflects both model and sampling dynamics and should be interpreted cautiously.
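For reference, Pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: draw n samples per problem, count the c correct ones, and estimate the probability that at least one of k randomly chosen samples is correct. The sketch below implements that estimator; this page does not state which estimator produced the curves, so treat it as an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples drawn for the problem, c: correct samples, k: attempt budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(4, 1, 2))  # 0.5
```

A benchmark-level score is the mean of this quantity over problems. Since n samples are drawn under a particular temperature and top-p, the estimate depends on the sampling configuration as well as the model.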
BibTeX
@misc{cheng2025revisiting,
title = {Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective},
author = {Zhoujun Cheng and Shibo Hao and Tianyang Liu and Fan Zhou and Yutao Xie and Feng Yao and Yuexin Bian and Yonghao Zhuang and Nilabjo Dey and Yuheng Zha and Yi Gu and Kun Zhou and Yuqi Wang and Yuan Li and Richard Fan and Jianshu She and Chengqian Gao and Abulhair Saparov and Haonan Li and Taylor W. Killian and Mikhail Yurochkin and Zhengzhong Liu and Eric P. Xing and Zhiting Hu},
journal = {arXiv preprint arXiv:2506.14965},
year = {2025},
doi = {10.48550/arXiv.2506.14965},
url = {https://arxiv.org/abs/2506.14965}
}