ProBench collects competition problems from Codeforces, Luogu, and NowCoder to evaluate models' code reasoning capabilities in competitive programming. It ensures code robustness through online code evaluation and provides a comprehensive analysis of models' code reasoning abilities.
Usage
Data. All problem descriptions are provided in the codeforces, luogu, and nowcoder folders, and statistical information for these problems is available in pred/problem_list.json.
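For instance, the statistics file can be inspected with standard JSON tooling. This is a minimal sketch that assumes only that pred/problem_list.json is valid JSON; check the file itself for its actual schema before relying on specific fields:

```python
import json

# Load the per-problem statistics shipped with the benchmark.
# Assumption: the file parses as a JSON array/object of problem entries.
with open("pred/problem_list.json", encoding="utf-8") as f:
    problems = json.load(f)

print(f"Loaded statistics for {len(problems)} entries")
```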
Get Responses. First, use your model to generate responses and solution code for the provided problems. We offer generation code for vLLM and API backends (e.g., OpenAI) in pred/get_response.py. If you use your own generation code, refer to generate_prompts and save_response in pred/utils.py to ensure a unified output format (see the sketch below).
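A custom generation loop might look like the following. This is a hypothetical sketch, not the actual pred/get_response.py: the argument and return shapes of generate_prompts and save_response are assumptions to be checked against pred/utils.py, and the model name is a placeholder.

```python
from openai import OpenAI
from pred.utils import generate_prompts, save_response

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumption: generate_prompts yields (problem_id, prompt) pairs,
# one per benchmark problem.
for problem_id, prompt in generate_prompts("pred/problem_list.json"):
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumption: save_response writes the unified output format
    # expected by the evaluation pipeline.
    save_response(problem_id, completion.choices[0].message.content)
```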
Code Evaluation. Due to platform restrictions, we cannot publicly share our submission scripts. Please send the model-generated data under data/model to yl.shadow.yl@gmail.com, and we will return the code evaluation results as soon as possible.
Leaderboard
| Rank | Model | Size | Reasoning | Pass@1 (%) |
|------|-------|------|-----------|------------|
| 1 | QwQ-32B-Preview | 32B | 1 | 20.93 |
| 2 | DeepSeek-V3 | 37/671B | 0 | 16.38 |
| 3 | Qwen2.5-72B-Instruct | 72B | 0 | 11.50 |
| 4 | Mistral-Large-Instruct-2411 | 123B | 0 | 10.54 |
| 5 | Qwen2.5-Coder-32B-Instruct | 32B | 0 | 9.48 |
| 6 | Llama-3.1-70B-Instruct | 70B | 0 | 7.99 |
| 7 | Codestral-22B-v0.1 | 22B | 0 | 5.08 |
| 8 | Skywork-o1-Open-Llama-3.1-8B | 8B | 1 | 5.06 |
| 9 | Mixtral-8x22B-Instruct-v0.1 | 22/176B | 0 | 4.27 |
Citation
@article{yang2025probench,
  title={ProBench: Benchmarking Large Language Models in Competitive Programming},
  author={Yang, Lei and Jin, Renren and Shi, Ling and Peng, Jianxiang and Chen, Yue and Xiong, Deyi},
  journal={arXiv preprint arXiv:2502.20868},
  year={2025}
}