RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs
Pei-Lin Li1, Xinjie Lin1, Jinnian Zhang2, Xin-Sheng Chen1, Yi Zhang1
Kiyohiro Nakayama3, Zhengyang Geng4, Houwen Peng2, Han Hu2, Shi-Min Hu1
1 Tsinghua University, 2 Tencent Hunyuan X, 3 Stanford University, 4 Carnegie Mellon University
R denotes reasoning, and V denotes vision-indispensable.
RBench-V spans four categories: math, physics, counting, and game.
It features 803 questions centered on multi-modal outputs,
which require image manipulation,
such as generating novel images or constructing auxiliary lines to support the reasoning process.
Comparison of RBench-V with other benchmarks such as MMLU and MMMU.
RBench-V assesses models' ability to generate multi-modal outputs during visual reasoning.
Solving problems in RBench-V requires producing outputs beyond text,
such as drawing geometric figures (top-right example) or tracing a path through a maze (bottom-right example).
Introduction
The rapid advancement of native multi-modal models and omni-models, exemplified by GPT-4o, Gemini, and o3 with their capability to process and generate content across modalities such as text and images, marks a significant milestone in the evolution of intelligence. Systematic evaluation of their multi-modal output capabilities in the visual thinking process (also known as multi-modal chain of thought, M-CoT) has therefore become critically important. However, existing benchmarks for evaluating multi-modal models primarily assess multi-modal inputs and text-only reasoning processes, while neglecting the importance of reasoning through multi-modal outputs.
In this paper, we present a benchmark, dubbed RBench-V, designed to assess models' multi-modal reasoning. To construct RBench-V, we carefully hand-pick 803 questions covering math, physics, counting, and games. Unlike problems in previous benchmarks, which typically specify certain input modalities, RBench-V presents problems centered on multi-modal outputs, which require image manipulation, such as generating novel images and constructing auxiliary lines to support the reasoning process.
We evaluate numerous open- and closed-source models on RBench-V, including o3, Gemini 2.5 Pro, Qwen2.5-VL, and others. Even the best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below the human score of 82.3%, showing that current models struggle to leverage multi-modal reasoning.
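As a rough illustration of how such an accuracy evaluation can be scripted, the sketch below loops over benchmark questions, queries a model, and aggregates per-category and overall accuracy. The JSON field names ("image", "question", "answer", "category") and the `query_model` helper are assumptions made for illustration only; this is not the official RBench-V evaluation harness.

```python
import json

def query_model(image_path: str, question: str) -> str:
    """Placeholder: send the image and question to a multi-modal model
    and return its final answer as a string (hypothetical helper)."""
    raise NotImplementedError

def evaluate(benchmark_file: str) -> dict:
    # Assumed format: a JSON list of {"image", "question", "answer", "category"}.
    with open(benchmark_file, "r", encoding="utf-8") as f:
        questions = json.load(f)

    correct, total = {}, {}
    for q in questions:
        pred = query_model(q["image"], q["question"])
        cat = q["category"]  # "math", "physics", "counting", or "game"
        total[cat] = total.get(cat, 0) + 1
        correct[cat] = correct.get(cat, 0) + int(pred.strip() == q["answer"].strip())

    per_category = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())
    return {"overall": overall, **per_category}
```

In practice, answer matching for open-ended multi-modal outputs is more involved than the exact string comparison shown here; the sketch only conveys the overall accuracy bookkeeping.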
Leaderboard for Open- and Closed-Source Models
All scores are accuracy (%).
| Model | Source | Overall | w/o Math | Math | Physics | Counting | Game |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Human Expert | / | 82.3 | 81.7 | 84.7 | 69.4 | 81.0 | 89.1 |
| DreamPRM-1.5 (GPT-5-mini)* 🥇 | Link | 31.3 | 26.0 | 50.0 | 38.9 | 24.6 | 19.6 |
| GPT-5-mini 🥈 | Link | 27.9 | 22.2 | 48.3 | 31.8 | 22.6 | 16.4 |
| OpenAI o3 🥉 | Link | 25.8 | 19.5 | 48.3 | 20.4 | 22.1 | 17.1 |
| OpenAI o4-mini | Link | 20.9 | 14.6 | 43.2 | 12.7 | 17.4 | 13.8 |
| Gemini 2.5 pro-preview-0506 | Link | 20.2 | 13.9 | 42.6 | 9.6 | 19.0 | 12.7 |
| Doubao-1.5-thinking-pro-m | Link | 17.1 | 11.0 | 38.6 | 13.4 | 9.7 | 10.5 |
| OpenAI o1 | Link | 16.2 | 11.0 | 34.7 | 5.7 | 12.3 | 13.1 |
| Doubao-1.5-vision-pro | Link | 15.6 | 11.5 | 30.1 | 8.9 | 12.8 | 12.0 |
| OpenAI GPT-4o-20250327 | Link | 14.1 | 11.2 | 24.4 | 3.2 | 13.3 | 14.2 |
| OpenAI GPT-4.1 | Link | 13.6 | 11.7 | 20.5 | 5.7 | 11.3 | 15.3 |
| Step-R1-V-Mini | Link | 13.2 | 8.8 | 29.0 | 6.4 | 10.3 | 9.1 |
| OpenAI GPT-4.5 | Link | 12.6 | 11.0 | 18.2 | 2.5 | 11.8 | 15.3 |
| Claude-3.7-sonnet | Link | 11.5 | 9.1 | 19.9 | 3.8 | 8.7 | 12.4 |
| JT-VL-Chat-Thinking-20251015 | Link | 11.1 | 8.3 | 21.6 | 1.9 | 9.2 | 10.9 |
| QVQ-Max | Link | 11.0 | 8.1 | 21.0 | 5.7 | 6.2 | 10.9 |
| Qwen2.5VL-72B | Link | 10.6 | 9.2 | 15.3 | 3.8 | 6.2 | 14.5 |
| InternVL-3-38B | Link | 10.0 | 7.2 | 20.5 | 0.6 | 5.1 | 12.4 |
| Qwen2.5VL-32B | Link | 10.0 | 6.4 | 22.7 | 2.5 | 4.1 | 10.2 |
| MiniCPM-2.6-o | Link | 9.7 | 7.5 | 17.6 | 1.3 | 3.6 | 13.8 |
| Llama4-Scout (109B MoE) | Link | 9.5 | 6.9 | 18.8 | 3.2 | 4.1 | 10.9 |
| MiniCPM-2.6-V | Link | 9.1 | 7.2 | 15.9 | 1.3 | 6.2 | 11.3 |
| LLaVA-OneVision-72B | Link | 9.0 | 8.9 | 9.1 | 4.5 | 4.6 | 14.5 |
| DeepSeek-VL2 | Link | 9.0 | 7.0 | 15.9 | 0.6 | 5.6 | 11.6 |
| LLaVA-OneVision-7B | Link | 8.5 | 6.8 | 14.2 | 2.5 | 4.6 | 10.9 |
| Qwen2.5VL-7B | Link | 8.3 | 7.0 | 13.1 | 2.5 | 3.6 | 12.0 |
| InternVL-3-8B | Link | 8.2 | 6.0 | 15.9 | 1.9 | 5.6 | 8.7 |
| InternVL-3-14B | Link | 8.0 | 7.0 | 11.4 | 1.3 | 5.1 | 11.6 |
| Qwen2.5-Omni-7B | Link | 7.7 | 4.5 | 11.4 | 1.9 | 2.1 | 7.7 |
* Results obtained under "Best-of-4 + PRM selection": for each test instance, four reasoning trajectories are generated, and the Process Reward Model (PRM) selects the most coherent one.
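For readers curious what this protocol looks like in code, here is a minimal, hypothetical sketch of best-of-N selection with a process reward model. The `generate_trajectory` and `prm_score` callables are placeholders standing in for the base model's sampler and the PRM scorer; they are not part of any released DreamPRM or RBench-V code.

```python
from typing import Callable, List

def best_of_n(
    question: dict,
    generate_trajectory: Callable[[dict], str],  # samples one reasoning trajectory
    prm_score: Callable[[dict, str], float],     # PRM score for a trajectory
    n: int = 4,
) -> str:
    """Sample n reasoning trajectories and return the one the PRM scores highest."""
    trajectories: List[str] = [generate_trajectory(question) for _ in range(n)]
    scores = [prm_score(question, t) for t in trajectories]
    best_idx = max(range(n), key=lambda i: scores[i])
    return trajectories[best_idx]
```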
RBench-V
Examples
Examples of o3's responses to math and game questions in RBench-V:
Left: o3 solves a math question in RBench-V by converting the geometry problem into algebra using coordinates, unlike humans, who reason geometrically.
Right: o3 fails a game question by not following the instructions to draw the required connections, as highlighted in blue.
BibTeX
@inproceedings{guo2025rbenchv,
  title={RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs},
  author={Meng-Hao Guo and Xuanyu Chu and Qianrui Yang and Zhe-Han Mo and Yiqing Shen and
          Pei-Lin Li and Xinjie Lin and Jinnian Zhang and Xin-Sheng Chen and Yi Zhang and
          Kiyohiro Nakayama and Zhengyang Geng and Houwen Peng and Han Hu and Shi-Min Hu},
  year={2025},
  eprint={2505.16770},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.16770},
}