¹Tsinghua University, ²Peking University, ³Fudan University, ⁴Jilin University, ⁵Microsoft Research Asia, ⁶Hong Kong University of Science and Technology, ⁷Zhejiang University
(*Equal Contribution, †Corresponding Author)
🎉 News
[2025.10] 📢📢 Paper and initial project release.
📝 To-Do List
Release evaluation code
Release the benchmark dataset on HuggingFace
MV-RoboBench
Benchmark Overview: We introduce MV-RoboBench, a benchmark designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic scenes. It contains [Number] question-answer pairs drawn from [Number] diverse robotic scenes and spans [Number] challenging tasks, such as [Task 1 Name], [Task 2 Name], and [Task 3 Name]. These tasks probe complementary aspects of 3D scene understanding, from establishing cross-view object correspondences to reasoning about relative spatial poses.
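Since the dataset has not been released yet, the record below is only an illustrative sketch of what a multi-view QA item might look like; every field name (`images`, `task`, `question`, `choices`, `answer`) and value is an assumption, not the published format.

```json
{
  "id": "mvrb_000123",
  "scene": "tabletop_pick_place_07",
  "images": ["views/front.jpg", "views/wrist.jpg", "views/side.jpg"],
  "task": "cross_view_correspondence",
  "question": "Which object in the wrist-camera view corresponds to the red mug in the front view?",
  "choices": ["A. the red mug", "B. the blue bowl", "C. the green cup", "D. not visible"],
  "answer": "A"
}
```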
📌 A Benchmark for Robotic Scenes: We introduce MV-RoboBench, a comprehensive benchmark designed to evaluate the spatial reasoning of Vision-Language Models in robotic scenes.
📊 Comprehensive Evaluation: We evaluate [Number] state-of-the-art VLMs, including GPT-4o and Claude 3, revealing a significant performance gap compared to human-level reasoning (see the evaluation sketch after this list).
🔍 Revealing Core Challenges: Our analysis pinpoints key failure modes for current models in robotic scene understanding, particularly in cross-view correspondence, relative pose estimation, and action planning.
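The official evaluation code is still on the to-do list above, so the following is a minimal sketch of how a multiple-choice accuracy evaluation over such records typically works, not the MV-RoboBench implementation. `load_benchmark` and `query_vlm` are hypothetical stand-ins for a dataset loader and a model wrapper.

```python
"""Minimal per-task accuracy sketch for a multiple-choice VLM benchmark.

Illustrative only: `load_benchmark` and `query_vlm` are hypothetical
stand-ins, not MV-RoboBench APIs, and the record fields match the
assumed schema shown in the sample record above.
"""
import json
import re
from collections import defaultdict
from pathlib import Path


def load_benchmark(path: str) -> list[dict]:
    """Hypothetical loader: one JSON record per line."""
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]


def extract_choice(response: str) -> str | None:
    """Pull the first standalone choice letter (A-D) out of a free-form model response."""
    match = re.search(r"\b([A-D])\b", response.strip())
    return match.group(1) if match else None


def evaluate(records: list[dict], query_vlm) -> dict[str, float]:
    """Compute per-task accuracy; `query_vlm(images, prompt) -> str` is any VLM wrapper."""
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        prompt = rec["question"] + "\n" + "\n".join(rec["choices"]) + "\nAnswer with a single letter."
        pred = extract_choice(query_vlm(rec["images"], prompt))
        total[rec["task"]] += 1
        correct[rec["task"]] += int(pred == rec["answer"])
    return {task: correct[task] / total[task] for task in total}
```

Reporting accuracy per task (rather than one pooled number) makes the failure modes listed above, such as cross-view correspondence versus relative pose estimation, directly comparable across models.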
Contact
For any questions or suggestions, please feel free to contact Zhiyuan Feng or any of the other authors.