Seeing from Another Perspective:
Evaluating Multi-View Understanding in MLLMs
1UC Berkeley, 2TranscEngram, 3NYU, 4University of Oxford, 5UC Davis, 6HKU
*Equal Contribution
All-Angles Bench
Benchmark Overview: We introduce All-Angles Bench, a benchmark designed to evaluate the multi-view reasoning capabilities of MLLMs, containing 2,132 question-answer pairs carefully annotated across 90 diverse real-world scenes sourced from Ego-Exo4D and EgoHumans. All-Angles Bench comprises six challenging tasks: counting, attribute identification, relative distance, relative direction, manipulation, and camera pose estimation. These question types probe several major aspects of 3D scene understanding, from establishing correspondences between objects across views to reasoning about relative object and camera poses.
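To make the benchmark structure concrete, here is a minimal sketch of what a single item might look like when loaded; the field names, values, and JSON-lines layout are illustrative assumptions and may differ from the released data.

```python
import json

# Hypothetical record layout for one All-Angles Bench item; the actual released
# schema may use different field names.
example_item = {
    "scene_id": "egohumans_volleyball_01",   # one of the 90 real-world scenes
    "task": "counting",                      # one of the six task types
    "views": ["view_1.jpg", "view_2.jpg"],   # multi-view images shown to the model
    "question": "How many people appear across the two views in total?",
    "choices": ["A. 4", "B. 5", "C. 6"],
    "answer": "B",
    "paired_id": "q_0421_rephrased",         # link to its paired question, if any
}

def load_bench(path):
    """Read a JSON-lines file of benchmark items (assumed format)."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```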
How Do We Make All-Angles Bench?
(1) Data Collection & Question Type Design: We curate 90 diverse multi-view scenes and design six tasks to evaluate multi-view reasoning. (2) Question Creation & Human Annotation: Using MLLMs for initial question generation, we refine and validate the questions through human annotation to ensure clarity, correctness, and relevance. (3) Paired-Question Generation & Human Quality Check: To assess cross-view consistency, we systematically rephrase questions or alter their perspectives to generate paired questions while preserving visual correspondences, followed by a final quality check.
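As a toy illustration of the paired-question idea, the sketch below swaps the two entities referenced in a relative-distance question so that the ground-truth answer flips while the underlying visual correspondence stays the same. The real pipeline relies on MLLM-generated rephrasings followed by human review, so this function is only a stand-in for intuition.

```python
# Illustrative only: build a paired question for a relative-distance item by
# swapping the two referenced people, which flips the ground-truth answer.
FLIP = {"yes": "no", "no": "yes"}

def make_paired_item(item, person_a, person_b):
    """Swap the two entities mentioned in the question and flip the answer."""
    paired = dict(item)
    paired["question"] = (
        item["question"]
        .replace(person_a, "<TMP>")
        .replace(person_b, person_a)
        .replace("<TMP>", person_b)
    )
    paired["answer"] = FLIP[item["answer"]]
    return paired

original = {
    "question": "Is the man in red closer to the camera than the woman in blue?",
    "answer": "yes",
}
print(make_paired_item(original, "the man in red", "the woman in blue")["question"])
# -> "Is the woman in blue closer to the camera than the man in red?"
```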
Evaluation on All-Angles Bench
We consolidate performance from both closed-source and open-source MLLM evaluations. Darker gray highlights the top result among all models in each sub-task, while lighter gray marks the second-best result.
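For reference, a minimal sketch of how such a leaderboard can be assembled from raw per-question results: compute each model's accuracy per sub-task, then pick the top two models to highlight. The record fields here are placeholders, not the benchmark's actual output format.

```python
from collections import defaultdict

def per_task_accuracy(records):
    """records: iterable of dicts with 'model', 'task', and boolean 'correct'."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["model"], r["task"])
        hits[key] += r["correct"]
        totals[key] += 1
    return {k: hits[k] / totals[k] for k in totals}

def top_two(acc, task):
    """Return the best and second-best models for one sub-task."""
    ranked = sorted(
        ((model, a) for (model, t), a in acc.items() if t == task),
        key=lambda x: x[1],
        reverse=True,
    )
    return ranked[:2]
```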
As the primary results above show, there remains a substantial performance gap between both closed- and open-source MLLMs and human-level multi-view understanding. We highlight several findings below.
While humans achieve near-perfect accuracy on All-Angles Bench, both open- and closed-source MLLMs struggle. In camera pose estimation, human annotators reach 88.9% accuracy, while top MLLMs like Gemini-2.0-Flash, Qwen2.5-VL-72B, and InternVL2.5-38B trail by more than 50 percentage points. Many open-source models perform worse than random guessing, often failing to align viewpoints or interpret geometric relationships, highlighting a significant gap from human-level reasoning.
Interestingly, Ovis2-34B and Qwen2.5-VL-72B outperform closed-source models like Gemini-2.0 and Claude-3.7-Sonnet in object manipulation and relative direction. Qwen2.5-VL-72B benefits from robust video understanding and fine-grained visual grounding, excelling at tracking object re-orientation across views. The success of open-source models suggests that domain-specific refinements, such as video-focused training, can enhance orientation and geometric reasoning, offering insights for improving multi-view MLLMs.
Paired Q&As' Inconsistency on MLLMs
We classify model responses into three categories: CC (both correct), WW (both wrong), and IC (inconsistent: one correct, one wrong). High IC scores indicate weak multi-view understanding, where simple rewording leads to failure.
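The breakdown can be computed directly from per-pair correctness flags, as in this small sketch; the function name and input format are ours, not from the benchmark code.

```python
from collections import Counter

def consistency_breakdown(pairs):
    """pairs: iterable of (first_correct, second_correct) booleans for each paired Q&A."""
    counts = Counter()
    for a, b in pairs:
        if a and b:
            counts["CC"] += 1          # both answers correct
        elif not a and not b:
            counts["WW"] += 1          # both answers wrong
        else:
            counts["IC"] += 1          # inconsistent: exactly one correct
    total = sum(counts.values())
    return {k: counts[k] / total for k in ("CC", "WW", "IC")}

print(consistency_breakdown([(True, True), (True, False), (False, False)]))
# -> roughly {'CC': 0.33, 'WW': 0.33, 'IC': 0.33}
```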
Evaluating six top MLLMs, we find: 1) GPT-4o has the highest IC score (~70%) on relative distance tasks, while others hover around 40%. 2) All models struggle with relative direction, exceeding 40% IC, showing difficulty with orientation shifts. 3) Gemini-2.0-Flash and Claude-3.7-Sonnet have balanced inconsistency across tasks, whereas Ovis2-34B and GPT-4o show significant task-based variability.
MLLMs Fail with Multi-View Correspondence
While MLLMs often succeed when everyone is visible in a single viewpoint (Complete-Visibility), they sometimes fail to reconcile fragmented information across views (Partial-Visibility); for example, GPT-4o occasionally picks the largest per-view count rather than reconciling people across views.
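A toy example makes the failure mode concrete: under partial visibility, the correct total is the size of the matched identity set across views, not the largest per-view count. Cross-view identity matching is assumed to be given here, which is exactly the step the models struggle with.

```python
# Toy illustration of the counting failure mode under partial visibility.
view_1 = {"p1", "p2", "p3"}   # people visible in view 1
view_2 = {"p3", "p4"}         # people visible in view 2 (p3 appears in both)

naive_answer = max(len(view_1), len(view_2))   # largest per-view count -> 3
reconciled_answer = len(view_1 | view_2)       # correct cross-view count -> 4
print(naive_answer, reconciled_answer)
```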
We evaluate 1) Zero-Shot CoT, 2) Self-Consistency, and 3) Identification CoT on GPT-4o, Ovis2-34B, and InternVL2.5-38B under complete- and partial-view settings.
While CoT improves GPT-4o in partial-visibility cases, its impact is minimal on models already strong in multi-view counting (e.g., InternVL2.5-38B). This suggests that prompt refinement alone is insufficient; specialized multi-view training is needed to excel on All-Angles Bench.
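For concreteness, a minimal sketch of the self-consistency baseline as we understand it: sample several chain-of-thought completions at nonzero temperature and majority-vote the final choice. Here `query_model` is a placeholder for whichever MLLM API is under evaluation, and the exact prompt wording is an assumption.

```python
from collections import Counter

def self_consistency(query_model, images, question, n_samples=5):
    """Sample n chain-of-thought answers and return the most common final choice."""
    prompt = question + "\nLet's think step by step, then answer with a single letter."
    answers = [query_model(images, prompt, temperature=0.7) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```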
MLLMs Fail with Coarse Camera Estimation
While GPT-4o and Gemini-2.0-Flash perform moderately well in single-view scene reconstruction, they struggle with aligning different camera perspectives. Errors in camera pose estimation lead to incorrect directional reasoning, impacting multi-view consistency in MLLMs.
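As a rough illustration of what coarse camera-pose reasoning entails, the sketch below bins the relative yaw between two cameras into four coarse directions. The binning convention and labels are our own assumptions for illustration; the benchmark's actual label construction and answer options may differ.

```python
# Illustrative: derive a coarse relative-pose label from two camera yaw angles (degrees).
def coarse_relative_direction(yaw_a, yaw_b, bins=("front", "right", "back", "left")):
    """Bin the relative yaw from camera A to camera B into four coarse directions."""
    rel = (yaw_b - yaw_a) % 360.0
    idx = int(((rel + 45.0) % 360.0) // 90.0)   # 0: front, 1: right, 2: back, 3: left
    return bins[idx]

print(coarse_relative_direction(10.0, 100.0))   # -> "right"
```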
BibTeX
@article{yeh2025seeing,
title={Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs},
author={Chun-Hsiao Yeh and Chenyu Wang and Shengbang Tong and Ta-Ying Cheng and Rouyu Wang and Tianzhe Chu and Yuexiang Zhai and Yubei Chen and Shenghua Gao and Yi Ma},
journal={arXiv preprint arXiv:2504.15280},
year={2025}
}