SITE: towards Spatial Intelligence Thorough Evaluation
Wenqi Wang, Reuben Tan, Pengyue Zhu, Jianwei Yang, Zhengyuan Yang2, Lijuan Wang2, Andrey Kolobov2, Jianfeng Gao2, Boqing Gong1
(To appear at ICCV 2025)
Introduction
Spatial intelligence (SI) is a cognitive ability encompassing the visualization and manipulation of, and reasoning about, spatial relationships, underpinning disciplines from neuroscience to robotics. We introduce SITE, a benchmark dataset towards Spatial Intelligence Thorough Evaluation in a standardized multi-choice visual question-answering format, designed to assess large vision-language models' spatial intelligence across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). Our approach to curating the benchmark combines a bottom-up survey of 31 existing datasets with a top-down strategy drawing upon three classification systems in cognitive science, which prompted us to design two novel types of tasks about view-taking and dynamic scenes. Extensive experiments reveal that leading models fall behind human experts, especially in spatial orientation, a fundamental SI factor. Moreover, we demonstrate a positive correlation between a model's spatial reasoning proficiency and its performance on an embodied AI task.
Leaderboard on SITE benchmark
Chance-Adjusted Accuracy* scores on the SITE benchmark (8,068 examples).
| Model | Overall | Count | Loc | 3D Inf | MultiV | Rel | Mov |
|---|---|---|---|---|---|---|---|
| Random | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| **Tiny Subset** | | | | | | | |
| Human | 67.5 | 66.0 | 83.3 | 54.7 | 87.5 | 73.0 | 52.5 |
| InternVL-2.5-8B | 34.3 | 48.5 | 46.8 | 9.32 | 8.51 | 45.6 | 23.7 |
| GPT-4o | 35.6 | 42.4 | 51.2 | 11.0 | 17.8 | 42.7 | 19.5 |
| **Open-source** | | | | | | | |
| InternVL-2.5-8B | 32.8 | 47.1 | 37.0 | 23.2 | 9.05 | 47.6 | 28.7 |
| Qwen2.5-VL-7B | 31.4 | 52.6 | 44.1 | 9.42 | 1.08 | 51.5 | 18.9 |
| LLAVA-OV-7B | 30.2 | 51.8 | 38.5 | 22.4 | 9.40 | 55.3 | 9.18 |
| Qwen2.5-VL-3B | 29.5 | 45.6 | 37.5 | 13.2 | 7.14 | 45.6 | 18.8 |
| InternVL-2.5-4B | 29.4 | 47.9 | 32.9 | 11.4 | 3.94 | 47.2 | 22.9 |
| Phi-3.5-Vision | 21.8 | 33.2 | 34.0 | 11.7 | 3.33 | 32.8 | 11.7 |
| LLAVA-OV-0.5B | 18.4 | 28.0 | 32.3 | 5.67 | 3.77 | 30.6 | 4.70 |
| **Proprietary** | | | | | | | |
| GPT-4o | 37.8 | 44.6 | 56.0 | 26.9 | 22.0 | 54.6 | 18.4 |
| Gemini-1.5-Pro | 32.5 | 48.0 | 45.8 | 25.3 | 5.33 | 48.8 | 18.4 |
Chance-Adjusted Accuracy*: the chance level is subtracted from the raw accuracy score and the result rescaled, so that 0 means "just chance" and the maximum means "perfect".
Spatial Intelligence Categories: Count: Counting and Existence, Loc: Localization and Positioning, 3D Inf: 3D Information Understanding,
MultiV: Multi-View and Cross-Image Reasoning, Rel: Spatial Relationship Reasoning, Mov: Movement Prediction and Navigation
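The footnote above can be sketched in code. This is a minimal illustration, assuming the common normalization (raw − chance) / (1 − chance); the exact rescaling used by the SITE leaderboard may differ, and the leaderboard reports scores on a 0–100 scale.

```python
def chance_adjusted_accuracy(raw_acc: float, num_choices: int) -> float:
    """Rescale raw accuracy so that 0 means chance level and 1 means perfect.

    Assumes the common normalization (raw - chance) / (1 - chance);
    this is an illustrative sketch, not necessarily SITE's exact formula.
    """
    chance = 1.0 / num_choices
    return (raw_acc - chance) / (1.0 - chance)

# e.g. 62.5% raw accuracy on 4-choice questions:
# chance = 0.25, adjusted = (0.625 - 0.25) / 0.75 = 0.5
print(chance_adjusted_accuracy(0.625, 4))
```

Under this normalization, a model that answers at chance level scores exactly 0 regardless of how many answer options each question has, which makes tasks with different numbers of choices comparable.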
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.
Examples
Two examples for each spatial intelligence category.
Multi-View and Cross-Image Reasoning
Spatial Relationship Reasoning
Movement Prediction and Navigation
Counting and Existence
Localization and Positioning
3D Information Understanding
New Tasks
Ego-Exo view association tasks. The goal of this task is to pick the correct exocentric view given the egocentric view of a visual scene, or vice versa.
Ego-Exo frame reordering tasks. Given the start and end frames of a video clip in an egocentric view, and four randomly shuffled frames from the same clip in exocentric views, the model is tasked with reordering the four shuffled frames into their correct temporal sequence (or vice versa).
Statistics
Left: Key statistics of the SITE benchmark.
Right: Our final benchmark category distribution.
Different models' performance at a glance.
Comparison
SITE vs. similar efforts on benchmarking spatial intelligence.
Correlation between different benchmarks and robotic manipulation on LIBERO-Spatial. To assess the downstream utility of our benchmark in embodied tasks, we collect performance scores from four vision-language models (the Qwen2.5-VL and InternVL-2.5 series) on various benchmarks, along with their corresponding performance on LIBERO-Spatial when used as VLA backbones. Bold numbers indicate the higher correlation.
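The correlation analysis above can be sketched as follows. The four SITE overall scores come from the leaderboard table (Qwen2.5-VL-3B/7B, InternVL-2.5-4B/8B); the LIBERO-Spatial success rates here are made-up placeholder values for illustration only, and Pearson correlation is assumed as the statistic.

```python
import numpy as np

# SITE overall scores for the four open-source models in the leaderboard:
# Qwen2.5-VL-3B, Qwen2.5-VL-7B, InternVL-2.5-4B, InternVL-2.5-8B.
site_scores = np.array([29.5, 31.4, 29.4, 32.8])

# Hypothetical LIBERO-Spatial success rates (placeholder values, NOT from
# the paper) for the same models used as VLA backbones.
libero_success = np.array([54.0, 62.0, 53.0, 66.0])

def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation coefficient between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

r = pearson_r(site_scores, libero_success)
print(f"benchmark vs. LIBERO-Spatial correlation: {r:.3f}")
```

Repeating this computation for each benchmark's scores against the same LIBERO-Spatial column yields the per-benchmark correlations compared in the table.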
BibTeX
@misc{wang2025sitespatialintelligencethorough,
title={SITE: towards Spatial Intelligence Thorough Evaluation},
author={Wenqi Wang and Reuben Tan and Pengyue Zhu and Jianwei Yang and Zhengyuan Yang and Lijuan Wang and Andrey Kolobov and Jianfeng Gao and Boqing Gong},
year={2025},
eprint={2505.05456},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.05456},
}