taxonomy · benchmark · Ability Study
SpatialTree: How Spatial Abilities Branch Out in MLLMs
Yuxi Xiao▲,★,*, Longfei Li♦,★,*, Shen Yan★, Xinhang Liu★, Sida Peng▲, Yunchao Wei♦, Xiaowei Zhou▲, Bingyi Kang★,†
▲Zhejiang University · ★ByteDance Seed · ♦Beijing Jiaotong University
*Equal Contribution †Project Lead
How does spatial intelligence emerge?
L1 Perception
L2 Mental Mapping
L3 Mental Simulation
L4 Spatial Agent
Native perception
Native perception of space.
Language
Mapping spatial concepts to language.
Thinking
Thinking about space in the mind.
Actions
Executing actions in space.
“Each stage builds on the stage before it.”
— Jean Piaget’s cognitive development theory
Overview
We propose SpatialTree, a cognitive-science-inspired taxonomy of spatial intelligence in multimodal large language models (MLLMs), along with SpatialTree-Bench and a proof-of-concept study on improving spatial abilities.
- Dependency Structure: Low-level abilities are relatively orthogonal, whereas higher-level abilities are increasingly compositional and coupled.
- Ability Acquisition: Exhaustive supervised fine-tuning (SFT) over a broad range of low-level perceptual data can maximize cross-level transfer, benefiting higher-level spatial abilities.
- Auto-thinking: Naive post-training with RLVR does not work; explicitly encouraging reasoning at higher levels while suppressing it for low-level perceptual skills yields broad improvements.
Cognitive Hierarchy
We design the SpatialTree taxonomy, inspired by cognitive science, to scaffold spatial abilities across four hierarchical levels.
L4 · Agentic Competence
Converts perception + memory into executable behaviors through the Spatial Action Mapping, covering goal execution, open-world exploration, and manipulation.
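The four-level hierarchy can be written down as a small lookup structure. The sketch below is illustrative only: the level names come from this page, the ability labels are the abbreviated column headers of the Ability Study tables, and the grouping of abilities under levels is inferred from the column order of those tables (an assumption). `Level`, `SPATIAL_TREE`, and `level_of` are hypothetical names, not part of any released code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Level:
    index: int                  # 1 (lowest) to 4 (highest)
    name: str                   # level name as used on this page
    abilities: Tuple[str, ...]  # abbreviated ability labels from the tables


# Assumption: the ability-to-level grouping follows the column order
# of the Ability Study tables (5 + 2 + 2 + 2 columns).
SPATIAL_TREE = (
    Level(1, "Perception", ("Geom.", "Motion", "Rel.", "Local.", "Orient.")),
    Level(2, "Mental Mapping", ("Underst.", "Memory")),
    Level(3, "Mental Simulation", ("Caus. Reas.", "Seq. Plan.")),
    Level(4, "Agentic Competence", ("Goal Exec.", "Open Expl.")),
)


def level_of(ability: str) -> Level:
    """Return the level that an ability label belongs to."""
    for level in SPATIAL_TREE:
        if ability in level.abilities:
            return level
    raise KeyError(f"unknown ability: {ability!r}")


if __name__ == "__main__":
    print(level_of("Seq. Plan.").name)
```

A flat tuple of levels keeps the ordering explicit, which is convenient when a later step needs to treat "low-level" and "high-level" abilities differently.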
Data Samples
SpatialTree-Bench
SpatialTree-Bench measures 16 closed/open MLLMs across L1–L4 with balanced weighting, revealing orthogonal perception atoms and tightly coupled higher layers.
Distribution of Benchmark Categories
| Model | Avg. | Perception | Mental Mapping | Mental Simulation | Agentic Competence |
|---|---|---|---|---|---|
| Gemini 2.5 Pro | 50.1 | 49.3 | 53.8 | 51.3 | 45.8 |
| Gemini 2.5 Pro NT | 41.4 | 38.8 | 46.9 | 42.4 | 37.6 |
| Qwen3VL-235B | 40.0 | 34.6 | 46.6 | 39.9 | 38.9 |
| Gemini 2.5 Flash | 39.0 | 38.3 | 46.3 | 37.8 | 33.5 |
| GLM4.5V | 36.0 | 35.0 | 34.4 | 36.3 | 38.3 |
| Gemini 2.5 Flash NT | 35.8 | 33.5 | 41.6 | 31.4 | 36.7 |
| Seed1.6-Vision | 35.7 | 36.3 | 36.6 | 35.4 | 34.6 |
| Qwen3VL-30B | 35.3 | 30.2 | 40.9 | 36.7 | 33.4 |
| Qwen 2.5VL-72B | 33.0 | 28.8 | 37.8 | 34.6 | 31.3 |
| GPT-4o | 31.9 | 26.8 | 34.9 | 33.2 | 32.5 |
| Qwen 2.5VL-32B | 27.9 | 24.0 | 26.8 | 34.3 | 26.4 |
| Qwen 2.5VL-7B | 27.5 | 20.9 | 28.9 | 32.4 | 27.8 |
| Kimi-VL-A3B | 24.4 | 20.3 | 25.9 | 27.2 | 24.2 |
Hierarchy insight: L1 abilities are largely orthogonal, while L3 and L4 scores are strongly correlated.
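As a sanity check on the "balanced weighting" mentioned above, the reported Avg. column appears consistent with an unweighted mean of the four level scores. The sketch below recomputes that mean from the leaderboard values; the interpretation of "balanced" as a plain mean is an assumption, and `balanced_average` and `max_deviation` are hypothetical helper names.

```python
# Leaderboard rows copied from the table above:
# model -> (Avg., Perception, Mental Mapping, Mental Simulation, Agentic Competence)
LEADERBOARD = {
    "Gemini 2.5 Pro": (50.1, 49.3, 53.8, 51.3, 45.8),
    "Gemini 2.5 Pro NT": (41.4, 38.8, 46.9, 42.4, 37.6),
    "Qwen3VL-235B": (40.0, 34.6, 46.6, 39.9, 38.9),
    "Gemini 2.5 Flash": (39.0, 38.3, 46.3, 37.8, 33.5),
    "GLM4.5V": (36.0, 35.0, 34.4, 36.3, 38.3),
    "Gemini 2.5 Flash NT": (35.8, 33.5, 41.6, 31.4, 36.7),
    "Seed1.6-Vision": (35.7, 36.3, 36.6, 35.4, 34.6),
    "Qwen3VL-30B": (35.3, 30.2, 40.9, 36.7, 33.4),
    "Qwen 2.5VL-72B": (33.0, 28.8, 37.8, 34.6, 31.3),
    "GPT-4o": (31.9, 26.8, 34.9, 33.2, 32.5),
    "Qwen 2.5VL-32B": (27.9, 24.0, 26.8, 34.3, 26.4),
    "Qwen 2.5VL-7B": (27.5, 20.9, 28.9, 32.4, 27.8),
    "Kimi-VL-A3B": (24.4, 20.3, 25.9, 27.2, 24.2),
}


def balanced_average(level_scores):
    """Unweighted mean over levels (assumed meaning of 'balanced weighting')."""
    return sum(level_scores) / len(level_scores)


def max_deviation(table):
    """Largest gap between the reported Avg. and the recomputed plain mean."""
    return max(
        abs(balanced_average(levels) - reported)
        for reported, *levels in table.values()
    )


if __name__ == "__main__":
    print(f"max |reported - recomputed| = {max_deviation(LEADERBOARD):.3f}")
```

Small gaps are expected because the per-level scores shown in the table are themselves rounded to one decimal place.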
Ability Study
Ability Dependency Analysis
(Region A + Region B): Low-level abilities are largely orthogonal, while high-level abilities are tightly coupled through integration.
(Region C): Low-level abilities form clear prerequisites that higher-level abilities depend on.
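One common way to probe such dependencies is to correlate scores across models: abilities that rise and fall together are coupled, while near-zero correlation suggests orthogonality. The sketch below computes Pearson correlations between the four level-wise columns of the leaderboard. It is only an illustration of the idea: the analysis described above operates on fine-grained abilities, and the choice of Pearson correlation here is an assumption. `pearson` and `correlation_matrix` are hypothetical helpers.

```python
import math

# Level-wise leaderboard columns, one value per model, in table order.
LEVEL_SCORES = {
    "Perception": [49.3, 38.8, 34.6, 38.3, 35.0, 33.5, 36.3,
                   30.2, 28.8, 26.8, 24.0, 20.9, 20.3],
    "Mental Mapping": [53.8, 46.9, 46.6, 46.3, 34.4, 41.6, 36.6,
                       40.9, 37.8, 34.9, 26.8, 28.9, 25.9],
    "Mental Simulation": [51.3, 42.4, 39.9, 37.8, 36.3, 31.4, 35.4,
                          36.7, 34.6, 33.2, 34.3, 32.4, 27.2],
    "Agentic Competence": [45.8, 37.6, 38.9, 33.5, 38.3, 36.7, 34.6,
                           33.4, 31.3, 32.5, 26.4, 27.8, 24.2],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every ordered pair of column names to its Pearson correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


if __name__ == "__main__":
    for (a, b), r in correlation_matrix(LEVEL_SCORES).items():
        if a < b:  # print each unordered pair once
            print(f"{a} vs {b}: r = {r:.2f}")
```

With only 13 models the estimates are noisy, so a matrix like this is better read as a qualitative picture than as a precise measurement.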
Finding 1
Cross-Ability Transfer: Single-ability L1 SFT induces cross-level transfer, while yielding limited or slightly negative effects on same-level abilities.
Finding 2
Multi-ability Synergy: Jointly training on multiple fundamental abilities yields synergistic gains that far exceed the individual effects of each ability.
| Methods | Avg. | L1 Geom. | L1 Motion | L1 Rel. | L1 Local. | L1 Orient. | L2 Underst. | L2 Memory | L3 Caus. Reas. | L3 Seq. Plan. | L4 Goal Exec. | L4 Open Expl. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 25.0 | 20.9 | 28.6 | 28.9 | 24.2 | 34.2 | 22.6 | 21.7 | 27.2 | 31.7 | 22.1 | 26.5 |
| B+Dist. | 24.5 | 24.1 +3.2 | 26.6 -2.0 | 23.2 -5.8 | 19.6 -4.6 | 34.3 +0.1 | 24.6 +2.0 | 21.8 +0.1 | 26.1 -1.1 | 30.8 -0.9 | 25.5 +3.4 | 26.1 -0.4 |
| B+Corr. | 25.2 | 17.6 -3.2 | 23.9 -4.7 | 30.2 +1.3 | 18.9 -5.3 | 35.6 +1.4 | 21.9 -0.7 | 24.6 +2.9 | 21.8 -5.4 | 33.9 +2.2 | 24.7 +2.6 | 35.9 +9.4 |
| B+Size | 23.5 | 24.3 +3.4 | 22.6 -6.0 | 21.4 -7.5 | 21.7 -2.5 | 34.5 +0.3 | 21.9 -0.8 | 19.2 -2.5 | 23.4 -3.8 | 30.3 -1.5 | 21.5 -0.6 | 24.3 +2.2 |
| B+Dist.+Size+Corr. | 26.1 | 25.5 +4.6 | 29.3 +0.7 | 29.4 +0.5 | 16.4 -7.8 | 33.7 +0.5 | 23.0 +0.4 | 24.2 +2.5 | 25.2 -2.0 | 34.2 +2.5 | 26.0 +3.9 | 28.5 +2.0 |
| B+Dist.+Size+Corr.+Mot. | 27.3 | 28.6 +7.7 | 24.6 -4.0 | 20.6 -8.3 | 26.3 +2.1 | 36.0 +1.8 | 22.2 -0.4 | 22.6 +0.9 | 28.2 +1.0 | 32.8 +1.1 | 23.3 +1.1 | 35.9 +9.4 |
| Baseline+75@(all spat.) | 23.6 | 24.9 +4.0 | 22.6 -6.0 | 25.9 -3.0 | 17.4 -6.8 | 31.2 -3.0 | 22.2 -0.4 | 20.6 -1.1 | 25.7 -1.5 | 30.2 -1.5 | 19.7 -2.4 | 22.8 -3.7 |
SFT Comparisons. "B+Dist.", "B+Corr.", and "B+Size" denote the baseline augmented with distance, correspondence, and size tuning data, respectively. Changes are color-coded as notable gains, neutral influence, and drops.
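The synergy claim in Finding 2 can be illustrated with the Avg. column of the SFT table. One simple way to quantify it (an assumption, not necessarily the authors' metric) is to compare the gain of the combined data mix against the sum of the gains from each single-ability mix. `gain` and `synergy` below are hypothetical helpers.

```python
# Avg. column of the SFT comparison table above.
AVG = {
    "Baseline": 25.0,
    "B+Dist.": 24.5,
    "B+Corr.": 25.2,
    "B+Size": 23.5,
    "B+Dist.+Size+Corr.": 26.1,
}


def gain(method, table=AVG):
    """Change in average score relative to the baseline."""
    return table[method] - table["Baseline"]


def synergy(combined, parts, table=AVG):
    """Combined gain minus the sum of single-ability gains.

    A positive value means the mix helps more than its parts added together.
    This additive definition is an illustrative assumption.
    """
    return gain(combined, table) - sum(gain(p, table) for p in parts)


if __name__ == "__main__":
    parts = ["B+Dist.", "B+Corr.", "B+Size"]
    for p in parts:
        print(f"{p}: {gain(p):+.1f}")
    print(f"combined: {gain('B+Dist.+Size+Corr.'):+.1f}")
    print(f"synergy: {synergy('B+Dist.+Size+Corr.', parts):+.1f}")
```

On these numbers the single-ability gains are -0.5, +0.2, and -1.5, while the combined mix gains +1.1, so the mix improves the average even though two of its three parts lower it on their own.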
Finding 3
Auto-Think Strategy: We uncover a trade-off in RL where excessive reasoning hurts intuitive perception. We propose a hierarchy-aware "Auto-think" strategy that suppresses reasoning for low-level sensing while encouraging it for complex planning.
| Methods | Avg. | L1 Geom. | L1 Motion | L1 Rel. | L1 Local. | L1 Orient. | L2 Underst. | L2 Memory | L3 Caus. Reas. | L3 Seq. Plan. | L4 Goal Exec. | L4 Open Expl. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SFT Baseline | 27.3 | 28.6 | 24.6 | 20.6 | 26.3 | 36.0 | 22.2 | 22.6 | 28.2 | 32.8 | 23.3 | 35.9 |
| L1 RL@think | 26.6 -0.7 | 28.4 +0.9 | 30.0 +5.4 | 30.5 +9.9 | 19.5 -6.8 | 34.5 -1.4 | 24.9 +2.8 | 18.5 -4.1 | 25.7 -2.4 | 34.0 +1.1 | 24.1 +0.8 | 29.6 -6.3 |
| L2 RL@think | 26.7 -0.5 | 24.4 -4.1 | 22.7 -1.9 | 22.3 +1.6 | 17.0 -9.4 | 35.0 -0.9 | 26.6 +4.5 | 16.1 -6.5 | 29.0 +0.8 | 31.4 -1.4 | 25.6 +2.3 | 34.5 -1.4 |
| L3 RL@think | 27.7 +0.4 | 16.2 -12.3 | 24.0 -0.6 | 24.1 +3.5 | 21.4 -4.9 | 38.5 +2.5 | 26.0 +3.9 | 21.8 -0.8 | 31.3 +3.1 | 34.7 +1.9 | 26.5 +3.2 | 38.4 +2.5 |
| L4 RL@think | 28.5 +1.2 | 23.8 -4.8 | 25.3 +0.7 | 22.1 +1.4 | 23.5 -2.8 | 33.9 -2.1 | 25.1 +3.0 | 20.5 -2.2 | 32.0 +3.8 | 34.2 +1.4 | 27.1 +3.9 | 38.8 +2.9 |
| Full RL@think | 30.1 +2.9 | 29.7 +1.1 | 24.7 +0.1 | 27.2 +6.5 | 21.0 -5.3 | 34.8 -1.2 | 27.4 +5.2 | 16.7 -5.9 | 33.6 +5.5 | 37.6 +4.8 | 25.4 +2.1 | 41.7 +5.8 |
| Full RL@auto-think | 30.8 +3.6 | 31.9 +3.3 | 28.6 +4.0 | 22.0 +1.3 | 23.1 -3.2 | 36.8 +0.8 | 28.0 +5.8 | 22.6 -0.1 | 33.5 +5.4 | 35.6 +2.8 | 23.4 +0.1 | 44.1 +8.3 |
RLVR Comparisons. The table compares the SFT baseline with versions enhanced by RL on different tasks. Changes are color-coded as notable gains, neutral influence, and drops.
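The hierarchy-aware idea in Finding 3 can be sketched as a level-dependent adjustment to a verifiable reward: thinking is discouraged on low-level perception tasks and encouraged on high-level ones. Everything specific below is an assumption made for illustration, including the `<think>...</think>` response format, the bonus and penalty values, the choice to leave L2 neutral, and the function names. The page does not specify how Auto-think is actually implemented.

```python
import re

# Assumption: reasoning traces are wrapped in <think>...</think> tags.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def thinking_words(response: str) -> int:
    """Count whitespace-separated words inside the first <think> block."""
    match = THINK_RE.search(response)
    return len(match.group(1).split()) if match else 0


def auto_think_reward(level: int, correct: bool, response: str,
                      bonus: float = 0.1, penalty: float = 0.1) -> float:
    """Hypothetical hierarchy-aware reward.

    level:   1-4, the SpatialTree level of the task
    correct: outcome of the verifiable answer check
    bonus / penalty: illustrative magnitudes, not values from the source
    """
    reward = 1.0 if correct else 0.0
    used_thinking = thinking_words(response) > 0
    if level == 1 and used_thinking:
        reward -= penalty   # suppress reasoning on low-level perception
    elif level >= 3 and used_thinking:
        reward += bonus     # encourage reasoning on high-level tasks
    return reward           # level 2 is left neutral in this sketch


if __name__ == "__main__":
    reply = "<think>the cup is left of the plate</think> left"
    for lvl in (1, 2, 3, 4):
        print(lvl, auto_think_reward(lvl, True, reply))
```

Keeping the correctness term separate from the thinking adjustment makes it easy to vary the level thresholds or the magnitudes without touching the answer check.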
Citation
If you find our project helpful to your research, you can cite us with:
@article{spatialtree2025,
title={SpatialTree: How Spatial Abilities Branch Out in MLLMs},
author={Xiao, Yuxi and Li, Longfei and Yan, Shen and Liu, Xinhang and Peng, Sida and Wei, Yunchao and Zhou, Xiaowei and Kang, Bingyi},
journal={arXiv preprint},
year={2025}
}