- We propose a perception-guided 2D prompting strategy, Struct2D Prompting, and conduct a detailed zero-shot analysis showing that MLLMs can perform 3D spatial reasoning from structured 2D inputs alone (see the illustrative sketch below).
- We introduce Struct2D-Set, a large-scale instruction tuning dataset with automatically generated, fine-grained QA pairs covering eight spatial reasoning categories grounded in 3D scenes.
- We fine-tune an open-source MLLM that achieves competitive performance across several spatial reasoning benchmarks, validating the real-world applicability of our framework.
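
As a rough illustration of the prompting idea, the minimal sketch below assembles a 2D projection of a 3D scene together with textual object metadata into a single multimodal chat prompt. The file name, object fields, helper functions, and message schema are illustrative assumptions (not the released pipeline); adapt them to the API of whichever MLLM you query.

```python
# Minimal sketch of a Struct2D-style prompt: a rendered 2D view of a 3D scene
# paired with textual per-object metadata and a spatial question.
# All names below (encode_image, build_struct2d_prompt, the object dict keys,
# and the message schema) are illustrative assumptions, not the official code.

import base64
from pathlib import Path


def encode_image(path: str) -> str:
    """Return the image as a base64 string; empty if the file is absent."""
    p = Path(path)
    if not p.exists():
        return ""  # placeholder so the sketch runs without real data
    return base64.b64encode(p.read_bytes()).decode("utf-8")


def build_struct2d_prompt(image_path: str, objects: list[dict], question: str) -> list[dict]:
    """Combine a projected scene image, object metadata, and a question
    into a generic multimodal chat message."""
    object_lines = "\n".join(
        f"[{o['id']}] {o['label']} at (x={o['x']:.1f}, y={o['y']:.1f})" for o in objects
    )
    text = (
        "You are given a 2D projection of a 3D scene with marked objects.\n"
        f"Object metadata:\n{object_lines}\n\n"
        f"Question: {question}"
    )
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": encode_image(image_path)},
            {"type": "text", "text": text},
        ],
    }]


if __name__ == "__main__":
    messages = build_struct2d_prompt(
        "scene_topdown.png",  # hypothetical rendered view of the scene
        [{"id": 1, "label": "sofa", "x": 2.3, "y": 1.1},
         {"id": 2, "label": "lamp", "x": 0.4, "y": 3.7}],
        "Which object is closer to the door?",
    )
    print(messages[0]["content"][1]["text"])
```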
If you find Struct2D helpful in your research, please consider citing:
@article{zhu2025struct2d,
title={Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs},
author={Zhu, Fangrui and Wang, Hanhui and Xie, Yiming and Gu, Jing and Ding, Tianye and Yang, Jianwei and Jiang, Huaizu},
journal={arXiv preprint arXiv:2506.04220},
year={2025}
}
🙏 Acknowledgement
We thank the authors of GPT4Scene and LLaMA-Factory for inspiring discussions and for open-sourcing their codebases.
About
Code release for 'Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs' (NeurIPS 2025)