N3D-VLM is a unified vision-language model for native 3D grounding and 3D spatial reasoning. By grounding objects natively in 3D, our model enables precise spatial reasoning, allowing users to query object relationships, distances, and attributes directly within complex 3D environments.
Updates
2025/12/19: We released this repo with the pre-trained model and inference code.
We provide three inference examples for N3D-VLM. The source files are in the data directory, where the *.jpg files are the source images and the *.npz files are the monocular point clouds obtained with MoGe2.
# inference
python demo.py
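Before running the demo, the provided inputs can be inspected with NumPy and Pillow. This is a minimal sketch, assuming each *.jpg has a matching *.npz with the same base name; it simply prints whatever arrays the point-cloud files contain rather than assuming their names.

# Optional: inspect the provided demo inputs (data/*.jpg + data/*.npz).
# Assumes each image has a point cloud with the same base name.
import glob

import numpy as np
from PIL import Image

for image_path in sorted(glob.glob("data/*.jpg")):
    pcd_path = image_path.rsplit(".", 1)[0] + ".npz"
    image = Image.open(image_path)
    npz = np.load(pcd_path)
    print(image_path, "size:", image.size)
    for name in npz.files:
        print(f"  {name}: shape={npz[name].shape} dtype={npz[name].dtype}")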
Demo 1: rotate-22.mp4
Demo 2: Demo.2.mp4
Demo 3: Demo.3.mp4
After running the command above, the inference results are saved in the outputs directory: generated answers in *.json format and 3D grounding results in *.rrd format.
The *.rrd files can be visualized with Rerun:
rerun outputs/demo1.rrd
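The generated answers can also be loaded programmatically. This is a minimal sketch, assuming only that the *.json files contain valid JSON; their exact schema is not documented here, so the snippet just prints the top-level structure.

# Print the top-level structure of each generated answer file.
# No specific schema is assumed; this only loads and inspects the files.
import glob
import json

for path in sorted(glob.glob("outputs/*.json")):
    with open(path) as f:
        answer = json.load(f)
    print(path, "->", type(answer).__name__)
    if isinstance(answer, dict):
        print("  keys:", list(answer.keys()))
    elif isinstance(answer, list):
        print("  entries:", len(answer))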
If you only want to run 3D detection, check the example below.
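The detection-only entry point itself is not reproduced here; as an illustration, the sketch below logs one of the provided point clouds together with a placeholder 3D box to an .rrd file using the Rerun Python SDK (rerun-sdk). The "points" key name and the box values are assumptions, not the model's actual output; the real detection call lives in demo.py.

# Illustrative only: writes a point cloud plus a placeholder 3D box to an
# .rrd file that can be opened with `rerun`. The "points" key and the
# hard-coded box are assumptions, not real N3D-VLM detections.
import glob

import numpy as np
import rerun as rr

pcd_path = sorted(glob.glob("data/*.npz"))[0]
npz = np.load(pcd_path)
key = "points" if "points" in npz.files else npz.files[0]   # key name assumed
points = np.asarray(npz[key], dtype=np.float32).reshape(-1, 3)
points = points[np.isfinite(points).all(axis=1)]            # drop invalid points

rr.init("n3d_vlm_detection_sketch")
rr.save("outputs/detection_sketch.rrd")                      # write to .rrd
rr.log("world/points", rr.Points3D(points))
rr.log("world/detections", rr.Boxes3D(
    centers=[[0.0, 0.0, 1.0]],        # placeholder, not a real detection
    half_sizes=[[0.3, 0.3, 0.3]],
    labels=["example object"],
))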
Citation
@article{wang2025n3d,
title={N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models},
author={Wang, Yuxin and Ke, Lei and Zhang, Boqiang and Qu, Tianyuan and Yu, Hanxun and Huang, Zhenpeng and Yu, Meng and Xu, Dan and Yu, Dong},
journal={arXiv preprint arXiv:2512.16561},
year={2025}
}