MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios within 3D scenes.
MSR3D accepts a 3D point cloud, a text-image interleaved situation description, location, orientation, and a question as multi-modal input. It has stronger situation modeling capability than LEO.
Situation awareness is essential for understanding and reasoning about 3D scenes in embodied AI agents. However, existing datasets and benchmarks for situated understanding suffer from severe limitations in data modality, scope, diversity, and scale.
To address these limitations, we propose Multi-modal Situated Question Answering (MSQA), a large-scale multi-modal situated reasoning dataset, scalably collected leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes. MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios within 3D scenes. We introduce a novel interleaved multi-modal input setting in our benchmark to provide text, images, and point clouds for situation and question description, aiming to resolve the ambiguity of describing situations with single-modality inputs (e.g., text).
Additionally, we devise the Multi-modal Situated Next-step Navigation (MSNN) benchmark to evaluate models' grounding of actions and transitions between situations. Comprehensive evaluations on reasoning and navigation tasks highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling. Experiments on data scaling and cross-domain transfer further demonstrate the effectiveness of leveraging MSQA as a pre-training dataset for developing more powerful situated reasoning models, contributing to advancements in 3D scene understanding for embodied AI.
- [2025-6] We release baseline code for GPT-4o to facilitate the evaluation of other open-source multimodal LLMs. This baseline takes ground-truth object labels, locations, and attributes as the scene input; you can replace the scene input with other modalities such as RGB video frames.
- [2025-2] We release the MSR3D_v2 data and training code. This version removes some ambiguities in the questions and answers.
- [2025-2] We release the script to align the situation viewpoint between SQA3D and MSQA.
- [2025-2] We provide the script to visualize the MSQA/MSNN data, including the situations.
- [2024-10] We released the dataset, which has been structured to facilitate the evaluation of multimodal large language models (MLLMs).
- [2024-9] Our paper is accepted by the NeurIPS 2024 Datasets and Benchmarks Track!
- [2024-9] We released the MSR3D paper. Please check the webpage.
- Clone the GitHub repo.
git clone https://github.com/MSR3D/MSR3D.git
cd MSR3D
- Create a conda environment and install dependencies.
conda create -n msr3d python=3.9
conda activate msr3d
# install PyTorch, take our version for example
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
# install other dependencies with pip
pip install -r requirements.txt
# install peft separately to escape its install_requires
pip install peft==0.5.0 --no-deps
- Install third-party libraries (for the point cloud backbones). Note that if the installation of PointNext fails, you can either 1) comment out the line importing PointNext in model/pcd_backbone.py (a minimal sketch of this guard is given after the sanity check below), or 2) download the compiled file and place it at model/pointnext/cpp/pointnet2_batch/, which may help.
cd modules/third_party/pointnet2
# default PointNet++
python setup.py install
cd ..
cd ..
# sanity check
python -c 'from modules.layers.pointnet import PointNetPP'
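If you go with the first workaround, the sketch below shows one way to guard the PointNext import in model/pcd_backbone.py so the default PointNet++ backbone still works without the compiled extension. The exact import path of PointNext inside that file is an assumption; adapt it to the actual line you would otherwise comment out.
# sketch for model/pcd_backbone.py: make PointNext optional
try:
    from model.pointnext import PointNext  # assumed import path; match the actual line in the file
except ImportError:
    PointNext = None  # PointNext not compiled; fall back to the default PointNet++ backbone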
- Scan data. MSR3D takes object-centric point clouds as input. You can find the point cloud files in the pcds directory. Unzip the point cloud files to the directories ${scan_family_base}, ${rscan_base}, and ${ARkit_base}. A minimal loading sketch is given after the directory structure below.
- Scan data structure
${scan_family_base}  # scannet
├── scan_data
│   ├── pcd_with_global_alignment
│   │   └── ${scan_id}.pth
│   └── instance_id_to_label
│       └── ${scan_id}_inst_to_label.pth
└── annotations/splits/scannetv2_{split}_sort.json  # for MSNN split

${rscan_base}
└── 3RScan-ours-align
    └── ${scan_id}
        ├── pcds.pth
        └── inst_to_label.pth

${ARkit_base}
└── scan_data
    ├── pcd-align
    │   └── ${scan_id}.pth
    └── instance_id_to_label
        └── ${scan_id}_inst_to_label.pth
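As a quick sanity check after unzipping, the sketch below loads one ScanNet-style scan, assuming the .pth files are standard PyTorch-serialized objects; their exact contents are not documented here, so inspect what you get. The path and scan id are placeholders.
import torch

scan_family_base = "/path/to/scan_family_base"  # placeholder, set to your unzipped directory
scan_id = "scene0000_00"                        # placeholder scan id

# object-centric point cloud data for one scan
pcd_data = torch.load(
    f"{scan_family_base}/scan_data/pcd_with_global_alignment/{scan_id}.pth")
# instance id -> semantic label mapping for the same scan
inst_to_label = torch.load(
    f"{scan_family_base}/scan_data/instance_id_to_label/{scan_id}_inst_to_label.pth")

print(type(pcd_data), type(inst_to_label))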
- Object images. We provide the object images in the obj_imgs directory; they are loaded to fill in the image placeholders in the data. Since the raw RGB image files are very large (>500 GB per dataset), we provide only one image for each object. If you have the raw files and want to sample images during training, we also provide an implementation for object image sampling in the dataloader. Unzip the images to the directory ${obj_img_base}.
# object images data structure
${obj_img_base}
├── ScanNet
├── 3RScan
└── ARkit
- Object image sampling. Add 'mv_info' to data_type_list and put the corresponding RGB images under ${scan_family_base}, ${rscan_base}, and ${ARkit_base} to enable image sampling.
# randomly sample one multi-view record (bounding box) for this instance
one_bbox = random.choice(scan_data['mv_info'][int(inst_id)])
# fetch the image corresponding to the sampled record
img_sample = self.scan_data_loader.get_one_img(one_bbox)
- Text annotations. Put the text annotations in the directory ${msr3d_base}. A minimal loading sketch is given after the directory structure below.
# text annotations data structure
${msr3d_base}
├── scannet
│   └── msqa_scannet_{split}.json
├── rscan
│   └── msqa_rscan_{split}.json
└── arkitscenes
    └── msqa_arkitscenes_{split}.json

${msnn_base}
└── scannet
    └── msnn_scannet.json
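The sketch below loads one split of the MSQA annotations following the structure above. The per-entry keys are not spelled out here, so print one entry to see the available fields; the path and split name are placeholders.
import json

msr3d_base = "/path/to/msr3d_base"  # placeholder, set to your annotation directory
split = "val"                       # e.g., train / val / test

with open(f"{msr3d_base}/scannet/msqa_scannet_{split}.json") as f:
    msqa_scannet = json.load(f)

print(len(msqa_scannet))
# inspect one entry to see its fields (e.g., situation, question, answer; exact keys may vary)
first = msqa_scannet[0] if isinstance(msqa_scannet, list) else next(iter(msqa_scannet.values()))
print(first)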
- MSR3D model training:
sh msr3d.sh
- MSR3D evaluation:
sh msr3d_test.sh
- LEO model training:
sh msr3d_leo.sh
- LEO model evaluation:
sh msr3d_leo_test.sh
- Test set, with ground truth multi-view images, object locations and attributes
- Full dataset
- Evaluation code
- Training code
- Baseline code for GPT-4o
If you find our work helpful, please cite:
@article{linghu2024multi,
title={Multi-modal Situated Reasoning in 3D Scenes},
author={Linghu, Xiongkun and Huang, Jiangyong and Niu, Xuesong and Ma, Xiaojian and Jia, Baoxiong and Huang, Siyuan},
journal={Advances in Neural Information Processing Systems},
year={2024}
}

This work is under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.




