HIS-GPT is a large multi-modal foundation model for human-in-scene (HIS) understanding, a new task we propose for understanding human behaviors in 3D scenes. To evaluate this task, we also release HIS-Bench, the first multi-modal benchmark for comprehensively assessing a model's human-in-scene understanding abilities. [Paper]
TODO:
- Upload the training & evaluation code.
- Release the annotations of HIS-Bench and HIS-GPT training data.
- Release the pretrained weights of HIS-GPT.
HIS-Bench data can be downloaded from Hugging Face: this link.
The dataset contains the following components:
- `qas_val`: all question-answering samples of HIS-Bench, divided into separate `.json` files for each sub-task. A data example looks like:

      {
        "task": "activity",
        "index": 0,
        "data_id": "PROX#BasementSittingBooth_00142_01#40.0_50.0",
        "scene_id": "BasementSittingBooth",
        "motion_id": "PROX#BasementSittingBooth_00142_01#40.0_50.0",
        "qa": [{"question": "What is the person doing initially?", "answer": "He sits at a table."}]
      }

- `pcd_all`: the 3D point cloud data for every 3D scene in HIS-Bench, named `<scene_id>.pth`.
- `motion_tokens`: the token ids for each 3D motion in HIS-Bench, extracted by M3GPT, named `<motion_id>.npy`.
- `motion_trajs`: the 2D trajectories for each 3D motion in HIS-Bench, named `<data_id>.npy`.
- `hisbench_mask3d_uni3d_feats.pt`: the 3D scene representations of HIS-Bench, extracted by Uni3D; these can be used directly for HIS-GPT inference.
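For reference, here is a minimal loading sketch in Python (not part of the released code); the sub-task file name `qas_val/activity.json` and the relative paths are assumptions based on the layout described above.

```python
import json
import numpy as np
import torch

# Load one sub-task's QA file (one .json per sub-task; file name assumed).
with open("qas_val/activity.json") as f:
    qa_samples = json.load(f)
sample = qa_samples[0]

# Fetch the matching scene point cloud, M3GPT motion tokens, and 2D trajectory.
scene_pcd = torch.load(f"pcd_all/{sample['scene_id']}.pth", map_location="cpu")
motion_ids = np.load(f"motion_tokens/{sample['motion_id']}.npy")
traj_2d = np.load(f"motion_trajs/{sample['data_id']}.npy")

print(sample["qa"][0]["question"], "->", sample["qa"][0]["answer"])
```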
For instructions on evaluating models on HIS-Bench, see EVALUATION.md.
Environment Setup
conda create -n hisgpt python=3.10
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
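Optionally, a quick sanity check (a minimal sketch, assuming the install above succeeded) that the CUDA 11.8 build of PyTorch is active:

```python
import torch

# Verify the PyTorch build installed above is visible and CUDA-enabled.
print(torch.__version__)          # expected: 2.2.1+cu118
print(torch.cuda.is_available())  # should print True on a CUDA-capable machine
```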
Data Preparation
Download the HIS-GPT training data from here.
Put all the data under the `./annotations` directory and unzip the `.zip` files in the subdirectories. You will get a directory with the following contents:
annotations
├── scannet_mask3d_uni3d_feats.pt # 3D scene representations for ScanNet scenes (used by HUMANISE and SceneVerse)
├── scannet_mask3d_train_attributes.pt # 3D scene attributes for ScanNet scenes (used by HUMANISE and SceneVerse)
├── trumans_mask3d_uni3d_feats.pt # 3D scene representations for TRUMANS scenes
├── trumans_mask3d_train_attributes.pt # 3D scene attributes for TRUMANS scenes
├── m3gpt_t2m_motion_embeds.pt # embedding vectors for human motions
├── humanise/trumans # annotations for human-in-scene data
│   ├── qas_pt_v1 # HUMANISE captions for pre-training
│   ├── qas_train_v1 # HUMANISE QA data for instruction tuning
│   ├── motion_tokens # tokens for 3D human motions
│   └── motion_trajs # trajectories for 3D human motions
├── sceneverse # annotations for SceneVerse (scene-only) data
└── motionx # annotations for HumanML3D (motion-only) data
For the 3D scene and 3D human motion data, we pre-extract latent embeddings with the corresponding encoders (to save storage). That is, the features and attributes in the provided annotations are fed directly into the projection layers and the large language model when you run the training code.
Note: If you want to extract the 3D scene features (`..._uni3d_feats.pt` and `..._train_attributes.pt`) from raw data yourself, you can refer to this guidance.
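For orientation, here is a minimal sketch of peeking into the pre-extracted feature files with `torch.load`; the internal layout (e.g. a dict keyed by scene id) is an assumption, not documented behavior.

```python
import torch

# Inspect the pre-extracted scene features/attributes (layout assumed; adjust to what you find).
scene_feats = torch.load("annotations/scannet_mask3d_uni3d_feats.pt", map_location="cpu")
scene_attrs = torch.load("annotations/scannet_mask3d_train_attributes.pt", map_location="cpu")

print(type(scene_feats))
if isinstance(scene_feats, dict):
    key = next(iter(scene_feats))
    print(key, getattr(scene_feats[key], "shape", scene_feats[key]))
```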
Model Preparation
Download vicuna-7b-v1.5, which is the model we will use as the pre-trained LLM.
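If you do not already have a local copy, one possible way to fetch the weights is via `huggingface_hub` (the repo id `lmsys/vicuna-7b-v1.5` and the target directory below are assumptions; any local vicuna-7b-v1.5 checkpoint works):

```python
from huggingface_hub import snapshot_download

# Download vicuna-7b-v1.5 to a local directory; point llama_model_path (below) to it.
local_path = snapshot_download(
    repo_id="lmsys/vicuna-7b-v1.5",            # assumed Hugging Face repo id
    local_dir="./checkpoints/vicuna-7b-v1.5",  # illustrative target directory
)
print(local_path)
```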
- Configurations before training:
  - Set `llama_model_path` in `scripts/human_scene_pt.sh` and `scripts/human_scene_it.sh` to your own vicuna-7b-v1.5 path.
  - Set `output_dir` to your own output directory.
Step 1: Multi-modal pre-training:
bash scripts/human_scene_pt.sh
Step 2: Human-in-scene instruction tuning:
bash scripts/human_scene_it.sh
See VL_GENERATION.md.
If you find our paper useful, please consider citing:
@misc{zhao2025hisgpt3dhumaninscenemultimodal,
title={HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding},
author={Jiahe Zhao and Ruibing Hou and Zejie Tian and Hong Chang and Shiguang Shan},
year={2025},
eprint={2503.12955},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.12955},
}
This code implementation is based on Chat-Scene and M3GPT. Thanks to their awesome work!
