Cordelia Schmid3, Michael J. Black1, Dimitrios Tzionas2
1Max Planck Institute for Intelligent Systems, Tübingen
2University of Amsterdam   3Inria, France
InteractVLM estimates 3D contact points on both human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. We introduce a novel task, Semantic Human Contact, which goes beyond the traditional Binary Human Contact to infer object-specific contacts on bodies. By leveraging the rich visual knowledge of large Vision-Language Models, we address the limited availability of ground-truth 3D interaction data for training, resulting in better generalization to diverse real-world interactions.
| # | Model | Training Datasets | Comment |
|---|---|---|---|
| 1 | interactvlm-3d-hcontact-damon | DAMON | Winner of the RHOBIN Human Contact Challenge (CVPR 2025) |
| 2 | interactvlm-3d-hcontact-wScene | DAMON + LEMON-HU + RICH | Best in-the-wild 3D Human Contact Estimation (with foot-ground contact) |
| 3 | interactvlm-3d-oafford-lemon-piad | LEMON-OBJ + PIAD | Estimates Object Affordance |
| 4 | interactvlm-2d-hcontact | Extended LISA by projecting DAMON contacts onto images | 2D Human Contact Segmentation via Referring Segmentation |
| 5 | interactvlm-3d-hcontact-ocontact | DAMON + LEMON-HU + RICH + LEMON-OBJ + PIAD + PICO + HOI-VQA# | Single Model for Joint 3D Human-Object Contact Estimation |
* The interactvlm-joint-reconstruction model is trained with our new PICO dataset (CVPR 2025), which enables accurate 3D object contact estimation, unlike the object affordances learned from the LEMON-OBJ and PIAD datasets.

# We use the GPT-4o image model to generate HOI-VQA for the DAMON, LEMON, and PIAD images. The script for calling the OpenAI API, the raw data, and the preprocessing scripts are here.
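For illustration only, a minimal sketch of such a GPT-4o vision call via the OpenAI chat completions API could look like the following; the prompt wording, image filename, and output handling are assumptions and not the actual generation script:

```bash
# Minimal sketch of a GPT-4o vision call for HOI-VQA generation
# (prompt text, filename, and output handling are illustrative assumptions)
IMAGE_B64=$(base64 -w0 image.jpg)
curl -s https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Generate question-answer pairs describing the human-object interaction in this image."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$IMAGE_B64"'"}}
      ]
    }]
  }'
```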
- Install Micromamba (if not already installed):
  ```bash
  curl -Ls https://micro.mamba.pm/api/download/linux-64/latest | tar -xvj bin/micromamba
  sudo mv bin/micromamba /usr/local/bin/
  ```
- Create and activate the environment:
  ```bash
  micromamba create -n interactvlm python=3.10 -c conda-forge
  micromamba activate interactvlm
  ```
- Install PyTorch with CUDA 12.1:
  ```bash
  pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
  ```
- Clone the repository:
  ```bash
  git clone https://github.com/saidwivedi/InteractVLM.git
  cd InteractVLM
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  pip install flash-attn --no-build-isolation
  DS_BUILD_FUSED_ADAM=1 pip install deepspeed==0.15.1
  ```
- Set up the environment:
  ```bash
  # Before running demo, training, or evaluation scripts, ensure CUDA is properly configured
  export CUDA_HOME=/usr/local/cuda  # or your CUDA installation path
  export PATH=$CUDA_HOME/bin:$PATH
  export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
  ```
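As an optional sanity check (assuming the steps above completed without errors), you can verify that the CUDA toolchain is visible and that PyTorch sees the GPU:

```bash
# Optional: verify the CUDA toolchain and the PyTorch install
nvcc --version   # reports the CUDA toolkit version found via CUDA_HOME
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# expected output along the lines of: 2.1.0+cu121 12.1 True
```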
```
InteractVLM/
├── model/              # Core model implementation
├── datasets/           # Data loading and processing
├── utils/              # Utility functions
├── preprocess_data/    # Data preprocessing scripts
├── scripts/            # Execution scripts
├── data/               # Dataset folders, body models, demo samples
├── trained_models/     # Trained models
├── train.py            # Main training script
├── evaluate.py         # Main evaluation script
├── run_demo.py         # Run demo
└── requirements.txt    # Python dependencies
```
To run InteractVLM, you need to download essential data files and pre-trained models. We provide a convenient script to handle this process.
- Register for access at https://interactvlm.is.tue.mpg.de/login.php to get your credentials.
- Run the download script:
  ```bash
  bash fetch_data.sh
  ```
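As referenced later in this README, fetch_data.sh also accepts arguments to download specific assets:

```bash
# Download only the scene-aware human contact model (used by the demo below)
bash fetch_data.sh hcontact-wScene

# Download only the preprocessed DAMON dataset (used for training)
bash fetch_data.sh damon-dataset
```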
Run the demo on your own images using either the human or the object interaction estimation mode:

```bash
# For 3D human contact estimation
bash scripts/run_demo.sh hcontact data/demo_samples folder

# For 2D human contact segmentation
bash scripts/run_demo.sh h2dcontact data/demo_samples file

# For 3D object affordance estimation
bash scripts/run_demo.sh oafford data/demo_samples folder
```
Demo Requirements:

- Human Contact Demo: The canonical human mesh and rendered inputs are already provided; simply run the script to estimate 3D contact points on human bodies. We now also support human contact estimation with the scene (e.g., the ground or undefined objects) using the latest released model. Download it by passing the `hcontact-wScene` argument to `fetch_data.sh` and use the same argument when running the demo script. The object name in the image filename serves as the query object for contact estimation (e.g., "bottle" or "chair"). To estimate contact with the scene or ground, use "scene" as the query or prefix the filename with "scene".
- 2D Human Contact Demo: Performs 2D contact segmentation directly on the input image using referring segmentation. This extends LISA's capabilities to human-object contact detection in 2D. The object name in the image filename serves as the query object for contact estimation.
- Object Affordance Demo: The code expects an object mesh as input. The script automatically renders multiple views of the object for affordance prediction.
Input Modes:

The demo supports two input structures:

- Folder-based mode (default): Each sample in its own folder (required for 3D human contact and object affordance).
- File-based mode: All samples as files in a single folder. Supported for:
  - 2D Human Contact (`h2dcontact`): Direct segmentation on input images
  - 3D Human Contact (`hcontact`): Estimating human contact for video frames

Sample Data: The `data/demo_samples/` directory contains ready-to-use samples for testing both human contact and object affordance estimation. One should get the following results:
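To illustrate the filename convention described above, a hypothetical folder-based layout could look like this (the folder and file names are illustrative only; the query object is parsed from the image filename):

```
data/demo_samples/                    # hypothetical layout (folder-based mode)
├── sample_01/bottle_pouring.jpg      # query object: "bottle"
├── sample_02/chair_sitting.jpg       # query object: "chair"
└── sample_03/scene_standing.jpg      # "scene" prefix: contact with the ground/scene
```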
To generate the data needed for training, run the following script. For now, we provide a preprocessed dataset for DAMON; preprocessed data for LEMON, PIAD, and PICO will be released soon.

```bash
# Generate preprocessed data
bash scripts/run_datagen.sh
```
To train 3D Human Contact Estimation on the DAMON dataset, download the preprocessed dataset with the following command and place it under `data/damon`. Then run the training script.

```bash
# Download the preprocessed DAMON dataset
bash fetch_data.sh damon-dataset

# Train human contact with the DAMON dataset
bash scripts/run_train.sh hcontact-damon
```
If you have trained a new model, prepare the weights for evaluation:

```bash
# Prepare weights for model 0 (adjust the number as needed)
bash scripts/run_prepare_weights.sh 0

# Evaluate the model on either DAMON or PIAD; adjust the configuration accordingly
bash scripts/run_eval.sh
```
- 3D Human Contact Estimation - Training, evaluation, and demo code available
- 3D Object Contact/Affordance Estimation - Training, evaluation, and demo code available
- Object Shape Retrieval from Single Image - Code release pending
- Optimization Pipeline for Joint Reconstruction - Code release pending
We thank Alpár Cseke for his assistance with evaluating joint human-object reconstruction. We also thank Tsvetelina Alexiadis and Taylor Obersat for MTurk evaluation, Yao Feng, Peter Kulits, and Markos Diomataris for their valuable feedback, and Benjamin Pellkofer for IT support. SKD is supported by the International Max Planck Research School for Intelligent Systems (IMPRS-IS). The UvA part of the team is supported by an ERC Starting Grant (STRIPES, 101165317, PI: D. Tzionas).
InteractVLM builds upon several excellent open-source projects and datasets:
- LISA - InteractVLM is built on top of this foundational framework
- LEMON, DECO, PIAD, PICO and RICH - For human contact and object affordance data
- Blendify - For rendering
Our optimization pipeline integrates the following repositories:
- OpenShape - For object shape retrieval
- OSX - For SMPLX human pose estimation
- Grounded-SAM - For object detection and segmentation
- Depth Pro - For depth estimation
If you find this code useful for your research, please consider citing the following paper:
@inproceedings{dwivedi_interactvlm_2025,
title = {{InteractVLM}: {3D} Interaction Reasoning from {2D} Foundational Models},
author = {Dwivedi, Sai Kumar and Anti\'c, Dimitrije and Tripathi, Shashank and Taheri, Omid and Schmid, Cordelia and Black, Michael J. and Tzionas, Dimitrios},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2025},
}
This code is available for non-commercial scientific research purposes as defined in the LICENSE file. By downloading and using this code you agree to the terms in the LICENSE. Third-party datasets and software are subject to their respective licenses.
For code-related questions, please contact sai.dwivedi@tuebingen.mpg.de.
For commercial licensing (and all related questions for business applications), please contact ps-licensing@tue.mpg.de.