This is the official repository for The Quest for Generalizable Motion Generation: Data, Model, and Evaluation.
The repo provides a unified framework for generalizable motion generation, including both modeling and evaluation:
- ViMoGen Model: A Diffusion Transformer for generalizable motion generation, supporting Text-to-Motion (T2M) and Text/Motion-to-Motion (TM2M).
- MBench Benchmark: A comprehensive evaluation benchmark that decomposes motion generation into nine dimensions across three pillars: Motion Generalization, Motion–Condition Consistency, and Motion Quality.
Together, ViMoGen and MBench enable end-to-end research on scalable and reliable motion generation.
Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage.
Motivated by this observation, we present ViMoGen, a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation.
- ViMoGen-228K Dataset: A large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples.
- ViMoGen Model: A flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning.
- MBench Benchmark: A hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability.
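For intuition about the gated multimodal conditioning mentioned above, here is a minimal PyTorch sketch: a learned gate decides, per channel, how much to rely on the text condition versus a reference-motion condition. This is an illustrative toy module with assumed names and dimensions, not the actual ViMoGen architecture.

```python
import torch
import torch.nn as nn

class GatedMultimodalConditioning(nn.Module):
    """Toy illustration of gated fusion of two conditioning signals
    (text and reference motion). Not the actual ViMoGen code."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.text_proj = nn.Linear(dim, dim)
        self.motion_proj = nn.Linear(dim, dim)
        # The gate looks at both modalities and outputs per-channel weights in (0, 1).
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_emb: torch.Tensor, motion_emb: torch.Tensor) -> torch.Tensor:
        t = self.text_proj(text_emb)      # (B, dim)
        m = self.motion_proj(motion_emb)  # (B, dim)
        g = self.gate(torch.cat([t, m], dim=-1))
        return g * t + (1.0 - g) * m      # gated blend of the two priors

# Example: fuse a batch of 4 text and motion condition embeddings.
cond = GatedMultimodalConditioning(dim=512)
fused = cond(torch.randn(4, 512), torch.randn(4, 512))
print(fused.shape)  # torch.Size([4, 512])
```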
- [2025-12-19] We have released the ViMoGen-DiT pretrained weights along with the core inference pipeline.
- [2025-12-18] We have released the ViMoGen-228K Dataset and MBench leaderboard.
- Inference Code: Core inference pipeline is released.
- Pretrained Weights: ViMoGen-DiT weights are available.
- Training System: Training code and ViMoGen-228K dataset release.
- Evaluation Suite: Complete MBench evaluation scripts and data.
- Motion-to-Motion Pipeline: Detailed guide and tools for custom reference motion preparation.
```bash
conda create -n vigen python=3.10 -y
conda activate vigen
```

Install PyTorch with CUDA support. We recommend PyTorch 2.4+ with CUDA 12.1:

```bash
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
```

Or via pip:

```bash
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
```

Then install the remaining Python dependencies:

```bash
pip install -r requirements.txt
```

For better performance, install Flash Attention 2:

```bash
pip install flash-attn --no-build-isolation
```

PyTorch3D is needed for motion rendering and visualization:

```bash
# Option 1: Install from conda (recommended)
conda install pytorch3d -c pytorch3d

# Option 2: Install from source
pip install "git+https://github.com/facebookresearch/pytorch3d.git"
```
To visualize the generated motions, you need to download the SMPL-X model from the official website.

- Register and download `SMPLX_python_v1.1.zip` (Python v1.1.0).
- Extract the contents and place the model files (e.g., `SMPLX_NEUTRAL.npz`) in the following directory:

```
data/body_models/
└── smplx/
    ├── SMPLX_FEMALE.npz
    ├── SMPLX_MALE.npz
    └── SMPLX_NEUTRAL.npz
```
Note: We provide `smplx_root.pt` in `data/body_models/` for coordinate alignment.
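To verify that the body models are in place, you can load them with the `smplx` Python package (`pip install smplx`). This is only a sanity check; ViMoGen's own loading code may differ.

```python
import smplx  # pip install smplx

# Point model_path at the parent folder that contains the smplx/ subdirectory.
model = smplx.create(
    model_path="data/body_models",
    model_type="smplx",
    gender="neutral",
    ext="npz",
)
print("Loaded body model:", type(model).__name__)
```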
Download pretrained models and place them in the `./checkpoints/` directory:
| Model | Description | Download Link / Command |
|---|---|---|
| ViMoGen-DiT-1.3B | Main motion generation model | Google Drive (Save as ./checkpoints/model.pt) |
| Wan2.1-T2V-1.3B | Base text encoder weights | huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./checkpoints/Wan2.1-T2V-1.3B |
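A quick way to confirm the main checkpoint downloaded intact is to load it on CPU and inspect its top-level keys. This assumes a standard `torch.load`-able file; the exact key layout is defined by the release.

```python
import torch

ckpt = torch.load("checkpoints/model.pt", map_location="cpu")
# Depending on how the checkpoint was saved, this is either a state dict
# or a wrapper dict containing one; print the top-level keys either way.
keys = list(ckpt.keys()) if isinstance(ckpt, dict) else [type(ckpt).__name__]
print(f"{len(keys)} top-level keys, e.g. {keys[:5]}")
```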
For evaluation on MBench, you need to download and extract the benchmark data:
- Download `mbench.tar.gz` from Google Drive. This package includes:
  - Reference motions generated by Wan 2.1 and processed by CameraHMR.
  - T5 text embeddings for all prompts.
- Extract to the `./data/` directory:

```bash
tar -xzvf mbench.tar.gz -C ./data/
```
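After extraction, you can sanity-check the layout with a short listing (a hypothetical helper; the exact subdirectory names inside `./data/mbench/` are defined by the release package):

```python
from pathlib import Path

mbench_dir = Path("data/mbench")
assert mbench_dir.is_dir(), "Extract mbench.tar.gz into ./data/ first"

# List the top-level entries to confirm the reference motions and
# T5 text embeddings were unpacked where the configs expect them.
for entry in sorted(mbench_dir.iterdir()):
    print(entry.name)
```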
Repository layout:

```
ViMoGen/
├── checkpoints/              # Model checkpoints
├── configs/                  # Configuration files
│   ├── tm2m_train.yaml       # Training config
│   ├── tm2m_infer.yaml       # TM2M inference config
│   └── t2m_infer.yaml        # T2M inference config
├── data/                     # Data directory
│   ├── mbench/               # MBench benchmark data (download required)
│   ├── meta_info/            # Metadata for training/testing
│   └── body_models/          # SMPL-X models and alignment files
├── data_samples/             # Example data for quick start
├── datasets/                 # Dataset loading utilities
├── models/                   # Model definitions
│   └── transformer/          # DiT transformer models
├── scripts/                  # Shell scripts
├── trainer/                  # Training utilities
├── parallel/                 # Distributed training utilities
└── train_eval_vimogen.py     # Main entry point
```
Generate motion from text prompts:
- Edit prompts: Modify `data_samples/example_archive.json` with your desired text prompts (see the scripted sketch after these steps).
- Extract text embeddings:

  ```bash
  bash scripts/text_encoding.sh
  ```

- Run inference:

  ```bash
  bash scripts/t2m_infer.sh
  ```
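If you prefer to script prompt editing, a sketch like the following can write the archive programmatically. The actual schema of `example_archive.json` is defined by the file shipped in `data_samples/`; the field names below (`id`, `text`) are assumptions, so mirror the bundled example rather than this sketch.

```python
import json

# Hypothetical prompt entries; copy the key names from the shipped
# data_samples/example_archive.json rather than trusting these.
prompts = [
    {"id": "demo_0001", "text": "a person walks forward and waves with the right hand"},
    {"id": "demo_0002", "text": "a person jumps over an obstacle and lands softly"},
]

with open("data_samples/example_archive.json", "w") as f:
    json.dump(prompts, f, indent=2)
print(f"Wrote {len(prompts)} prompts")
```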
Generate motion conditioned on both text and reference motion:
- Prepare reference motion:
  - Option A: Use MBench Benchmark. We provide pre-processed MBench data with stored reference motions in `./data/mbench/` for immediate evaluation.
  - Option B: Custom Preparation (see the Motion-to-Motion Pipeline item in the roadmap above); a generic inspection sketch follows these steps.
- Run inference:

  ```bash
  bash scripts/tm2m_infer.sh
  ```
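To peek at stored reference motions before running TM2M, a generic loader like the one below can help. The file naming and format inside `./data/mbench/` are defined by the release; this snippet only probes common PyTorch/NumPy containers and is not the repo's own loader.

```python
from pathlib import Path
import numpy as np
import torch

def peek(path: Path) -> None:
    """Print the top-level keys or shape of a .pt/.npz/.npy file (generic probe)."""
    if path.suffix == ".pt":
        obj = torch.load(path, map_location="cpu")
        keys = list(obj.keys()) if isinstance(obj, dict) else [type(obj).__name__]
        print(path.name, "->", keys)
    elif path.suffix in (".npz", ".npy"):
        obj = np.load(path, allow_pickle=True)
        keys = list(obj.files) if hasattr(obj, "files") else [obj.shape]
        print(path.name, "->", keys)

# Probe a handful of motion-like files under the extracted benchmark data.
count = 0
for p in sorted(Path("data/mbench").rglob("*")):
    if p.suffix in (".pt", ".npz", ".npy"):
        peek(p)
        count += 1
        if count >= 5:
            break
```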
Explore More SMPLCap Projects
- [TPAMI'25] SMPLest-X: An extended version of SMPLer-X with stronger foundation models.
- [ICML'25] ADHMR: A framework to align diffusion-based human mesh recovery methods via direct preference optimization.
- [ECCV'24] WHAC: World-grounded human pose and camera estimation from monocular videos.
- [CVPR'24] AiOS: An all-in-one-stage pipeline combining detection and 3D human reconstruction.
- [NeurIPS'23] SMPLer-X: Scaling up EHPS towards a family of generalist foundation models.
- [NeurIPS'23] RoboSMPLX: A framework to enhance the robustness of whole-body pose and shape estimation.
- [ICCV'23] Zolly: 3D human mesh reconstruction from perspective-distorted images.
- [arXiv'23] PointHPS: 3D HPS from point clouds captured in real-world settings.
- [NeurIPS'22] HMR-Benchmarks: A comprehensive benchmark of HPS datasets, backbones, and training strategies.
If you find this work useful, please cite our paper:
```bibtex
@article{lin2025questgeneralizablemotiongeneration,
  title={The Quest for Generalizable Motion Generation: Data, Model, and Evaluation},
  author={Jing Lin and Ruisi Wang and Junzhe Lu and Ziqi Huang and Guorui Song and Ailing Zeng and Xian Liu and Chen Wei and Wanqi Yin and Qingping Sun and Zhongang Cai and Lei Yang and Ziwei Liu},
  year={2025},
  journal={arXiv preprint arXiv:2510.26794},
}
```