This is the official repository for The Quest for Generalizable Motion Generation: Data, Model, and Evaluation.
The repo provides a unified framework for generalizable motion generation, including both modeling and evaluation:
- ViMoGen Model: A Diffusion Transformer for generalizable motion generation, supporting Text-to-Motion (T2M) and Text/Motion-to-Motion (TM2M).
- MBench Benchmark: A comprehensive evaluation benchmark that decomposes motion generation into nine dimensions across three pillars: Motion Generalization, Motion–Condition Consistency, and Motion Quality.
Together, ViMoGen and MBench enable end-to-end research on scalable and reliable motion generation.
Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage.
Motivated by this observation, we present ViMoGen, a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation.
- ViMoGen-228K Dataset: A large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples.
- ViMoGen Model: A flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning.
- MBench Benchmark: A hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability.
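For intuition about the gated multimodal conditioning mentioned above, here is a minimal PyTorch sketch: a learned gate decides, per channel, how much to rely on the text condition versus a reference-motion condition. This is an illustrative toy module with assumed names and dimensions, not the actual ViMoGen architecture.

```python
import torch
import torch.nn as nn

class GatedMultimodalConditioning(nn.Module):
    """Toy illustration of gated fusion of two conditioning signals
    (text and reference motion). Not the actual ViMoGen code."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.text_proj = nn.Linear(dim, dim)
        self.motion_proj = nn.Linear(dim, dim)
        # The gate looks at both modalities and outputs per-channel weights in (0, 1).
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_emb: torch.Tensor, motion_emb: torch.Tensor) -> torch.Tensor:
        t = self.text_proj(text_emb)      # (B, dim)
        m = self.motion_proj(motion_emb)  # (B, dim)
        g = self.gate(torch.cat([t, m], dim=-1))
        return g * t + (1.0 - g) * m      # gated blend of the two priors

# Example: fuse a batch of 4 text and motion condition embeddings.
cond = GatedMultimodalConditioning(dim=512)
fused = cond(torch.randn(4, 512), torch.randn(4, 512))
print(fused.shape)  # torch.Size([4, 512])
```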
- [2025-12-19] We have released the ViMoGen-DiT pretrained weights along with the core inference pipeline.
- [2025-12-18] We have released the ViMoGen-228K Dataset and MBench leaderboard.
- Inference Code: Core inference pipeline is released.
- Pretrained Weights: ViMoGen-DiT weights are available.
- Training System: Training code and ViMoGen-228K dataset release.
- Evaluation Suite: Complete MBench evaluation scripts and data.
- Motion-to-Motion Pipeline: Detailed guide and tools for custom reference motion preparation.
```bash
conda create -n vigen python=3.10 -y
conda activate vigen
```

Install PyTorch with CUDA support. We recommend PyTorch 2.4+ with CUDA 12.1:

```bash
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
```

Or via pip:

```bash
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
```

Then install the remaining Python dependencies:

```bash
pip install -r requirements.txt
```

For better performance, install Flash Attention 2:

```bash
pip install flash-attn --no-build-isolation
```

PyTorch3D is needed for motion rendering and visualization:

```bash
# Option 1: Install from conda (recommended)
conda install pytorch3d -c pytorch3d

# Option 2: Install from source
pip install "git+https://github.com/facebookresearch/pytorch3d.git"
```
To visualize the generated motions, you need to download the SMPL-X model from the official website.

- Register and download `SMPLX_python_v1.1.zip` (Python v1.1.0).
- Extract the contents and place the model files (e.g., `SMPLX_NEUTRAL.npz`) in the following directory:

```
data/body_models/
└── smplx/
    ├── SMPLX_FEMALE.npz
    ├── SMPLX_MALE.npz
    └── SMPLX_NEUTRAL.npz
```
Note: We provide `smplx_root.pt` in `data/body_models/` for coordinate alignment.
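To verify that the body models are in place, you can load them with the `smplx` Python package (`pip install smplx`). This is only a sanity check; ViMoGen's own loading code may differ.

```python
import smplx  # pip install smplx

# Point model_path at the parent folder that contains the smplx/ subdirectory.
model = smplx.create(
    model_path="data/body_models",
    model_type="smplx",
    gender="neutral",
    ext="npz",
)
print("Loaded body model:", type(model).__name__)
```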
Download pretrained models and place them in the `./checkpoints/` directory:
| Model | Description | Download Link / Command |
|---|---|---|
| ViMoGen-DiT-1.3B | Main motion generation model | Google Drive (Save as ./checkpoints/model.pt) |
| Wan2.1-T2V-1.3B | Base text encoder weights | huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./checkpoints/Wan2.1-T2V-1.3B |
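A quick way to confirm the main checkpoint downloaded intact is to load it on CPU and inspect its top-level keys. This assumes a standard `torch.load`-able file; the exact key layout is defined by the release.

```python
import torch

ckpt = torch.load("checkpoints/model.pt", map_location="cpu")
# Depending on how the checkpoint was saved, this is either a state dict
# or a wrapper dict containing one; print the top-level keys either way.
keys = list(ckpt.keys()) if isinstance(ckpt, dict) else [type(ckpt).__name__]
print(f"{len(keys)} top-level keys, e.g. {keys[:5]}")
```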
For evaluation on MBench, you need to download and extract the benchmark data:
- Download `mbench.tar.gz` from Google Drive. This package includes:
  - Reference motions generated by Wan 2.1 and processed by CameraHMR.
  - T5 text embeddings for all prompts.
- Extract to the `./data/` directory:

```bash
tar -xzvf mbench.tar.gz -C ./data/
```
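After extraction, you can sanity-check the layout with a short listing (a hypothetical helper; the exact subdirectory names inside `./data/mbench/` are defined by the release package):

```python
from pathlib import Path

mbench_dir = Path("data/mbench")
assert mbench_dir.is_dir(), "Extract mbench.tar.gz into ./data/ first"

# List the top-level entries to confirm the reference motions and
# T5 text embeddings were unpacked where the configs expect them.
for entry in sorted(mbench_dir.iterdir()):
    print(entry.name)
```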
Repository layout:

```
ViMoGen/
├── checkpoints/              # Model checkpoints
├── configs/                  # Configuration files
│   ├── tm2m_train.yaml       # Training config
│   ├── tm2m_infer.yaml       # TM2M inference config
│   └── t2m_infer.yaml        # T2M inference config
├── data/                     # Data directory
│   ├── mbench/               # MBench benchmark data (download required)
│   ├── meta_info/            # Metadata for training/testing
│   └── body_models/          # SMPL-X models and alignment files
├── data_samples/             # Example data for quick start
├── datasets/                 # Dataset loading utilities
├── models/                   # Model definitions
│   └── transformer/          # DiT transformer models
├── scripts/                  # Shell scripts
├── trainer/                  # Training utilities
├── parallel/                 # Distributed training utilities
└── train_eval_vimogen.py     # Main entry point
```
Generate motion from text prompts:
- Edit prompts: Modify `data_samples/example_archive.json` with your desired text prompts (see the scripted sketch after these steps).
- Extract text embeddings:

  ```bash
  bash scripts/text_encoding.sh
  ```

- Run inference:

  ```bash
  bash scripts/t2m_infer.sh
  ```
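If you prefer to script prompt editing, a sketch like the following can write the archive programmatically. The actual schema of `example_archive.json` is defined by the file shipped in `data_samples/`; the field names below (`id`, `text`) are assumptions, so mirror the bundled example rather than this sketch.

```python
import json

# Hypothetical prompt entries; copy the key names from the shipped
# data_samples/example_archive.json rather than trusting these.
prompts = [
    {"id": "demo_0001", "text": "a person walks forward and waves with the right hand"},
    {"id": "demo_0002", "text": "a person jumps over an obstacle and lands softly"},
]

with open("data_samples/example_archive.json", "w") as f:
    json.dump(prompts, f, indent=2)
print(f"Wrote {len(prompts)} prompts")
```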
Generate motion conditioned on both text and reference motion:
- Prepare reference motion:
  - Option A: Use MBench Benchmark. We provide pre-processed MBench data with stored reference motions in `./data/mbench/` for immediate evaluation.
  - Option B: Custom Preparation (see the Motion-to-Motion Pipeline item in the roadmap above); a generic inspection sketch follows these steps.
- Run inference:

  ```bash
  bash scripts/tm2m_infer.sh
  ```
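To peek at stored reference motions before running TM2M, a generic loader like the one below can help. The file naming and format inside `./data/mbench/` are defined by the release; this snippet only probes common PyTorch/NumPy containers and is not the repo's own loader.

```python
from pathlib import Path
import numpy as np
import torch

def peek(path: Path) -> None:
    """Print the top-level keys or shape of a .pt/.npz/.npy file (generic probe)."""
    if path.suffix == ".pt":
        obj = torch.load(path, map_location="cpu")
        keys = list(obj.keys()) if isinstance(obj, dict) else [type(obj).__name__]
        print(path.name, "->", keys)
    elif path.suffix in (".npz", ".npy"):
        obj = np.load(path, allow_pickle=True)
        keys = list(obj.files) if hasattr(obj, "files") else [obj.shape]
        print(path.name, "->", keys)

# Probe a handful of motion-like files under the extracted benchmark data.
count = 0
for p in sorted(Path("data/mbench").rglob("*")):
    if p.suffix in (".pt", ".npz", ".npy"):
        peek(p)
        count += 1
        if count >= 5:
            break
```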
Explore More SMPLCap Projects
- [TPAMI'25] SMPLest-X: An extended version of SMPLer-X with stronger foundation models.
- [ICML'25] ADHMR: A framework to align diffusion-based human mesh recovery methods via direct preference optimization.
- [ECCV'24] WHAC: World-grounded human pose and camera estimation from monocular videos.
- [CVPR'24] AiOS: An all-in-one-stage pipeline combining detection and 3D human reconstruction.
- [NeurIPS'23] SMPLer-X: Scaling up EHPS towards a family of generalist foundation models.
- [NeurIPS'23] RoboSMPLX: A framework to enhance the robustness of whole-body pose and shape estimation.
- [ICCV'23] Zolly: 3D human mesh reconstruction from perspective-distorted images.
- [arXiv'23] PointHPS: 3D HPS from point clouds captured in real-world settings.
- [NeurIPS'22] HMR-Benchmarks: A comprehensive benchmark of HPS datasets, backbones, and training strategies.
If you find this work useful, please cite our paper:
```bibtex
@article{lin2025questgeneralizablemotiongeneration,
  title={The Quest for Generalizable Motion Generation: Data, Model, and Evaluation},
  author={Jing Lin and Ruisi Wang and Junzhe Lu and Ziqi Huang and Guorui Song and Ailing Zeng and Xian Liu and Chen Wei and Wanqi Yin and Qingping Sun and Zhongang Cai and Lei Yang and Ziwei Liu},
  year={2025},
  journal={arXiv preprint arXiv:2510.26794},
}
```