The official PyTorch implementation of the paper "GMD: Controllable Human Motion Synthesis via Guided Diffusion Models".
For more details, visit our project page.
📢
20/Dec/23 - We release DNO: Optimizing Diffusion Noise Can Serve As Universal Motion Priors, a follow-up work that looks at how to effectively use diffusion models and guidance to tackle many motion tasks.
28/July/23 - First release.
If you find this code useful in your research, please cite:
@inproceedings{karunratanakul2023gmd,
title = {Guided Motion Diffusion for Controllable Human Motion Synthesis},
author = {Karunratanakul, Korrawe and Preechakul, Konpat and Suwajanakorn, Supasorn and Tang, Siyu},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages = {2151--2162},
year = {2023}
}
This code was tested on Ubuntu 20.04 LTS and requires:
- Python 3.7
- conda3 or miniconda3
- CUDA capable GPU (one is enough)
Install ffmpeg (if not already installed):
sudo apt update
sudo apt install ffmpeg
For Windows, use this instead.
GMD shares a large part of its base dependencies with MDM. However, you might find it easier to install our dependencies from scratch due to some key version differences.
Setup conda env:
conda env create -f environment_gmd.yml
conda activate gmd
conda remove --force ffmpeg
python -m spacy download en_core_web_sm
pip install git+https://github.com/openai/CLIP.git
Download dependencies:
Text to Motion
bash prepare/download_smpl_files.sh
bash prepare/download_glove.sh
bash prepare/download_t2m_evaluators.sh
Unconstrained
bash prepare/download_smpl_files.sh
bash prepare/download_recognition_unconstrained_models.sh
There are two paths to get the data:
(a) Generation only with the pretrained text-to-motion model, without training or evaluating.
(b) Get full data to train and evaluate the model.
HumanML3D - Clone HumanML3D, then copy the data dir to our repository:
cd ..
git clone https://github.com/EricGuo5513/HumanML3D.git
unzip ./HumanML3D/HumanML3D/texts.zip -d ./HumanML3D/HumanML3D/
cp -r HumanML3D/HumanML3D guided-motion-diffusion/dataset/HumanML3D
cd guided-motion-diffusion
cp -a dataset/HumanML3D_abs/. dataset/HumanML3D/
[Important !]
Because we change the representation of the root joint from relative to absolute, you need to replace the original files and run our version of motion_representation.ipynb and cal_mean_variance.ipynb provided in ./HumanML3D_abs/ instead to get the absolute-root data.
HumanML3D - Follow the instructions in HumanML3D, then copy the resulting dataset to our repository:
cp -r ../HumanML3D/HumanML3D ./dataset/HumanML3D
Download both models, then unzip and place them in ./save/.
Both models are trained on the HumanML3D dataset.
Text to Motion - Without spatial conditioning
This part is a standard text-to-motion generation.
Note: We changed the behavior of the --num_repetitions flag from the original MDM repo to facilitate the two-stage pipeline and logging. Only --num_repetitions 1 is supported at the moment.
python -m sample.generate --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt --num_samples 10
python -m sample.generate --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt --input_text ./assets/example_text_prompts.txt
python -m sample.generate --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt --text_prompt "a person is picking up something on the floor"
Text to Motion - With keyframe locations conditioning
The predefined pattern can be found in get_kframes() in sample/keyframe_pattern.py. You can add more patterns there using the same format [(frame_num_1, (x_1, z_1)), (frame_num_2, (x_2, z_2)), ...] where x and z are the location of the root joint on the plane in the world coordinate system.
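As an illustration of the format above, the following sketch builds a keyframe pattern in plain Python. The helper name make_line_kframes and the chosen frame numbers are hypothetical and not part of the repository; only the [(frame_num, (x, z)), ...] layout is taken from the description above.

```python
from typing import List, Tuple

# One keyframe: (frame_num, (x, z)) with the root location on the ground plane.
Keyframe = Tuple[int, Tuple[float, float]]


def make_line_kframes(frames: List[int],
                      start: Tuple[float, float],
                      end: Tuple[float, float]) -> List[Keyframe]:
    """Place keyframes evenly along a straight line from start to end.

    Hypothetical helper for illustration only; it just produces the
    [(frame_num, (x, z)), ...] layout described above.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    last = len(frames) - 1
    pattern = []
    for i, frame in enumerate(frames):
        t = i / last  # interpolation weight in [0, 1]
        x = start[0] + t * (end[0] - start[0])
        z = start[1] + t * (end[1] - start[1])
        pattern.append((frame, (x, z)))
    return pattern


# Three keyframes that move the root 3 units along z.
pattern = make_line_kframes([30, 60, 90], start=(0.0, 0.0), end=(0.0, 3.0))
```

A list like this could then be added as a new pattern next to the predefined ones in get_kframes(); how patterns are named and selected there is not shown here.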
python -m sample.generate --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt --text_prompt "a person is walking while raising both hands" --guidance_mode kps
(In development) Using the --interactive flag will start an interactive window that allows you to choose the keyframes yourself. The interactive pattern will override the predefined pattern.
Text to Motion - With keyframe locations conditioning and obstacle avoidance
Similarly, the pattern is defined in get_obstacles() in sample/keyframe_pattern.py. You can add more patterns using the format ((x, z), radius). Currently, only circular obstacles are supported because their signed distance function (SDF) is easy to define, but you can add any shape that has a valid SDF.
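To make the obstacle format concrete, here is a minimal sketch of a signed distance function for circles given as ((x, z), radius). The function names circle_sdf and scene_sdf are assumptions made for this example, and the sketch uses the common convention that the distance is negative inside an obstacle. How the repository turns such distances into a guidance signal is not covered here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]        # (x, z) on the ground plane
Obstacle = Tuple[Point, float]     # ((x, z), radius)


def circle_sdf(point: Point, obstacle: Obstacle) -> float:
    """Signed distance from point to a circle: negative inside, zero on the rim."""
    (cx, cz), radius = obstacle
    return math.hypot(point[0] - cx, point[1] - cz) - radius


def scene_sdf(point: Point, obstacles: Sequence[Obstacle]) -> float:
    """Distance to the nearest obstacle (the union of shapes is the minimum SDF)."""
    return min(circle_sdf(point, obstacle) for obstacle in obstacles)


# Two example obstacles (illustrative values, not the repository's presets).
obstacles = [((0.0, 3.0), 1.0), ((4.0, 0.0), 0.5)]
clearance = scene_sdf((0.0, 0.0), obstacles)
```

Any other shape can be supported in the same way by supplying a function that returns its signed distance and taking the minimum over all shapes.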
python -m sample.generate --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt --text_prompt "a person is walking while raising both hands" --guidance_mode sdf --seed 11
Text to Motion - With trajectory conditioning
The trajectory-conditioned generation is a special case of keyframe-conditioned generation, where all the frames are keyframes.
The sample trajectory we used can be found in ./save/template_joints.npy. You can also use your own trajectory by providing the list of ground_positions.
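The relationship between a trajectory and keyframes can be sketched as follows: a dense list of (x, z) ground positions, one per frame, is the same as a keyframe pattern in which every frame is a keyframe. The helpers circle_trajectory and as_dense_kframes are hypothetical, frame numbering from 0 is an assumption, and the exact way to pass your own ground positions to the sampling script is not shown here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, z) on the ground plane


def circle_trajectory(num_frames: int, radius: float) -> List[Point]:
    """Ground positions tracing one full circle that starts at the origin."""
    positions = []
    for i in range(num_frames):
        angle = 2.0 * math.pi * i / num_frames
        x = radius * math.sin(angle)
        z = radius * (1.0 - math.cos(angle))
        positions.append((x, z))
    return positions


def as_dense_kframes(positions: List[Point]) -> List[Tuple[int, Point]]:
    """Treat every frame as a keyframe: [(frame_num, (x, z)), ...]."""
    return [(frame, xz) for frame, xz in enumerate(positions)]


# Illustrative values only: 120 frames on a circle of radius 1.5.
ground_positions = circle_trajectory(num_frames=120, radius=1.5)
dense_kframes = as_dense_kframes(ground_positions)
```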
python -m sample.generate --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt --text_prompt "a person is walking while raising both hands" --guidance_mode trajectory
(In development) Using the --interactive flag will start an interactive window that allows you to draw a trajectory that will override the predefined pattern.
You may also define:
- --device id.
- --seed to sample different prompts.
- --motion_length (text-to-motion only) in seconds (maximum is 9.8[sec]).
- --progress to save the denoising progress.
Running those will get you:
- results.npy file with text prompts and xyz positions of the generated animation
- sample##_rep##.mp4 - a stick figure animation for each generated motion.
- trajec_##_#### - a plot of the trajectory at each denoising step of the trajectory model. The final trajectory is then used to generate the motion.
- motion_trajec_##_#### - a plot of the trajectory of the generated motion at each denoising step of the motion model.
You can stop here, or render the SMPL mesh using the following script.
To create SMPL mesh per frame run:
python -m visualize.render_mesh --input_path /path/to/mp4/stick/figure/file
This script outputs:
- sample##_rep##_smpl_params.npy - SMPL parameters (thetas, root translations, vertices and faces)
- sample##_rep##_obj - Mesh per frame in .obj format.
Notes:
- The .obj can be integrated into Blender/Maya/3DS-MAX and rendered using them.
- This script is running SMPLify and needs GPU as well (can be specified with the --device flag).
- Important - Do not change the original .mp4 path before running the script.
Notes for 3d makers:
- You have two ways to animate the sequence:
  - Use the SMPL add-on and the theta parameters saved to sample##_rep##_smpl_params.npy (we always use beta=0 and the gender-neutral model).
  - A more straightforward way is using the mesh data itself. All meshes have the same topology (SMPL), so you just need to keyframe vertex locations. Since the OBJs do not preserve vertex order, we also save this data to the sample##_rep##_smpl_params.npy file for your convenience.
GMD is trained on the HumanML3D dataset.
python -m train.train_trajectory
python -m train.train_gmd
Essentially, the same command is used for both the trajectory model and the motion model. You can select which model to train by changing the train_args. The training options can be found in ./configs/card.py.
- Use --device to define GPU id.
- Add --train_platform_type {ClearmlPlatform, TensorboardPlatform} to track results with either ClearML or Tensorboard.
All evaluations are done on the HumanML3D dataset.
- Takes about 20 hours (on a single GPU)
- The output of this script for the pre-trained models (as was reported in the paper) is provided in the checkpoints zip file.
python -m eval.eval_humanml --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt
For each prompt, we use the ground truth trajectory as conditions.
python -m eval.eval_humanml --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt --full_traj_inpaint
For each prompt, 5 keyframes are sampled from the ground truth motion. The ground locations of the root joint in those frames are used as conditions.
python -m eval.eval_humanml_condition --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt
We would like to thank the following contributors for the great foundation that we build upon:
MDM, guided-diffusion, MotionCLIP, text-to-motion, actor, joints2smpl, MoDi.
This code is distributed under an MIT LICENSE.
Note that our code depends on other libraries, including CLIP, SMPL, SMPL-X, PyTorch3D, and uses datasets that each have their own respective licenses that must also be followed.




