MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation (IEEE TCSVT 2025)
Siyuan Wang, Jiawei Liu, Wei Wang, Yeying Jin, Jinsong Du, Zhi Han
Paper (TCSVT 2025)
Co-speech gesture video generation aims to synthesize expressive talking videos from a still portrait and a speech audio track. However, purely audio-controlled methods often:
- Miss large body and hand motions
- Struggle to emphasize key motion regions (face, lips, hands, upper body)
- Introduce temporal flickering or visual artifacts

MMGT addresses these issues with a motion-mask-guided two-stage framework:
- SMGA: Spatial Mask-Guided Audio2Pose
  - Converts audio into high-quality pose videos
  - Predicts motion masks to highlight regions with significant movement (face, lips, hands, upper body)
- Diffusion-based video generator with MM-HAA (Motion-Masked Hierarchical Audio Attention)
  - A stabilized diffusion video model
  - Takes audio, pose, and motion masks as input
  - Generates temporally stable, lip-synchronized, and detail-controllable gesture videos
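Put together, the pipeline is simply Stage 1 feeding Stage 2. The sketch below is only illustrative: `stage1` and `stage2` stand in for the trained SMGA and diffusion models and are not names exposed by this repository.

```python
def generate_gesture_video(ref_image, audio, stage1, stage2):
    """Illustrative composition of the two MMGT stages.

    stage1 / stage2 are assumed to be callables wrapping the trained
    SMGA and MM-HAA diffusion models (hypothetical interface).
    """
    # Stage 1 (SMGA): speech audio -> pose video + region motion masks
    pose_video, motion_masks = stage1(ref_image, audio)

    # Stage 2 (MM-HAA diffusion generator): audio + pose + masks -> final video
    return stage2(ref_image=ref_image, audio=audio,
                  pose=pose_video, masks=motion_masks)
```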
- 2025-09-01: Our paper
  "MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation"
  has been accepted to IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2025.
  DOI: 10.1109/TCSVT.2025.3604109
We plan to open-source MMGT around September 2025, focusing on the following four deliverables:
- Video demos
- Inference code (including long-video support)
- Training code
- Multi-person & multi-scene model weights
We recommend the following setup:
- Python >= 3.10
- CUDA 12.4

(Other versions may work but are not thoroughly tested.)
```bash
conda create -n MMGT python=3.10
conda activate MMGT
pip install -r requirements.txt
```

Pre-trained weights are available on HuggingFace:
Download the checkpoints and place them according to the paths specified in the config files under ./configs.
Note: The current implementation supports video lengths of up to 3.2 seconds.
Extended / long-video generation will be released together with the full open-source version.
End-to-end generation from audio + single image:
```bash
python scripts/audio2vid.py -c ./configs/prompts/animation.yaml --image_path /path/to/your/image.png --audio_path /path/to/your/audio.wav --out_dir /path/to/output_dir
```

If you already have pose and motion-mask videos (e.g., from Stage 1 or other methods), you can directly drive the video generator:
```bash
python scripts/pose2vid.py -c ./configs/prompts/animation.yaml --image_path /path/to/img.png --pose_path /path/to/pose.mp4 --face_mask_path /path/to/face.mp4 --lips_mask_path /path/to/lips.mp4 --hands_mask_path /path/to/hands.mp4 --out_dir ./outputs
```

For detailed data preparation (including dataset structure, preprocessing scripts, and examples), please refer to the data pipeline of:
https://github.com/thuhcsi/S2G-MDDiffusion#-data-preparation
```bash
python -m scripts.data_preprocess --input_dir "Path to the 512×512 training or test video files processed according to the above procedure"
python data/extract_movment_mask_all.py --input_root "Path to the 512×512 training or test video files processed according to the above procedure"
```
```
|-- data/train/
|   |-- keypoints/
|   |   |-- 0001.npy
|   |   |-- 0002.npy
|   |   |-- 0003.npy
|   |   `-- 0004.npy
|   |-- audios/
|   |   |-- 0001.wav
|   |   |-- 0002.wav
|   |   |-- 0003.wav
|   |   `-- 0004.wav
```
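As a quick sanity check of this layout, something like the following can verify that every keypoint file has a matching audio clip (paths follow the tree above; the `.npy` array shape depends on the keypoint extractor):

```python
from pathlib import Path
import numpy as np

root = Path("data/train")
for kp_file in sorted((root / "keypoints").glob("*.npy")):
    wav_file = root / "audios" / f"{kp_file.stem}.wav"
    assert wav_file.exists(), f"missing audio for {kp_file.name}"
    keypoints = np.load(kp_file)
    print(kp_file.stem, keypoints.shape)  # shape depends on the keypoint extractor
```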
Then extract the baseline and WavLM audio features:

```bash
cd data
python create_dataset.py --extract-baseline --extract-wavlm
cd ..
```

The raw clips are expected under `data/train/` as paired videos and audios:

```
|--- data/train/
|    |--- videos
|    |    |--- chemistry#99999.mp4
|    |    |--- oliver#88888.mp4
|    |--- audios
|    |    |--- chemistry#99999.wav
|    |    |--- oliver#88888.wav
```

After preprocessing, the directory also contains the separated motion-mask videos, DWPose videos, and audio embeddings:

```
|--- data/train/
|    |--- videos
|    |    |--- chemistry#99999.mp4
|    |    |--- oliver#88888.mp4
|    |--- audios
|    |    |--- chemistry#99999.wav
|    |    |--- oliver#88888.wav
|    |--- sep_lips_mask
|    |    |--- chemistry#99999.mp4
|    |    |--- oliver#88888.mp4
|    |--- sep_face_mask
|    |    |--- chemistry#99999.mp4
|    |    |--- oliver#88888.mp4
|    |--- videos_dwpose
|    |    |--- chemistry#99999.mp4
|    |    |--- oliver#88888.mp4
|    |--- audio_emb
|    |    |--- chemistry#99999.pt
|    |    |--- oliver#88888.pt
```
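To get a feel for what the extracted motion-mask videos contain, one can peek at a single frame with OpenCV (file name taken from the example tree above; whether the masks are strictly binary is an assumption):

```python
import cv2

# Read one frame of a face motion-mask video produced by the preprocessing step.
cap = cv2.VideoCapture("data/train/sep_face_mask/chemistry#99999.mp4")
ok, frame = cap.read()
cap.release()

if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Expected to be (close to) a binary map: ~255 inside the face region, ~0 elsewhere.
    print(gray.shape, gray.min(), gray.max())
```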
Then extract the meta information used by the two training stages:

```bash
python scripts/extract_meta_info_stage1.py -r data/videos -n data
python tool/extract_meta_info_stage2_move_mask.py --root_path data/train --dataset_name my_dataset --meta_info_name data
```

Train Process 1: SMGA (Audio2Pose + Motion Masks)
```bash
accelerate launch train_a2p.py
```

This stage learns to map raw speech audio to:
- Pose sequences
- Region-specific motion masks (face, lips, hands, upper body)
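Very roughly, the input/output contract can be pictured as the toy module below. This is not the SMGA architecture from `train_a2p.py`: the dimensions are made up, and the spatial motion masks are collapsed to per-frame, per-region weights purely for illustration.

```python
import torch
import torch.nn as nn

class Audio2PoseSketch(nn.Module):
    """Toy stand-in for Stage 1: audio features -> pose sequence + region weights."""
    def __init__(self, audio_dim=1024, pose_dim=134, hidden=512, num_regions=4):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, hidden, batch_first=True)
        self.pose_head = nn.Linear(hidden, pose_dim)        # per-frame pose keypoints
        self.region_head = nn.Linear(hidden, num_regions)   # face / lips / hands / upper body

    def forward(self, audio_feats):                          # (B, T, audio_dim)
        h, _ = self.encoder(audio_feats)
        pose = self.pose_head(h)                             # (B, T, pose_dim)
        region_weights = torch.sigmoid(self.region_head(h))  # (B, T, 4), higher = more motion
        return pose, region_weights
```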
Train Process 2: Diffusion Video Generator (with MM-HAA)
```bash
accelerate launch train_stage_1.py --config configs/train/stage1.yaml
accelerate launch train_stage_2.py --config configs/train/stage2.yaml
```

This stage fine-tunes the diffusion model to:
- Jointly use audio, poses, and motion masks
- Produce synchronized, artifact-free gesture videos
- Emphasize large-motion regions through Motion-Masked Hierarchical Audio Attention (MM-HAA)
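A generic way to picture how motion masks can steer audio conditioning is masked cross-attention: video tokens attend to audio features, and the attended result is gated by the motion mask so that high-motion regions receive stronger audio guidance. The sketch below is an illustration under assumed shapes, not the MM-HAA module implemented in this repository.

```python
import torch
import torch.nn as nn

class MaskedAudioCrossAttention(nn.Module):
    """Illustrative motion-masked audio cross-attention (not the repo's MM-HAA code)."""
    def __init__(self, dim=320, audio_dim=768, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, kdim=audio_dim, vdim=audio_dim,
                                          batch_first=True)

    def forward(self, video_tokens, audio_tokens, motion_mask):
        # video_tokens: (B, N, dim)       flattened spatial tokens of a frame
        # audio_tokens: (B, M, audio_dim) audio features for the same time window
        # motion_mask:  (B, N, 1)         in [0, 1], ~1 over face/lips/hands/upper body
        attended, _ = self.attn(self.norm(video_tokens), audio_tokens, audio_tokens)
        # Gate the audio-conditioned update so it concentrates on moving regions.
        return video_tokens + motion_mask * attended


# Example shapes: a 64x64 token grid with 32 audio tokens.
layer = MaskedAudioCrossAttention()
video = torch.randn(2, 64 * 64, 320)
audio = torch.randn(2, 32, 768)
mask = torch.rand(2, 64 * 64, 1)
out = layer(video, audio, mask)   # (2, 4096, 320)
```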
If you find MMGT useful in your research, please consider citing our TCSVT 2025 paper:
```bibtex
@ARTICLE{11145152,
  author   = {Wang, Siyuan and Liu, Jiawei and Wang, Wei and Jin, Yeying and Du, Jinsong and Han, Zhi},
  journal  = {IEEE Transactions on Circuits and Systems for Video Technology},
  title    = {MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation},
  year     = {2025},
  volume   = {},
  number   = {},
  pages    = {1-1},
  keywords = {Videos;Faces;Synchronization;Hands;Lips;Training;Electronic mail;Distortion;data mining;Circuits and systems;Spatial Mask Guided Audio2Pose Generation Network (SMGA);Co-speech Video Generation;Motion Masked Hierarchical Audio Attention (MM-HAA)},
  doi      = {10.1109/TCSVT.2025.3604109}
}
```