PyTorch implementation of our ICME 2025 paper "STSA: Spatial-Temporal Semantic Alignment for Facial Visual Dubbing".
News: 🎉 This paper was selected as an Oral presentation at ICME 2025!
- Inference code
- Paper & supplementary material
- YouTube demo
- Training code
- Fine-tuning code
Demo videos: chinese.mp4 | korean.mp4 | japanese.mp4 | spanish.mp4
We compare our method with DiffTalk (CVPR'23), DINet (AAAI'23), IP-LAP (CVPR'23), MuseTalk (arXiv 2024), PC-AVS (CVPR'21), TalkLip (CVPR'23), and Wav2Lip (MM'20):
Comparison videos: Ours.mp4 | DiffTalk.mp4 | DINet.mp4 | IP-LAP.mp4 | MuseTalk.mp4 | PC-AVS.mp4 | TalkLIp.mp4 | Wav2Lip.mp4
- Python 3.8.7
- torch 1.12.1
- torchvision 0.13.1
- librosa 0.9.2
- ffmpeg
First, create the conda environment:
conda create -n stsa python=3.8
conda activate stsa
PyTorch 1.12.1 is used; the other requirements are listed in requirements.txt. Please run:
pip install -r requirements.txt
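To quickly verify the environment, an optional sanity-check sketch (the expected versions are those listed above) is:

# Optional sanity check of the installed packages and GPU availability.
import torch, torchvision, librosa
print(torch.__version__)          # expected: 1.12.1
print(torchvision.__version__)    # expected: 0.13.1
print(librosa.__version__)        # expected: 0.9.2
print(torch.cuda.is_available())  # True if a usable CUDA GPU is detected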
Download the pretrained weights and put them under ./checkpoints. After this, run the following command:
python inference.py --video_path "demo_templates/video/speakerine.mp4" --audio_path "demo_templates/audio/education.wav"
You can specify the --video_path and --audio_path options to run inference on other videos.
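For example (the paths below are hypothetical placeholders; point them at your own files):

python inference.py --video_path "path/to/your_video.mp4" --audio_path "path/to/your_audio.wav"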
- Download the LRS2 dataset, and move the LRS2/mvlrs_v1/main/ folder into the ./processed_lrs2 folder.
- Extract audio from the LRS2 videos by running:
python preprocess/preprocess_audio.py --data_root ./processed_lrs2/main/ --out_root ./processed_lrs2/lrs2_audio
- Extract Wav2Vec 2.0 feature by running:
python preprocess/extract_wav2vec_feature.py
- Extract face, sketch, landmarks by running:
python preprocess/preprocess_face.py
- Convert sketch into heatmap by running:
python preprocess/preprocess_heatmap.py
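To run the whole preprocessing pipeline in one go, a minimal driver sketch is shown below (the commands are exactly those listed above; the last three scripts are invoked without extra arguments, as in this README):

# Run the four preprocessing steps in order (sketch; commands copied from the steps above).
import subprocess

steps = [
    ["python", "preprocess/preprocess_audio.py",
     "--data_root", "./processed_lrs2/main/", "--out_root", "./processed_lrs2/lrs2_audio"],
    ["python", "preprocess/extract_wav2vec_feature.py"],
    ["python", "preprocess/preprocess_face.py"],
    ["python", "preprocess/preprocess_heatmap.py"],
]
for cmd in steps:
    subprocess.run(cmd, check=True)  # abort if any step fails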
After preprocessing, the processed_lrs2 folder structure is as follows:
./processed_lrs2/
├── lrs2_audio/
├── lrs2_face/
├── lrs2_heat_img_lower/
├── lrs2_heat_img_upper/
├── lrs2_heat_img_whole/
├── lrs2_heatmap_lower/
├── lrs2_heatmap_upper/
├── lrs2_heatmap_whole/
├── lrs2_landmarks/
├── lrs2_sketch_lower/
├── lrs2_sketch_upper/
├── lrs2_sketch_whole/
└── main/
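Before moving on to training, you can verify that every expected sub-folder exists (a minimal sketch based on the tree above):

# Check that all preprocessed sub-folders listed above are present under ./processed_lrs2/.
import os

expected = [
    "lrs2_audio", "lrs2_face",
    "lrs2_heat_img_lower", "lrs2_heat_img_upper", "lrs2_heat_img_whole",
    "lrs2_heatmap_lower", "lrs2_heatmap_upper", "lrs2_heatmap_whole",
    "lrs2_landmarks",
    "lrs2_sketch_lower", "lrs2_sketch_upper", "lrs2_sketch_whole",
    "main",
]
missing = [d for d in expected if not os.path.isdir(os.path.join("processed_lrs2", d))]
print("missing folders:", missing or "none")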
Run the following command, and adjust the lr (line 56) to 1e-5 at 75k steps and to 1e-6 at 130k steps.
python train_stage1.py
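Instead of editing line 56 by hand at each milestone, the schedule can also be expressed as a small helper like the sketch below (illustrative only; the initial lr value and the optimizer/global_step names are assumptions, not taken from train_stage1.py):

# Sketch of the lr schedule described above: drop to 1e-5 at 75k steps and to 1e-6 at 130k steps.
def lr_for_step(step, base_lr=1e-4):  # base_lr is an assumed initial value
    if step >= 130_000:
        return 1e-6
    if step >= 75_000:
        return 1e-5
    return base_lr

# Inside the training loop (optimizer and global_step are hypothetical names):
# for g in optimizer.param_groups:
#     g["lr"] = lr_for_step(global_step)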
First download the pretrained SyncNet weight from here, and put it under ./checkpoints/syncnet/. Then run the following command:
python train_stage2.py
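To confirm the SyncNet weight is in place and loadable before starting stage 2, a small check (a sketch; the exact checkpoint filename depends on the download) is:

# Sanity-check the SyncNet checkpoint under ./checkpoints/syncnet/ (filename not assumed).
import glob
import torch

ckpts = glob.glob("checkpoints/syncnet/*.pth")
assert ckpts, "No SyncNet weight found under ./checkpoints/syncnet/"
state = torch.load(ckpts[0], map_location="cpu")
print("loaded", ckpts[0], "with", len(state) if isinstance(state, dict) else type(state), "entries")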
In train_stage3.py, replace the finetune_path (line 60) and finetune_path_disc (line 61) with the paths of the face synthesizer and discriminator weights you trained in stage 2, and replace the heatmap_finetune_path (line 62) with the path of the heatmap predictor weight you trained in stage 1. Then run the following command:
python train_stage3.py
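After editing, lines 60-62 of train_stage3.py would look roughly like this (the paths are hypothetical examples; use the checkpoints produced by your own stage-1 and stage-2 runs):

finetune_path = "checkpoints/stage2/face_synthesizer.pth"            # stage-2 face synthesizer weight (example path)
finetune_path_disc = "checkpoints/stage2/discriminator.pth"          # stage-2 discriminator weight (example path)
heatmap_finetune_path = "checkpoints/stage1/heatmap_predictor.pth"   # stage-1 heatmap predictor weight (example path)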
We thank IP-LAP, Wav2Lip, DINet, LAB and DIM for making their open-source resources available, which supported the development of this work.
If you find this project useful, please consider citing us!
@article{ding2025stsa,
title={STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing},
author={Ding, Zijun and Xiong, Mingdie and Zhu, Congcong and Chen, Jingrun},
journal={arXiv preprint arXiv:2503.23039},
year={2025}
}