PyTorch implementation of our ICME 2025 paper "STSA: Spatial-Temporal Semantic Alignment for Facial Visual Dubbing".
News: 🎉 This paper was selected as an Oral presentation at ICME 2025!
- Inference code
- Paper & supplementary material
- YouTube demo
- Training code
- Fine-tuning code
Demo videos: chinese.mp4 | korean.mp4 | japanese.mp4 | spanish.mp4
We compare our method with DiffTalk (CVPR'23), DINet (AAAI'23), IP-LAP (CVPR'23), MuseTalk (arXiv 2024), PC-AVS (CVPR'21), TalkLip (CVPR'23), and Wav2Lip (MM'20):
Comparison videos: Ours.mp4 | DiffTalk.mp4 | DINet.mp4 | IP-LAP.mp4 | MuseTalk.mp4 | PC-AVS.mp4 | TalkLIp.mp4 | Wav2Lip.mp4
- Python 3.8.7
- torch 1.12.1
- torchvision 0.13.1
- librosa 0.9.2
- ffmpeg
First, create the conda environment:
conda create -n stsa python=3.8
conda activate stsa
PyTorch 1.12.1 is used; the other requirements are listed in requirements.txt. Please run:
pip install -r requirements.txt
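To quickly verify the environment, an optional sanity-check sketch (the expected versions are those listed above) is:

# Optional sanity check of the installed packages and GPU availability.
import torch, torchvision, librosa
print(torch.__version__)          # expected: 1.12.1
print(torchvision.__version__)    # expected: 0.13.1
print(librosa.__version__)        # expected: 0.9.2
print(torch.cuda.is_available())  # True if a usable CUDA GPU is detected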
Download the pretrained weights and put them under ./checkpoints. After this, run the following command:
python inference.py --video_path "demo_templates/video/speakerine.mp4" --audio_path "demo_templates/audio/education.wav"
You can specify the --video_path and --audio_path options to run inference on other videos.
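For example (the paths below are hypothetical placeholders; point them at your own files):

python inference.py --video_path "path/to/your_video.mp4" --audio_path "path/to/your_audio.wav"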
- Download the LRS2 dataset, and move the LRS2/mvlrs_v1/main/ folder into the ./processed_lrs2 folder.
- Extract audio from the LRS2 videos by running:
python preprocess/preprocess_audio.py --data_root ./processed_lrs2/main/ --out_root ./processed_lrs2/lrs2_audio
- Extract Wav2Vec 2.0 feature by running:
python preprocess/extract_wav2vec_feature.py
- Extract face, sketch, landmarks by running:
python preprocess/preprocess_face.py
- Convert sketch into heatmap by running:
python preprocess/preprocess_heatmap.py
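To run the whole preprocessing pipeline in one go, a minimal driver sketch is shown below (the commands are exactly those listed above; the last three scripts are invoked without extra arguments, as in this README):

# Run the four preprocessing steps in order (sketch; commands copied from the steps above).
import subprocess

steps = [
    ["python", "preprocess/preprocess_audio.py",
     "--data_root", "./processed_lrs2/main/", "--out_root", "./processed_lrs2/lrs2_audio"],
    ["python", "preprocess/extract_wav2vec_feature.py"],
    ["python", "preprocess/preprocess_face.py"],
    ["python", "preprocess/preprocess_heatmap.py"],
]
for cmd in steps:
    subprocess.run(cmd, check=True)  # abort if any step fails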
After preprocessing, the processed_lrs2 folder structure is as follows:
./processed_lrs2/
├── lrs2_audio/
├── lrs2_face/
├── lrs2_heat_img_lower/
├── lrs2_heat_img_upper/
├── lrs2_heat_img_whole/
├── lrs2_heatmap_lower/
├── lrs2_heatmap_upper/
├── lrs2_heatmap_whole/
├── lrs2_landmarks/
├── lrs2_sketch_lower/
├── lrs2_sketch_upper/
├── lrs2_sketch_whole/
└── main/
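Before moving on to training, you can verify that every expected sub-folder exists (a minimal sketch based on the tree above):

# Check that all preprocessed sub-folders listed above are present under ./processed_lrs2/.
import os

expected = [
    "lrs2_audio", "lrs2_face",
    "lrs2_heat_img_lower", "lrs2_heat_img_upper", "lrs2_heat_img_whole",
    "lrs2_heatmap_lower", "lrs2_heatmap_upper", "lrs2_heatmap_whole",
    "lrs2_landmarks",
    "lrs2_sketch_lower", "lrs2_sketch_upper", "lrs2_sketch_whole",
    "main",
]
missing = [d for d in expected if not os.path.isdir(os.path.join("processed_lrs2", d))]
print("missing folders:", missing or "none")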
Run the following command, and adjust the lr (line 56) to 1e-5 at 75k steps and to 1e-6 at 130k steps.
python train_stage1.py
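Instead of editing line 56 by hand at each milestone, the schedule can also be expressed as a small helper like the sketch below (illustrative only; the initial lr value and the optimizer/global_step names are assumptions, not taken from train_stage1.py):

# Sketch of the lr schedule described above: drop to 1e-5 at 75k steps and to 1e-6 at 130k steps.
def lr_for_step(step, base_lr=1e-4):  # base_lr is an assumed initial value
    if step >= 130_000:
        return 1e-6
    if step >= 75_000:
        return 1e-5
    return base_lr

# Inside the training loop (optimizer and global_step are hypothetical names):
# for g in optimizer.param_groups:
#     g["lr"] = lr_for_step(global_step)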
First download the pretrained SyncNet weight from here, and put it under ./checkpoints/syncnet/. Then run the following command:
python train_stage2.py
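To confirm the SyncNet weight is in place and loadable before starting stage 2, a small check (a sketch; the exact checkpoint filename depends on the download) is:

# Sanity-check the SyncNet checkpoint under ./checkpoints/syncnet/ (filename not assumed).
import glob
import torch

ckpts = glob.glob("checkpoints/syncnet/*.pth")
assert ckpts, "No SyncNet weight found under ./checkpoints/syncnet/"
state = torch.load(ckpts[0], map_location="cpu")
print("loaded", ckpts[0], "with", len(state) if isinstance(state, dict) else type(state), "entries")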
In train_stage3.py, replace the finetune_path (line 60) and finetune_path_disc (line 61) with the paths of the face synthesizer and discriminator weights you trained in stage 2, and replace the heatmap_finetune_path (line 62) with the path of the heatmap predictor weight you trained in stage 1. Then run the following command:
python train_stage3.py
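After editing, lines 60-62 of train_stage3.py would look roughly like this (the paths are hypothetical examples; use the checkpoints produced by your own stage-1 and stage-2 runs):

finetune_path = "checkpoints/stage2/face_synthesizer.pth"            # stage-2 face synthesizer weight (example path)
finetune_path_disc = "checkpoints/stage2/discriminator.pth"          # stage-2 discriminator weight (example path)
heatmap_finetune_path = "checkpoints/stage1/heatmap_predictor.pth"   # stage-1 heatmap predictor weight (example path)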
We thank IP-LAP, Wav2Lip, DINet, LAB and DIM for making their open-source resources available, which supported the development of this work.
If you find this project useful, please consider citing us!
@article{ding2025stsa,
title={STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing},
author={Ding, Zijun and Xiong, Mingdie and Zhu, Congcong and Chen, Jingrun},
journal={arXiv preprint arXiv:2503.23039},
year={2025}
}