N is the total number of classes, and M is the class size.
Training
sh run.sh
Test
sh inference.sh
Weakness
DiffTalk models talking head generation as an iterative denoising process, which requires more time to synthesize a frame than most GAN-based approaches. This is a common limitation of LDM-based works.
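For intuition, here is a minimal sketch of why a diffusion sampler costs more per frame than a GAN: the denoiser must be evaluated once per timestep rather than once per frame. This is not the DiffTalk implementation; the function names, step count, and update rule are illustrative placeholders.

```python
import torch

def denoiser(z, t, cond):
    # Placeholder for the conditional U-Net eps_theta(z_t, t, cond) in an LDM.
    return torch.zeros_like(z)

def gan_generate(cond):
    # A GAN generator produces a frame in a single forward pass.
    return torch.randn(1, 3, 256, 256)

def diffusion_generate(cond, num_steps=200, latent_shape=(1, 3, 64, 64)):
    # A diffusion/LDM sampler runs the denoiser num_steps times per frame,
    # which is the main source of the extra inference cost noted above.
    z = torch.randn(latent_shape)
    for t in reversed(range(num_steps)):
        eps = denoiser(z, t, cond)
        z = z - eps / num_steps  # stand-in for a real DDPM/DDIM update rule
    return z
```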
The model is trained on the HDTF dataset, and it sometimes fails on identities from other datasets.
When driving a portrait with more challenging cross-identity audio, the audio-lip synchronization of the synthesized video is slightly inferior to that under the self-driven setting.
During inference, the network is also sensitive to the mask shape in z_T: the mask needs to cover the mouth region completely, and its shape must not leak any lip-shape information.
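As an illustration of the last point, one conservative way to build such a mask is to blank out a fixed lower-face rectangle rather than tracing the lip contour, so the mask itself carries no lip-shape information. The sketch below is an assumption, not the repository's masking code; the region bounds are illustrative.

```python
import numpy as np

def lower_face_mask(height, width, top_frac=0.55):
    # Mask a fixed rectangle over the lower part of the aligned face crop.
    # A fixed rectangle (rather than a lip-contour mask) covers the whole
    # mouth region and does not leak lip-shape information into z_T.
    mask = np.ones((height, width), dtype=np.float32)
    mask[int(height * top_frac):, :] = 0.0  # zero out the region to be inpainted
    return mask

mask = lower_face_mask(256, 256)
```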
Acknowledgement
This code is built upon the publicly available latent-diffusion codebase. We thank the authors of latent-diffusion for making their excellent work and code publicly available.
Citation
Please cite the following paper if you use this repository in your research.
@inproceedings{shen2023difftalk,
  author={Shen, Shuai and Zhao, Wenliang and Meng, Zibin and Li, Wanhua and Zhu, Zheng and Zhou, Jie and Lu, Jiwen},
  title={DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation},
  booktitle={CVPR},
  year={2023}
}