SyncTalk_2D is a 2D lip-sync video generation model based on SyncTalk and Ultralight-Digital-Human. It generates high-quality, low-latency lip-sync videos and can also be used for real-time lip-sync generation.
Compared to Ultralight-Digital-Human, we improved the audio feature encoder and increased the resolution to 328 to accommodate higher-resolution input video. This version can produce high-definition, commercial-grade digital humans.
Record a 5-minute video with your head facing the camera and without significant movement. Keep the camera fixed and the background lighting constant throughout the recording.
Don't worry about the frame rate; the code automatically converts the video to 25 fps (see the conversion sketch after these tips).
Make sure no other person's voice appears in the recording, and leave a 5-second silent clip at both the beginning and the end of the video.
Avoid clothes with strongly textured patterns; solid-color clothing is best.
The video should be recorded in a well-lit environment.
The audio should be clear and without background noise.
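
For reference, the automatic frame-rate conversion amounts to re-encoding the clip at a fixed 25 fps. If you want to pre-convert a recording yourself, a standard ffmpeg command like the one below would do it; input.mp4 and output_25fps.mp4 are placeholder names, and the repo's own preprocessing may use different flags.

ffmpeg -i input.mp4 -r 25 -c:a copy output_25fps.mp4

This forces the video stream to 25 fps while copying the audio stream unchanged.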
Train
Put your video at dataset/name/name.mp4, where name is your character's name.
example: dataset/May/May.mp4
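
For instance, to prepare a character named May from a hypothetical local recording:

mkdir -p dataset/May
cp /path/to/your_recording.mp4 dataset/May/May.mp4

The directory name and the video filename share the same character name, as in the pattern above.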
Run the preprocessing and training script, where name matches your dataset directory and gpu_id is the index of the GPU to train on:
bash training_328.sh name gpu_id
example: bash training_328.sh May 0
Wait for training to complete; it takes approximately 5 hours.
If an out-of-memory (OOM) error occurs, try reducing batch_size (see the note below).
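
This README doesn't show where batch_size is defined, so as a hypothetical starting point you can search the repo for it and lower the value where it is set:

grep -rn "batch_size" .

Halving the value (for example from 16 to 8) is a common first step when GPU memory runs out.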