PyTorch Implementation of GenerSpeech (NeurIPS'22): a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
We observe that the CNN-based speech encoder works well. To simplify the overall pipeline, we open-source our Global Emotion/Speaker Encoder re-implementation and pretrained models in this branch.
For the implementation of the wav2vec 2.0 global encoder, please refer to the wav2vec branch here.
Download the pretrained model we provide here.
| Model | Description |
|---|---|
| Encoder | Global Emotion Encoder |
For other datasets, please fine-tune the pretrained models for better results (see the training commands below).
To extract a global emotion embedding from a waveform:

```python
from pathlib import Path

from encoder import inference as EmotionEncoder
from encoder.inference import embed_utterance as Embed_utterance
from encoder.inference import preprocess_wav

# Load the pretrained Global Emotion Encoder
EmotionEncoder.load_model(Path("path/to/emotion_encoder"))

# Preprocess the waveform and compute the utterance-level emotion embedding
processed_wav = preprocess_wav(Path("path/to/wave"))
emo_embed = Embed_utterance(processed_wav)
```
To extract a speaker embedding with Resemblyzer:

```python
from resemblyzer import VoiceEncoder, preprocess_wav

# Load the pretrained speaker encoder
spk_encoder = VoiceEncoder()

# Preprocess the waveform and compute the utterance-level speaker embedding
processed_wav = preprocess_wav("path/to/wave")
spk_embed = spk_encoder.embed_utterance(processed_wav)
```
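Both encoders return a fixed-dimensional utterance embedding. Resemblyzer's embeddings are L2-normalized, so a plain inner product gives the cosine similarity between two utterances; a quick sanity check might look like the following sketch (the file names are placeholders):

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

spk_encoder = VoiceEncoder()

# Resemblyzer embeddings are L2-normalized, so the inner product
# of two embeddings equals their cosine similarity.
embed_a = spk_encoder.embed_utterance(preprocess_wav("speaker_a.wav"))  # placeholder file
embed_b = spk_encoder.embed_utterance(preprocess_wav("speaker_b.wav"))  # placeholder file
print(f"cosine similarity: {np.inner(embed_a, embed_b):.3f}")
```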
Python 3.6+ is required. Run `pip install -r requirements.txt` to install the necessary packages.
A GPU is mandatory, but you don't necessarily need a high-tier GPU if you only want to use the toolbox.
Ideally, all your datasets are kept under the same directory, i.e., `<datasets_root>`. All preprocessing scripts will, by default, output the cleaned data to a new `SV2TTS` directory created in your datasets root; inside it, a subdirectory is created for the encoder.
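For reference, a sketch of the resulting layout, assuming the preprocessing follows Real-Time-Voice-Cloning's conventions (per-speaker subdirectories of `.npy` mel-frame arrays plus a `_sources.txt` index); the exact names may differ:

```
<datasets_root>
└── SV2TTS
    └── encoder
        ├── <speaker_1>
        │   ├── _sources.txt
        │   └── *.npy          # preprocessed mel-frame arrays
        └── <speaker_2>
            └── ...
```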
```sh
# 1. Preprocess the audio into mel frames
python encoder_preprocess.py <datasets_root>
# 2. Train (or fine-tune) the encoder
python encoder_train.py my_run <datasets_root>/SV2TTS/encoder
# 3. Extract embeddings with the trained encoder
python generate_embeddings.py
```
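As a rough illustration of the final step, a minimal batch-extraction loop using the inference API shown above might look like the following sketch. The directory paths and the one-`.npy`-per-utterance output format are assumptions, and `generate_embeddings.py` itself may take different arguments:

```python
from pathlib import Path
import numpy as np

from encoder import inference as EmotionEncoder
from encoder.inference import embed_utterance, preprocess_wav

# Hypothetical paths; adjust to your setup.
EmotionEncoder.load_model(Path("path/to/emotion_encoder"))
wav_dir = Path("path/to/wavs")
out_dir = Path("path/to/embeddings")
out_dir.mkdir(parents=True, exist_ok=True)

# Embed every utterance and save one .npy file per waveform.
for wav_path in sorted(wav_dir.glob("*.wav")):
    emo_embed = embed_utterance(preprocess_wav(wav_path))
    np.save(out_dir / f"{wav_path.stem}.npy", emo_embed)
```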
This implementation uses parts of the code from the following GitHub repos: Real-Time-Voice-Cloning and NATSpeech, as described in our code.
If you find this code useful in your research, please cite our work:
```bib
@article{huang2022generspeech,
  title={GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis},
  author={Huang, Rongjie and Ren, Yi and Liu, Jinglin and Cui, Chenye and Zhao, Zhou},
  journal={Advances in Neural Information Processing Systems},
  year={2022}
}
```
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.