This repo provides code for estimating DeepSpeech Distances, a family of new evaluation metrics for neural speech synthesis.
Details
The computation involves estimating Fréchet and Kernel distances between high-level features of the reference and the examined samples, extracted from a hidden representation of NVIDIA's DeepSpeech2 speech recognition model.
We propose four distances:
- Fréchet DeepSpeech Distance (FDSD, based on FID, see [2]),
- Kernel DeepSpeech Distance (KDSD, based on KID, see [3]),
- conditional Fréchet DeepSpeech Distance (cFDSD),
- conditional Kernel DeepSpeech Distance (cKDSD).
The conditional distances compare samples with the same conditioning (e.g. text) and assess the conditional quality of the audio. The unconditional ones compare random samples from two distributions and assess the general quality of the audio. For more details, see [1].
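For intuition, below is a minimal NumPy sketch of the two unconditional distances, assuming the DeepSpeech2 features have already been extracted as `(num_samples, dim)` arrays. The function names and the polynomial kernel choice follow the standard FID/KID definitions from [2] and [3]; they are illustrative and not this repo's API.

```python
import numpy as np
from scipy import linalg


def frechet_distance(feats_ref, feats_gen):
    """FID-style Frechet distance between Gaussian fits of two feature sets (cf. FDSD)."""
    mu1, mu2 = feats_ref.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_ref, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the product of the two covariances.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerics
    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)


def kernel_distance(feats_ref, feats_gen):
    """KID-style unbiased MMD^2 with the cubic polynomial kernel (cf. KDSD)."""
    d = feats_ref.shape[1]
    kernel = lambda x, y: (x @ y.T / d + 1.0) ** 3  # default KID kernel
    k_xx = kernel(feats_ref, feats_ref)
    k_yy = kernel(feats_gen, feats_gen)
    k_xy = kernel(feats_ref, feats_gen)
    m, n = len(feats_ref), len(feats_gen)
    # Unbiased estimator: exclude the diagonal of the within-set kernel matrices.
    term_x = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_y = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * k_xy.mean()
```

The conditional variants (cFDSD, cKDSD) apply the same estimators to pairs of samples that share the same conditioning text, rather than to random samples from each distribution.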
After that, go to /content/drive/My Drive/DeepSpeechDistances, open the demo notebook deep_speech_distances.ipynb, and follow the instructions therein.
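The path above corresponds to a Google Colab runtime with Google Drive mounted. A minimal sketch of that assumed setup is below; the exact steps in the demo notebook may differ.

```python
# Assumed Colab setup: mount Google Drive so the repository is visible
# under /content/drive, then change into the repo directory.
from google.colab import drive

drive.mount('/content/drive')  # prompts for authorization in Colab
%cd "/content/drive/My Drive/DeepSpeechDistances"
```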
Notes
We provide a TensorFlow meta graph file for DeepSpeech2 based on the original one available with the checkpoint. The provided file differs from the original only in lacking the map-reduce ops defined by the horovod library; the resulting model is therefore equivalent to the original.
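For illustration, a TF1-style meta graph of this kind can be restored as sketched below; the file names are hypothetical placeholders, not this repo's actual paths.

```python
# Sketch only: restoring a TF1 meta graph plus checkpoint weights.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()
with tf.Session() as sess:
    saver = tf.train.import_meta_graph('deepspeech2.meta')  # hypothetical file name
    saver.restore(sess, 'deepspeech2_checkpoint')           # hypothetical checkpoint prefix
    graph = tf.get_default_graph()  # hidden activations are then read from this graph
```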
This is an 'alpha' version of the API; although fully functional, it will be heavily updated and simplified soon.