Comparison with UNIVERSE diffusion model
Additionally, we provide a comparison of our model with UNIVERSE on their validation data. Our model is less prone to hallucinations while delivering high perceptual quality.
[Audio samples: Input | Ground Truth | Finally (ours) | UNIVERSE — three example sets]
Additional Comparison with HiFi-GAN-2 and UNIVERSE
[Audio samples: Input | HiFi-GAN-2 | Ours]
[Audio samples: Input | UNIVERSE | Ours | Ground truth]
Examples of clusters obtained during LMOS studies
As mentioned in our paper, we generated the clusters with the help of VITS. In this part we provide examples from different clusters. As can be heard, the diversity of samples within a single cluster is not caused by mismatches in phrase, speaker, or phoneme duration. The WavLM distance tends to preserve this structure, whereas the L2 distance, for instance, usually does not.
| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
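To make the distance comparison above concrete, here is a minimal, self-contained sketch. It is an illustration only: as a stand-in for WavLM embeddings (which require the pretrained model and real audio), we use the magnitude spectrum of synthetic tones, which shares the relevant property of discarding phase. The sketch shows why a feature-space distance can group perceptually matching signals that raw waveform L2 treats as far apart.

```python
import numpy as np

def l2(a, b):
    """Plain Euclidean (L2) distance between two vectors."""
    return float(np.linalg.norm(a - b))

def embed(x):
    # Stand-in for a learned representation such as WavLM features:
    # the magnitude spectrum is invariant to time/phase shifts.
    return np.abs(np.fft.rfft(x))

sr = 16000
t = np.arange(sr) / sr
a = np.sin(2 * np.pi * 220 * t)                # 220 Hz tone
b = np.sin(2 * np.pi * 220 * t + np.pi / 2)    # same tone, phase-shifted
c = np.sin(2 * np.pi * 440 * t)                # a different tone

# Raw-waveform L2: the phase-shifted copy of the SAME tone looks roughly
# as distant as a completely DIFFERENT tone.
d_raw_ab = l2(a, b)
d_raw_ac = l2(a, c)

# Feature-space L2: the phase-shifted copy collapses to (near-)zero
# distance, while the different tone stays far away.
d_emb_ab = l2(embed(a), embed(b))
d_emb_ac = l2(embed(a), embed(c))

print(f"raw:      d(a,b)={d_raw_ab:.1f}  d(a,c)={d_raw_ac:.1f}")
print(f"embedded: d(a,b)={d_emb_ab:.6f}  d(a,c)={d_emb_ac:.1f}")
```

With the actual model, the same effect is what lets a WavLM-based distance keep samples from one cluster together despite surface-level waveform differences, while plain L2 on waveforms scatters them.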
Data
To ensure a fair comparison with our work, we provide samples from five datasets enhanced by our model:
- VoxCeleb: 50 audio clips from VoxCeleb1 (Nagrani et al., 2017), covering the Speech Transmission Index (STI) range of 0.75-0.99, balanced between male and female speakers. Link
- UNIVERSE: 100 audio clips randomly generated by the authors of UNIVERSE (Serrà et al., 2022) from clean utterances sampled from VCTK and Harvard sentences, alongside noises from DEMAND and FSDnoisy18k. The data includes various simulated distortions like band limiting, reverberation, codec, and transmission artifacts. For more details, refer to (Serrà et al., 2022). Link
- VCTK-DEMAND: Validation samples from the Valentini denoising benchmark (Valentini-Botinhao et al., 2017). This dataset facilitates broad comparisons across various speech enhancement models, with a test set of 824 utterances containing artificially simulated noisy samples from 2 speakers at 4 SNR levels (17.5, 12.5, 7.5, and 2.5 dB). Link
- LibriTTS: A multi-speaker corpus of English speech at a 24 kHz sampling rate, originally intended for TTS. We provide an enhanced version of 100 randomly selected samples from the test-other set. Link
- Deep Noise Suppression Challenge: We provide an enhanced version of the dns5-blind-testset data for both headset and non-headset tracks. For more information about the challenge, please refer to the DNS GitHub page. Our enhanced data is available through the link.
BibTeX
@misc{babaev2024finallyfastuniversalspeech,
title = {{FINALLY}: fast and universal speech enhancement with studio-like quality},
author = {Babaev, Nicholas and Tamogashev, Kirill and Saginbaev, Azat and Shchekotov, Ivan and Bae, Hanbin and Sung, Hosang and Lee, WonJun and Cho, Hoon-Young and Andreev, Pavel},
year = {2024},
url = {https://arxiv.org/abs/2410.05920},
eprint = {2410.05920},
archiveprefix = {arXiv},
primaryclass = {cs.SD}
}