VoxConverse is an audio-visual diarisation dataset consisting of multispeaker clips of human speech, extracted from YouTube videos.
Updates and additional information about the dataset can be found at our website.
Version 0.3
We recently detected an error in some of our test set RTTM files. The corrected files are available in this master branch; please use version 0.3 for accurate labels.
Version 0.2
To access the previous version, see the ver0.2 branch of this repository.
Audio files
Dev set audio files can be downloaded from here.
Test set audio files can be downloaded from here.
Speaker Diarisation annotations
Annotations are provided as Rich Transcription Time Marked (RTTM) files and can be found in the dev and test folders.
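Each `SPEAKER` line in an RTTM file records a file ID, an onset time and a duration (both in seconds), and a speaker label. A minimal sketch of reading these fields in Python is shown below; the file ID `abjxc` and speaker labels in the example lines are hypothetical, not taken from the dataset.

```python
from collections import namedtuple

Segment = namedtuple("Segment", ["file_id", "onset", "duration", "speaker"])

def parse_rttm(lines):
    """Parse RTTM SPEAKER lines into (file_id, onset, duration, speaker) segments.

    RTTM fields are space-separated:
    type, file, channel, onset, duration, ortho, stype, name, conf, slat
    """
    segments = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip blank lines and non-speaker records
        segments.append(
            Segment(fields[1], float(fields[3]), float(fields[4]), fields[7])
        )
    return segments

# Hypothetical example lines in the RTTM layout described above.
example = [
    "SPEAKER abjxc 1 0.51 4.83 <NA> <NA> spk00 <NA> <NA>",
    "SPEAKER abjxc 1 5.34 2.10 <NA> <NA> spk01 <NA> <NA>",
]
segments = parse_rttm(example)
```

In practice you would pass the lines of one of the dev or test RTTM files to `parse_rttm` instead of the inline example.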
Citation
Please cite the following if you make use of the dataset.
@inproceedings{chung2020spot,
title={Spot the conversation: speaker diarisation in the wild},
author={Chung, Joon Son and Huh, Jaesung and Nagrani, Arsha and Afouras, Triantafyllos and Zisserman, Andrew},
booktitle={Interspeech},
year={2020}
}
In order to obtain videos with a large amount of overlapping speech, we used data consisting of political debates and news segments. The views and opinions expressed by speakers in the dataset are those of the individual speakers and do not necessarily reflect positions of the University of Oxford, Naver Corporation, or the authors of the paper.
We would also like to note that the distribution of identities in this dataset may not be representative of the global human population. Please be mindful of unintended societal, gender, racial, linguistic and other biases when training or deploying models on this data.
About
Spot the conversation: speaker diarisation in the wild