Abstract
Recently, a deep speaker embedding called the x-vector has been proposed. It is extracted from a deep neural network (DNN) and is considered a strong contender for the next-generation representation for speaker recognition. However, training such DNNs requires a large amount of data, usually thousands of hours. To apply x-vectors to a Mandarin task when only a small amount of Mandarin data is available, one can train the DNNs on another language and then fine-tune them with the limited target-language data. First, we propose a purely data-driven method for transferring DNNs across languages and tasks. Second, we investigate how to choose between training DNNs from scratch and reusing a pre-trained model through the proposed transfer method. To answer this question, we present the results of adapting an x-vector-based speaker verification system from English to Mandarin by fine-tuning the front-end DNNs. We examine how performance improves under the two training strategies as the amount of training data increases. Experimental results show that adapting a pre-trained English model with a small amount of Mandarin data readily reduces the equal error rate (EER). They also show that a system trained from scratch achieves better performance only when it is fed enough data. Finally, we test the two systems in noisy environments and find that the system trained from scratch outperforms the system fine-tuned from a pre-trained model.
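For readers unfamiliar with the metric, the equal error rate is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of one common way to estimate it from verification scores, not the evaluation code used in this work. The function name `equal_error_rate`, the threshold sweep over observed scores, and the toy scores in the usage lines are all assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER and its threshold from trial scores.

    Higher scores are assumed to mean "more likely the same speaker".
    This is an illustrative sketch, not the paper's scoring tool.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)

    # Candidate thresholds: every distinct observed score (sorted ascending).
    thresholds = np.unique(np.concatenate([target, nontarget]))

    # False rejection rate: target trials scoring below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials scoring at or above it.
    far = np.array([(nontarget >= t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest; averaging
    # them handles the case where they never coincide exactly.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


# Toy scores (hypothetical, for illustration only).
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With a discrete set of trials the two error curves may not cross exactly, so this sketch reports the average of the two rates at the closest point; interpolating between adjacent thresholds is an equally reasonable choice.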