Abstract

Recently, a type of deep speaker embedding called the x-vector has been proposed. It is extracted from a deep neural network (DNN) and is considered a strong contender for the next-generation representation for speaker recognition. However, training such DNNs requires a large amount of data, usually thousands of hours. If we want to apply x-vectors to a Mandarin task but have only a small amount of Mandarin data, we can train the DNNs on another language and fine-tune them with the limited Mandarin data. First, we propose a purely data-driven method to transfer DNNs across languages and tasks. Second, we investigate how to choose between training DNNs from scratch and reusing a pre-trained model with the proposed transfer method. To answer this question, we present the results of adapting an x-vector based speaker verification system from English to Mandarin by fine-tuning the front-end DNNs, and we examine how the performance of the two training strategies improves as the data size increases. Experimental results show that adapting a pre-trained English model with a small amount of Mandarin data readily reduces the equal error rate (EER). They also demonstrate that a system trained from scratch achieves better performance only when it is fed enough data. Finally, we test the two systems in noisy environments and find that the system trained from scratch outperforms the system fine-tuned from a pre-trained model.
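A minimal sketch (not the authors' code) of the transfer strategy described above: load an x-vector front-end DNN pre-trained on English, replace the speaker-classification output layer for the Mandarin speaker set, and continue training on the small Mandarin corpus. The TDNN layer sizes, checkpoint file name, speaker count, and hyper-parameters below are illustrative assumptions.

import torch
import torch.nn as nn


class XVectorTDNN(nn.Module):
    """Simplified TDNN x-vector extractor: frame layers + statistics pooling + segment layers."""

    def __init__(self, feat_dim=30, embed_dim=512, num_speakers=5000):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.segment6 = nn.Linear(2 * 1500, embed_dim)    # the x-vector is taken here
        self.segment7 = nn.Linear(embed_dim, embed_dim)
        self.output = nn.Linear(embed_dim, num_speakers)  # language-specific speaker head

    def forward(self, x):  # x: (batch, feat_dim, frames)
        h = self.frame_layers(x)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # statistics pooling
        emb = self.segment6(stats)                                # x-vector embedding
        h = torch.relu(self.segment7(torch.relu(emb)))
        return self.output(h), emb


# 1) Build the model with the Mandarin speaker count and load the English weights
#    for every layer except the mismatched classification head.
num_mandarin_speakers = 800  # assumed size of the small Mandarin corpus
model = XVectorTDNN(num_speakers=num_mandarin_speakers)
english_state = torch.load("xvector_english.pt", map_location="cpu")  # assumed checkpoint
english_state = {k: v for k, v in english_state.items() if not k.startswith("output.")}
model.load_state_dict(english_state, strict=False)

# 2) Fine-tune the whole network with a small learning rate on Mandarin data.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Dummy Mandarin mini-batch: 8 utterances, 30-dim features, 300 frames each.
feats = torch.randn(8, 30, 300)
labels = torch.randint(0, num_mandarin_speakers, (8,))

model.train()
logits, _ = model(feats)
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

Training from scratch would instead initialize all layers randomly and train only on the Mandarin data; the abstract's finding is that this pays off only once enough data is available.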
