Abstract
Speaker recognition is also known as voiceprint recognition. The current state-of-the-art approach extracts features of a speaker's speech with deep neural networks; the embedded feature extracted by a DNN is generally called an x-vector. Recently, ResNet-based structures have received extensive attention and have gradually become a basis for speaker recognition research. In terms of model input, the most commonly used acoustic features include Linear Prediction Coefficients (LPC), Mel-Frequency Cepstral Coefficients (MFCC), Mel filter banks (F-bank), and spectrograms. However, a single feature cannot reveal all the characteristics of speech. In this paper, we propose a text-independent speaker recognition algorithm based on fused features and the x-vector architecture: we use LPC, F-bank, and spectrogram acoustic features and fuse them at the frame level; we adopt the currently popular ResNet as the training model and modify its structure; and we use the additive angular margin loss as the classification loss function. Experiments show that our proposed fused features and modified ResNet achieve a remarkable Equal Error Rate of 0.9 on the VCTK dataset, which greatly improves the accuracy of speaker recognition.
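The frame-level fusion described above can be sketched as concatenating the per-frame feature matrices along the feature axis. The snippet below is a minimal illustration, not the paper's implementation; the feature dimensionalities and the helper name `fuse_frame_level` are illustrative assumptions.

```python
import numpy as np

# Hypothetical per-frame features for one utterance (dimensions are
# illustrative assumptions, not taken from the paper):
n_frames = 200
lpc = np.random.randn(n_frames, 12)     # Linear Prediction Coefficients
fbank = np.random.randn(n_frames, 40)   # Mel filter-bank energies
spec = np.random.randn(n_frames, 257)   # spectrogram frames

def fuse_frame_level(*features):
    """Concatenate per-frame feature matrices along the feature axis.

    All inputs must be aligned to the same number of frames (axis 0),
    which frame-level fusion requires.
    """
    frame_counts = {f.shape[0] for f in features}
    assert len(frame_counts) == 1, "features must share the same frame count"
    return np.concatenate(features, axis=1)

fused = fuse_frame_level(lpc, fbank, spec)
print(fused.shape)  # (200, 309): one fused vector per frame
```

The fused matrix then serves as the network input in place of any single feature, so the model sees all three representations for every frame.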