Abstract

Speaker identification has recently attracted considerable attention in speaker recognition. Environmental noise and short utterances pose two challenges to accurate speaker identification. In this paper, a network model with new feature extraction methods and a new bi-directional long short-term memory network is proposed to identify the speaker. Specifically, the mel-spectrogram and cochleagram are combined to generate two new features, named the MC-spectrogram and MC-cube. These features are more robust and capture richer voiceprint information from short utterances. Multi-dimensional CNNs are then applied to process the MC-spectrogram and MC-cube features correspondingly; their multi-dimensional convolution kernels learn voiceprint features more efficiently. In addition, CNNs ignore contextual information, and the forward voiceprint features are more crucial because voiceprint information concentrates in the latter part of a short utterance. An asymmetric bi-directional long short-term memory network (ABLSTM) is therefore proposed to further learn voiceprint features at the global level, improving the accuracy of speaker identification. Depending on the dimensionality of the input, the proposed network model takes diverse forms, named Audio-1DCNN-ABLSTM, MCS (MC-spectrogram)-2DCNN-ABLSTM, and MCC (MC-cube)-3DCNN-ABLSTM. Experimental results show that these forms achieve superior accuracy and robustness on short utterances with added environmental noise. Furthermore, the proposed network model provides a reliable solution for text-independent speaker identification.
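As an illustrative sketch only (the abstract does not specify the exact fusion), the MC-spectrogram and MC-cube could be formed by combining a mel-spectrogram and a cochleagram of the same utterance: concatenating them along the frequency axis yields a 2-D input suitable for a 2-D CNN, while stacking them as channels yields a 3-D input suitable for a 3-D CNN. The arrays below are random stand-ins for features that would in practice come from a mel filterbank and a gammatone (cochlear) filterbank; the shapes and the choice of concatenation versus channel stacking are assumptions.

```python
import numpy as np

# Hypothetical stand-ins for real time-frequency features; in practice these
# would be computed from audio via a mel filterbank and a gammatone/cochlear
# filterbank, with matching numbers of bands and frames.
n_bands, n_frames = 64, 100
mel_spec = np.random.rand(n_bands, n_frames)   # mel-spectrogram (bands x frames)
cochleagram = np.random.rand(n_bands, n_frames)  # cochleagram (bands x frames)

# MC-spectrogram (assumed): concatenate along the frequency axis,
# giving one 2-D feature map for a 2-D CNN.
mc_spectrogram = np.concatenate([mel_spec, cochleagram], axis=0)

# MC-cube (assumed): stack the two features as channels,
# giving a 3-D tensor for a 3-D CNN.
mc_cube = np.stack([mel_spec, cochleagram], axis=0)

print(mc_spectrogram.shape)  # (128, 100)
print(mc_cube.shape)         # (2, 64, 100)
```

Either form preserves both spectral views of the utterance, which is consistent with the abstract's claim that the combined features carry richer voiceprint information than either representation alone.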
