Abstract

The performance of a speaker recognition system depends heavily on the acoustic features it uses. Most speaker recognition systems use short-term acoustic features extracted from a single speech frame, the most popular being the Mel-frequency cepstral coefficients (MFCCs). Short-term features are generally static: neither the cepstral coefficients nor a single MFCCs frame carries dynamic information about the speech signal. In this paper, we use an analysis sparse representation model to introduce the long-term acoustic (LTA) feature for text-independent speaker recognition, a sparse representation of both the static features and the dynamic information in the speaker’s speech. First, the speech signal is segmented into overlapping frames, and MFCCs features are extracted from each frame. Super MFCCs frames are then constructed by stacking each frame with several of the frames that follow it, which captures the dynamic information of the speech signal. The super MFCCs frames are combined into a 2-D MFCCs feature map (MFCCsmap). Finally, the speaker model is built on the analysis sparse model, and the sparse representations of the MFCCsmap are used as the LTA features. A state-of-the-art deep neural network (DNN) is employed as the classifier for speaker recognition. The experimental results illustrate the effectiveness and robustness of the proposed system.
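The frame-stacking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `stack_super_frames`, the `context` parameter (how many following frames are stacked), and the choice to store one super frame per row of the map are all assumptions, and the MFCCs matrix is assumed to have been computed already.

```python
import numpy as np

def stack_super_frames(mfcc, context):
    """Build super frames by stacking each frame with its `context` following frames.

    mfcc    : array of shape (num_frames, num_coeffs), one MFCCs frame per row.
    context : number of following frames appended to each frame (hypothetical name).
    Returns an array of shape (num_frames - context, (context + 1) * num_coeffs);
    treating this 2-D array as the MFCCsmap is an assumption about its layout.
    """
    mfcc = np.asarray(mfcc, dtype=float)
    num_frames, _ = mfcc.shape
    if not 0 <= context < num_frames:
        raise ValueError("context must satisfy 0 <= context < num_frames")
    num_super = num_frames - context
    # Row t is the concatenation [mfcc[t], mfcc[t + 1], ..., mfcc[t + context]].
    return np.stack(
        [mfcc[t:t + context + 1].reshape(-1) for t in range(num_super)]
    )

# Toy input: 4 frames with 3 coefficients each, stacked with 1 following frame.
toy_mfcc = np.arange(12).reshape(4, 3)
toy_map = stack_super_frames(toy_mfcc, context=1)
```

Only frames that have a full set of following frames are kept here; padding the final frames instead would be an equally plausible choice.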

Highlights

  • Speaker recognition is the process of identifying a person based on the voice of the speaker [1].

  • We present the long-term acoustic (LTA) features, which include both the static and the dynamic information of the speech signal. They are obtained as the analysis sparse representations of the MFCCsmap under the speaker model and are used as the input of the deep neural network (DNN) classifier.

  • The MFCCsmap of the test speech is obtained in the same way as in the training phase. The speaker model then generates the LTA features, which are fed to the trained DNN classifier to perform speaker recognition.

Summary

INTRODUCTION

Speaker recognition is the process of identifying a person based on the voice of the speaker [1]. On the basis of the analysis sparse model, the sparse representations of the super MFCCs frames can be used as long-term acoustic (LTA) features that carry both the static and the dynamic information of the speech signal. We present LTA features obtained as the analysis sparse representations of the MFCCsmap under the speaker model, and we use them as the input of the DNN classifier. In the test phase, the MFCCsmap of the test speech is obtained in the same way as in the training phase, the speaker model generates the LTA features, and the trained DNN classifier uses them to perform speaker recognition.
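To make the idea of an analysis sparse representation concrete, here is a minimal sketch. In an analysis model, an operator applied to a signal is expected to give a sparse coefficient vector. The summary above does not say how the speaker model is learned or how sparsity is enforced, so everything here is an assumption for illustration only: the operator `omega`, the sparsity level `k`, and the use of hard thresholding (keeping the `k` largest-magnitude coefficients).

```python
import numpy as np

def analysis_sparse_code(omega, x, k):
    """Return a k-sparse version of the analysis coefficients omega @ x.

    omega : analysis operator of shape (p, d) (hypothetical stand-in for a speaker model).
    x     : 1-D signal of length d, e.g. one super MFCCs frame.
    k     : number of coefficients to keep (assumed sparsity level, k >= 1).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    z = np.asarray(omega, dtype=float) @ np.asarray(x, dtype=float)
    if k >= z.size:
        return z.copy()
    sparse = np.zeros_like(z)
    # Indices of the k coefficients with the largest absolute value.
    keep = np.argpartition(np.abs(z), -k)[-k:]
    sparse[keep] = z[keep]
    return sparse

# Toy example with an identity operator, so the coefficients equal the signal.
toy_code = analysis_sparse_code(np.eye(4), np.array([0.1, -3.0, 2.0, 0.5]), k=2)
```

In a full system, vectors like `toy_code` would play the role of the LTA features passed to the classifier; the operator would be learned from training speech rather than fixed by hand.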

MFCCSMAP
SPEAKER MODEL AND LONG-TERM ACOUSTIC FEATURES
DNN CLASSIFIER
Findings
CONCLUSION