Speaker recognition is an important application of digital speech processing. However, a major challenge to the robustness of speaker-recognition systems is variation in speakers' emotional states, such as happiness, anger, sadness, or surprise. In this paper, we propose a speaker-recognition system for three conditions, namely emotional, neutral, and mixed, in which no consideration is given to the speaker's state (i.e., the speaker may be in either an emotional or a neutral state), for two languages: Arabic and English. We also applied cross-language speaker recognition in the emotional, neutral, and combined (emotional + neutral) conditions. The main system is a convolutional recurrent neural network (CRNN) built from convolutional neural network (CNN) and long short-term memory (LSTM) models, and we investigated the use of linearly spaced spectrograms as speech-feature inputs. The proposed system utilizes the KSUEmotions, Emotional Prosody Speech and Transcripts, WEST POINT, and TIMIT corpora. The CRNN achieved accuracies as high as 97.4% and 97.18% for Arabic and English emotional speech inputs, respectively, and 99.89% and 99.4% for Arabic and English neutral speech inputs, respectively. In the cross-language experiments, the overall CRNN accuracy was as high as 91.83%, 99.88%, and 95.36% for the emotional, neutral, and (emotional + neutral) conditions, respectively.
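To make the described pipeline concrete, the sketch below shows one way a CNN + LSTM speaker classifier over linearly spaced spectrograms can be assembled. It is a minimal illustration only, not the paper's reported configuration: the layer widths, kernel sizes, STFT parameters (`n_fft=512`, `hop=160`), and last-time-step pooling are all illustrative assumptions.

```python
import torch
import torch.nn as nn


def linear_spectrogram(wav, n_fft=512, hop=160):
    """Magnitude STFT with linearly spaced frequency bins.

    wav: (batch, samples) waveform tensor.
    Returns (batch, 1, n_fft // 2 + 1, frames); n_fft and hop are
    illustrative values, not the paper's settings.
    """
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft),
                      return_complex=True)
    return spec.abs().unsqueeze(1)  # add a channel axis for the CNN


class CRNN(nn.Module):
    """Minimal CNN front end followed by an LSTM, as a speaker classifier."""

    def __init__(self, n_speakers, freq_bins=257):
        super().__init__()
        # Convolutional front end: local time-frequency feature extraction.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # halve both frequency and time axes
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Two 2x poolings shrink the frequency axis by a factor of 4.
        self.lstm = nn.LSTM(input_size=64 * (freq_bins // 4),
                            hidden_size=128, batch_first=True)
        self.fc = nn.Linear(128, n_speakers)

    def forward(self, x):
        # x: (batch, 1, freq, time)
        feats = self.conv(x)                       # (batch, 64, freq/4, time/4)
        b, c, f, t = feats.shape
        # Treat the reduced time axis as the recurrent sequence.
        seq = feats.permute(0, 3, 1, 2).reshape(b, t, c * f)
        out, _ = self.lstm(seq)                    # (batch, time/4, 128)
        return self.fc(out[:, -1])                 # classify from the last step


# Example: 10 hypothetical speakers, 2 s of 16 kHz audio per utterance.
wav = torch.randn(4, 32000)
model = CRNN(n_speakers=10)
logits = model(linear_spectrogram(wav))            # (4, 10) class scores
```

A real system trained on the corpora named above would of course tune the spectrogram resolution, network depth, and sequence pooling; the point here is only the CRNN structure, in which the CNN summarizes local time-frequency patterns and the LSTM models their evolution over the utterance.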