Abstract

Speaker recognition is an important application of digital speech processing. However, a major challenge degrading the robustness of speaker-recognition systems is variation in the emotional states of speakers, such as happiness, anger, sadness, or surprise. In this paper, we propose a speaker-recognition system for three conditions, namely emotional, neutral, and a combined condition in which no assumption is made about the speaker's state (i.e., the speaker may be either emotional or neutral), for two languages: Arabic and English. Additionally, cross-language speaker recognition was applied in the emotional, neutral, and (emotional + neutral) states. Convolutional neural network (CNN) and long short-term memory (LSTM) models were combined to design the main convolutional recurrent neural network (CRNN) system. We also investigated the use of linearly spaced spectrograms as speech-feature inputs. The proposed system uses the KSUEmotions, Emotional Prosody Speech and Transcripts, WEST POINT, and TIMIT corpora. The CRNN system achieved accuracies as high as 97.4% and 97.18% for Arabic and English emotional speech inputs, respectively, and 99.89% and 99.4% for Arabic and English neutral speech inputs, respectively. For the cross-language experiments, the overall CRNN system accuracy was as high as 91.83%, 99.88%, and 95.36% for the emotional, neutral, and (emotional + neutral) states, respectively.
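As a rough illustration of the architecture described above, the following Python sketch stacks a small CNN front end and an LSTM layer into a CRNN classifier over spectrogram inputs. It assumes Keras/TensorFlow; the layer sizes, input shape, and optimizer settings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal CRNN (CNN + LSTM) speaker-classifier sketch.
# All hyperparameters below are hypothetical placeholders.
from tensorflow.keras import layers, models

def build_crnn(n_speakers, input_shape=(128, 128, 1)):
    """Convolutional front end followed by an LSTM over the time axis."""
    inputs = layers.Input(shape=input_shape)          # spectrogram: (time, freq, 1)
    x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), activation="relu", padding="same")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    # Collapse the frequency and channel axes so the LSTM sees a time sequence.
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.LSTM(128)(x)
    outputs = layers.Dense(n_speakers, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_crnn(n_speakers=14)   # e.g., the 14 KSUEmotions Phase 2 speakers
model.summary()
```

The key design choice in any CRNN of this kind is the reshape between the convolutional and recurrent stages, which turns the pooled feature maps into a per-frame feature sequence for the LSTM.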

Highlights

  • Identifying a person by their voice is an important human trait that is typically taken for granted in natural human-to-human interactions/communications

  • Speaker recognition in the Arabic emotional state: in this experiment, spectrograms were extracted from all recorded audio files in Phase 2 of the KSUEmotions corpus, for a total of 1,400 files corresponding to five emotions, namely neutral, sadness, happiness, surprise, and anger, for 14 speakers (7 males and 7 females); see the spectrogram-extraction sketch after this list

  • The results generated by the designed convolutional recurrent neural network (CRNN) over 10 runs with similar system parameters are presented in Figure 5 for Arabic, English, and cross-language classification in the three states of emotional, neutral, and (emotional + neutral)
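The spectrogram extraction mentioned in the highlights can be sketched as follows. The sketch assumes librosa for loading audio and computing a linear-frequency (linearly spaced) spectrogram; the file name, FFT size, and hop length are hypothetical rather than the paper's actual settings.

```python
# Hedged sketch: linear-frequency spectrogram extraction from a WAV file.
import numpy as np
import librosa

def linear_spectrogram(path, n_fft=512, hop_length=256):
    y, sr = librosa.load(path, sr=None)                  # keep the original sampling rate
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    spec_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
    return spec_db                                        # shape: (1 + n_fft//2, frames)

spec = linear_spectrogram("speaker01_angry_001.wav")      # hypothetical file name
```

Each resulting array can then be resized or padded to a fixed shape and fed to the CRNN as a single-channel image.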


Summary

INTRODUCTION

Identifying a person by their voice is an important human trait that is typically taken for granted in natural human-to-human interactions/communications. Human speech, which is a performance biometric, differs from other kinds of biometrics (such as hand geometry or fingerprints) [1] in that voice biometrics are the only commercial biometric product that processes acoustic information [2]. The emotional state of a speaker significantly affects vocalization, and the speech emotion in the training set can differ from that in the test set, which can lead to system degradation. Studies have indicated that approximately 90% of human daily life is affected by different emotions, while only 10% is unemotional [9], [10]. These emotions affect the speech production system by introducing changes in speech loudness, muscle tension, breathing rate, etc. At the language-model level, high-level knowledge (i.e., emotion-specific cues) is included [11].

LITERATURE REVIEW
EXPERIMENTAL SETUP
CRNN MODEL
CRNN MODEL ARCHITECTURE
EVALUATION OF SYSTEM ACCURACY
RESULTS
CONCLUSION