Abstract
Recognizing the emotional content of speech signals has recently received considerable research attention, and systems have been developed to recognize the emotional content of a spoken utterance. Achieving high accuracy in speech emotion recognition remains challenging because of issues related to feature extraction, type, and size. This study aims to increase emotion recognition accuracy by porting the bag-of-words (BoW) technique from image processing to speech for feature processing and clustering. The BoW technique is applied to features extracted from Mel frequency cepstral coefficients (MFCC), which enhances feature quality. Several classification approaches are deployed to examine the performance of the embedded BoW approach: support vector machine (SVM), K-nearest neighbor (KNN), naive Bayes (NB), random forest (RF), and extreme gradient boosting (XGBoost). Experiments used the standard RAVDESS audio dataset with eight emotions: angry, calm, happy, surprised, sad, disgusted, fearful, and neutral. The maximum accuracy, obtained for the angry class using SVM, was 85%, while overall accuracy was 80.1%. The experiments indicate that using BoW achieves better accuracy and processing time than other available methods.
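To make the BoW idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that per-frame feature vectors (for example MFCC frames) are clustered with plain k-means into a codebook of "audio words", and that each utterance is then encoded as a normalized histogram of codeword counts. The function names, the codebook size, and the synthetic frames are all hypothetical; the abstract does not specify the clustering method or its parameters.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(frame, codebook):
    """Index of the codeword closest to one feature frame."""
    return min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))

def fit_codebook(frames, n_words, n_iter=20, seed=0):
    """Plain k-means over pooled frames; the centroids act as 'audio words'."""
    rng = random.Random(seed)
    codebook = [list(f) for f in rng.sample(frames, n_words)]
    for _ in range(n_iter):
        groups = [[] for _ in range(n_words)]
        for f in frames:
            groups[nearest(f, codebook)].append(f)
        for k, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster is empty
                codebook[k] = [sum(col) / len(members) for col in zip(*members)]
    return codebook

def bow_histogram(frames, codebook):
    """Fixed-length, L1-normalized histogram of codeword counts for one utterance."""
    counts = [0] * len(codebook)
    for f in frames:
        counts[nearest(f, codebook)] += 1
    return [c / len(frames) for c in counts]

# Synthetic stand-in for MFCC frames (hypothetical data, 3 coefficients per frame):
# two utterances of different lengths drawn around different means.
rng = random.Random(1)
utt_a = [[rng.gauss(0, 0.1) for _ in range(3)] for _ in range(40)]
utt_b = [[rng.gauss(5, 0.1) for _ in range(3)] for _ in range(25)]

codebook = fit_codebook(utt_a + utt_b, n_words=4)
hist_a = bow_histogram(utt_a, codebook)
hist_b = bow_histogram(utt_b, codebook)
```

Under these assumptions, utterances of different lengths map to vectors of the same length (the codebook size), which is what allows a fixed-input classifier such as SVM or KNN to be trained on them.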
Highlights
Speech is a natural modality of human-machine interaction
Speech emotion recognition systems have been used in forensic science, to investigate and detect criminals based on their speech and emotions [6]
Confusion matrices of the classification results for multiple classes using support vector machine (SVM), naive Bayes (NB), K-nearest neighbor (KNN), random forest (RF) and XGBoost are shown in Fig. 3 for all data from speech and song
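As an illustration of how per-class and overall figures can be read off a confusion matrix, here is a small sketch. It assumes that rows are true classes and columns are predicted classes, and that "per-class accuracy" means the row-normalized diagonal (recall); the source does not state which definition was used. The 3-class counts are hypothetical and are not taken from the paper.

```python
def per_class_recall(cm):
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]

def overall_accuracy(cm):
    """Correct predictions (diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Hypothetical counts: rows = true class, columns = predicted class.
toy_cm = [
    [9, 1, 0],
    [2, 7, 1],
    [1, 1, 8],
]

class_scores = per_class_recall(toy_cm)
overall = overall_accuracy(toy_cm)
```

With these toy counts the best class scores higher than the overall figure, which is the same kind of gap the abstract reports between the best single class and the overall accuracy.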
Summary
Speech is a natural modality of human-machine interaction. The purpose of sophisticated speech systems should not be limited to message processing; rather, they should understand the underlying intentions of the speaker by detecting expressions in speech [1]. Effective speech emotion recognition models should be able to recognize speakers' emotions and perform the corresponding actions. Some limitations degrade most emotion recognition models for almost all existing emotional speech databases [7]. The primary issue limiting recognition accuracy is the lack of benchmark databases that can be shared among researchers. Another issue is the lack of coordination among researchers in this field: the same recording mistakes are repeated across different emotional speech databases. The main advantages of using BoW are increased recognition accuracy and reduced processing time. The remainder of this paper is organized as follows: Section 2 discusses previous studies related to speech emotion recognition.