Abstract
Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. Today, speech emotion recognition (SER) systems evaluate the emotional state of a speaker by analyzing his/her speech signal. Emotion recognition is a challenging task for a machine, and making a machine smart enough to recognize emotions efficiently is equally challenging. The speech signal is hard to examine using signal processing methods because it consists of different frequencies and features that vary with emotions such as anger, fear, sadness, happiness, boredom, disgust, and surprise. Although various algorithms have been developed for SER, success rates remain low and vary with the language, the emotions, and the database. In this paper, we propose a new lightweight and effective SER model that has low computational complexity and high recognition accuracy. The suggested method uses a convolutional neural network (CNN) to learn deep frequency features by using a plain rectangular filter with a modified pooling strategy that has more discriminative power for SER. The proposed CNN model was trained on the frequency features extracted from the speech data and was then tested to predict the emotions. The proposed SER model was evaluated on two benchmark speech datasets, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Berlin Emotional Speech Database (EMO-DB), where it obtained recognition accuracies of 77.01% and 92.02%, respectively. The experimental results demonstrate that the proposed CNN-based SER system achieves better recognition performance than state-of-the-art SER systems.
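To make the architectural idea concrete, the sketch below shows a minimal CNN that applies rectangular (non-square) convolution kernels along the frequency axis of a log-mel spectrogram, followed by non-square pooling and a small classifier. This is not the authors' exact architecture: the kernel shapes, channel counts, pooling windows, and the seven-class output are illustrative assumptions chosen to reflect the "rectangular filter with a modified pooling strategy" described in the abstract.

```python
# Hedged sketch of a spectrogram CNN with rectangular kernels for SER.
# All layer sizes are assumptions, not the paper's published configuration.
import torch
import torch.nn as nn

class RectKernelSER(nn.Module):
    def __init__(self, n_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            # Rectangular kernel: tall in frequency, short in time, so each
            # filter summarizes a band of frequencies per time frame.
            nn.Conv2d(1, 32, kernel_size=(9, 3), padding=(4, 1)),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            # Pool more aggressively along frequency than time; a non-square
            # pooling window stands in here for the "modified" pooling.
            nn.MaxPool2d(kernel_size=(4, 2)),
            nn.Conv2d(32, 64, kernel_size=(9, 3), padding=(4, 1)),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=(4, 2)),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # collapse remaining freq/time dims
            nn.Flatten(),
            nn.Linear(64, n_classes),  # one logit per emotion class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, n_frames), e.g. a log-mel spectrogram
        return self.classifier(self.features(x))

if __name__ == "__main__":
    model = RectKernelSER()
    spec = torch.randn(8, 1, 128, 256)  # batch of 8 dummy spectrograms
    print(model(spec).shape)            # torch.Size([8, 7])
```

The design choice being illustrated is that non-square kernels and pooling let the network aggregate information across frequency bands faster than across time, which is one plausible reading of how such a model keeps its parameter count, and hence its computational cost, low.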
Highlights
The affective content analysis of speech signals is currently an active area of investigation
We propose a simple and lightweight convolutional neural network (CNN) architecture with multiple layers that uses modified kernels and a modified pooling strategy to detect sensitive emotional cues by extracting deep frequency features from speech spectrograms, which tend to be more discriminative and reliable for speech emotion recognition
We empirically validate our system by testing it on two benchmark datasets
Summary
The affective content analysis of speech signals is currently an active area of investigation. Speech is the most prevalent way for human beings to exchange information, and it therefore deserves attention in human-computer interaction (HCI). The most significant factor in human speech is emotion, which can be analyzed to make judgments about human expressions, paralanguage, and other cues. The speech signal is an efficient channel for fast communication in HCI and allows human behavior to be recognized effectively. Emotion recognition in speech signals is one of the fastest emerging research fields, and researchers have developed methods to detect emotions naturally from a speech signal [1,2]. The theory of speech emotion recognition (SER) is beneficial for education and health, and it will be widely used in these fields once such systems are proposed [3]