Abstract

As humans, we express ourselves most naturally through speech. Speech emotion recognition systems are sets of techniques for processing and classifying speech signals in order to detect the emotions expressed in them. For emotion recognition in audio files, a novel 1D convolutional neural network (Audio_EmotionModel) is designed, which contains 5 1D convolutional layers, 5 max pooling layers and 3 dropout layers. The Audio_EmotionModel is applied to the audio portion of the RAVDESS dataset. For emotion recognition in images (obtained by splitting videos into frames), a novel 2D convolutional neural network (Image_EmotionModel) is designed, which contains 4 2D convolutional layers, 2 max pooling layers, 3 dropout layers and 2 batch normalization layers. The Image_EmotionModel is applied to the video portion of the RAVDESS dataset, with each video converted into frames. The results indicate that the two proposed models perform better than various state-of-the-art models. Human recognition [29] achieved only 40.9% when recognizing the emotions, so the proposed Audio_EmotionModel outperforms human recognition by a margin of about 50%.

Keywords: Emotion recognition, Speech recognition, Convolutional neural network, Deep learning
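
The layer counts stated in the abstract can be written down as a small, framework-free sketch. The abstract gives only the number of layers of each type, so everything else here is an assumption made for illustration: the ordering of the layers, the kernel size of 3, the pool size of 2, the absence of padding, and the input length of 256 samples. The names `Layer`, `out_len`, `build_audio_stack`, `build_image_stack` and `trace` are hypothetical helpers, not the authors' code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    kind: str        # e.g. "conv1d", "maxpool1d", "dropout"
    kernel: int = 1  # kernel or pool size (assumed, not given in the abstract)
    stride: int = 1  # stride (assumed)


def out_len(n: int, layer: Layer) -> int:
    """Output length of an unpadded 1D convolution or pooling layer."""
    if layer.kind in ("conv1d", "maxpool1d"):
        return (n - layer.kernel) // layer.stride + 1
    return n  # dropout does not change the shape


def build_audio_stack(kernel: int = 3, pool: int = 2) -> list:
    """5 conv + 5 max pooling + 3 dropout layers; the ordering is assumed."""
    layers = []
    for i in range(5):
        layers.append(Layer("conv1d", kernel, 1))
        layers.append(Layer("maxpool1d", pool, pool))
        if i >= 2:  # assumed: dropout after the last three blocks
            layers.append(Layer("dropout"))
    return layers


def build_image_stack() -> list:
    """4 conv + 2 max pooling + 3 dropout + 2 batch norm; ordering assumed."""
    kinds = [
        "conv2d", "batchnorm", "conv2d", "maxpool2d", "dropout",
        "conv2d", "batchnorm", "conv2d", "maxpool2d", "dropout",
        "dropout",
    ]
    return [Layer(kind) for kind in kinds]


def trace(n: int, layers: list) -> int:
    """Propagate a 1D input length through the stack."""
    for layer in layers:
        n = out_len(n, layer)
    return n


audio = build_audio_stack()
image = build_image_stack()
print(Counter(layer.kind for layer in audio))
print(Counter(layer.kind for layer in image))
print(trace(256, audio))  # temporal length left after the five blocks
```

Under these assumed hyperparameters, each convolution and pooling pair roughly halves the temporal length, which is why five such pairs reduce even a fairly long audio feature sequence to a handful of steps before classification.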
