Abstract
Speech is one of the most promising modalities for revealing the underlying emotion of a human being. Speech signals contain measurable parameters that reflect a person's affective state. Speech Emotion Recognition (SER) is the process of identifying the emotional elements of communication regardless of semantic content. Considerable research has been carried out in this area. This paper proposes an ensemble model to automatically classify speech signals into one of eight emotional classes: neutral, calm, angry, sad, happy, fearful, disgust, and surprised. In this work, speech spectral features have been extracted using Mel-Frequency Cepstral Coefficients (MFCC). An emotion classification model based on a 2-Dimensional Convolutional Neural Network (2D-CNN) and eXtreme Gradient Boosting (XGBoost) is proposed in this paper. This work also compares the performance of the proposed ensemble model with baseline models and other ensemble models. The accuracy of each model on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset is computed, and the proposed model achieves the highest accuracy in classifying emotions.
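To make the described pipeline concrete, the sketch below shows one plausible way to combine MFCC extraction, a 2D-CNN feature extractor, and an XGBoost classifier. The architecture, hyperparameters (N_MFCC, MAX_FRAMES, layer sizes), and the use of the CNN's penultimate layer as the input to XGBoost are assumptions for illustration, not the authors' exact configuration.

```python
# Illustrative sketch only: hyperparameters and architecture are assumed,
# not taken from the paper.
import numpy as np
import librosa
import xgboost as xgb
from tensorflow.keras import layers, models

N_MFCC = 40          # number of MFCC coefficients per frame (assumed)
MAX_FRAMES = 174     # fixed time dimension after padding/truncation (assumed)
NUM_CLASSES = 8      # RAVDESS emotion classes

def mfcc_features(path, sr=22050):
    """Load a clip and return a fixed-size 2D MFCC 'image' (N_MFCC x MAX_FRAMES)."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)
    # Pad or truncate along the time axis so every clip has the same shape.
    if mfcc.shape[1] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, 0), (0, MAX_FRAMES - mfcc.shape[1])))
    else:
        mfcc = mfcc[:, :MAX_FRAMES]
    return mfcc

def build_cnn():
    """2D-CNN over the MFCC 'image'; the penultimate dense layer doubles
    as a feature extractor for the XGBoost stage of the ensemble."""
    inp = layers.Input(shape=(N_MFCC, MAX_FRAMES, 1))
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    feats = layers.Dense(128, activation="relu", name="embedding")(x)
    out = layers.Dense(NUM_CLASSES, activation="softmax")(feats)
    model = models.Model(inp, out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Typical usage (X: (n, N_MFCC, MAX_FRAMES, 1) MFCC tensors, y: labels 0..7):
# cnn = build_cnn()
# cnn.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)
# embed = models.Model(cnn.input, cnn.get_layer("embedding").output)
# booster = xgb.XGBClassifier(objective="multi:softprob")
# booster.fit(embed.predict(X_train), y_train)
# y_pred = booster.predict(embed.predict(X_test))
```

One common design rationale for this kind of ensemble is that the CNN learns local time-frequency patterns from the MFCC representation, while the gradient-boosted trees can model non-linear interactions among the learned embedding features that a single softmax layer may miss; whether the paper stacks the models this way or combines their predictions differently is not stated in the abstract.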