Emotions play a key role in determining a person's mental state and indirectly express an individual's well-being. A speech emotion recognition system can extract a person's emotions from his/her speech inputs. There are universal emotions such as anger, disgust, fear, happiness, pleasantness, sadness, and neutral. Recognizing these emotions is especially significant in situations like the COVID-19 pandemic, when the aged or sick are vulnerable to depression. In the current paper, we examined various classification models under limited computational power and resources in order to determine a person's emotion from his/her speech. Prosodic features of speech, such as pitch, loudness, and tone, and spectral features, such as the Mel Frequency Cepstral Coefficients (MFCCs) of the voice, were used to analyze a person's emotions. Although state-of-the-art sequence-to-sequence models for speech emotion detection currently offer high accuracy and precision, the computational needs of such approaches are high, making them inefficient. Therefore, in this work, we emphasized the analysis and comparison of classification algorithms such as the multilayer perceptron (MLP), decision tree, and support vector machine, and deep neural networks such as the convolutional neural network (CNN) and long short-term memory (LSTM) network. Given an audio file, the emotions exhibited by the speaker were recognized using machine learning and deep learning techniques. A comparative study was performed to identify the most appropriate algorithms for recognizing emotions. Based on the experimental results, the MLP classifier and the CNN model offered better accuracy with smaller variations than the other models used in the study.
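To illustrate the MFCC-plus-classifier pipeline the abstract describes, the following is a minimal sketch assuming librosa for feature extraction and scikit-learn's MLPClassifier; the file names, labels, and hyperparameters are hypothetical and not taken from the paper.

```python
# Minimal sketch: MFCC feature extraction feeding an MLP emotion classifier.
# File paths, labels, and hyperparameters below are illustrative only.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def extract_mfcc(path, n_mfcc=40):
    """Load an audio file and return its mean MFCC vector over time."""
    signal, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # fixed-length summary per utterance

# Hypothetical dataset of (audio file, emotion label) pairs.
files = ["angry_01.wav", "happy_01.wav", "sad_01.wav", "neutral_01.wav"]
labels = ["anger", "happiness", "sadness", "neutral"]

X = np.array([extract_mfcc(f) for f in files])
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25)

clf = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=500)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

Averaging the MFCCs over time gives each utterance a fixed-length feature vector, which keeps the classifier lightweight compared with sequence-to-sequence models that process every frame.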