Abstract

Speech Emotion Recognition (SER) is the task of recognizing the emotional aspects of speech and a speaker's affective state irrespective of the semantic content of the speech, exploiting the fact that underlying emotions are often reflected in a person's voice. A central issue in studying SER is finding the combination of audio features, extracted from the speech signal, that works best with a suitable classification system. A well-defined database for speech analysis and research is equally important to SER study; hence, we have used the RAVDESS dataset. In our study we use acoustic features that reflect well-defined and sharp changes in emotional expression. To this end, we extract features such as the amplitude envelope and RMS energy from the time domain; spectral centroid and spectral bandwidth from the frequency domain; and Mel-frequency cepstral coefficients (MFCCs), among others, from the time-frequency domain. We use the MLPClassifier for the classification of emotions. Our results show that a combination of MFCC, mel spectrogram, and chroma features best explains speech emotions through the MLPClassifier.
