Abstract

Over the past decade, Speech Emotion Recognition (SER) across many spoken languages has become a field of growing interest. Mel-Frequency Cepstral Coefficients (MFCCs) are a widely used representation for audio classification and have become a prominent feature in SER systems. However, another feature, Per-Channel Energy Normalization (PCEN), has been shown to outperform MFCCs in speech-related tasks. To compare the two, MFCC and PCEN features were each used as inputs to a one-dimensional Convolutional Neural Network (CNN), trained on samples from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The framework proposed in this paper achieves an accuracy of 85.3% for the configuration that uses PCEN, 77.4% for the configuration that uses only MFCCs, and 78.1% for the configuration that combines both.
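To make the pipeline concrete, below is a minimal sketch of how MFCC and PCEN features might be extracted and fed to a 1-D CNN. The paper does not disclose its architecture or hyperparameters, so everything here (librosa for feature extraction, 40 feature bands, the filter counts and kernel sizes, the frame count, and the 8 RAVDESS speech-emotion classes) is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch only: the paper's actual feature settings and CNN
# architecture are not specified; all values below are illustrative guesses.
import librosa
import numpy as np
from tensorflow.keras import layers, models

def extract_features(path, kind="pcen", sr=22050, n_bands=40):
    """Return a (time_steps, bands) feature matrix for one audio file."""
    y, sr = librosa.load(path, sr=sr)
    if kind == "mfcc":
        feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_bands)
    else:
        # PCEN is applied to a mel power spectrogram; the 2**31 scaling
        # follows the librosa documentation's convention for float input.
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bands)
        feats = librosa.pcen(mel * (2 ** 31), sr=sr)
    return feats.T  # Conv1D expects (time_steps, channels)

def build_cnn(time_steps, n_bands, n_classes=8):
    """A small 1-D CNN over the time axis; feature bands act as channels."""
    return models.Sequential([
        layers.Input(shape=(time_steps, n_bands)),
        layers.Conv1D(64, kernel_size=5, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, kernel_size=5, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(n_classes, activation="softmax"),
    ])

model = build_cnn(time_steps=216, n_bands=40)  # ~5 s of audio at default hop
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

A combined configuration, as in the paper's third result, could simply concatenate the MFCC and PCEN matrices along the channel axis (e.g. with np.concatenate) before training; how the authors actually fuse the two features is not stated in the abstract.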
