Abstract

Automatic emotion recognition from speech is a demanding and challenging problem, as human emotional states are difficult to differentiate. A major obstacle is extracting discriminative features from speech when hand-crafted features are used. Recognition accuracy can be improved with deep learning approaches, which learn high-level features of speech signals directly. In this work, a deep learning algorithm is proposed that extracts high-level features from raw data with high accuracy, irrespective of the language and speakers (male/female) of the speech corpora. For this, the .wav files are converted into RGB spectrogram images, normalized to a size of 224x224x3, and used to fine-tune a Deep Convolutional Neural Network (DCNN) for emotion recognition. The DCNN model is trained in two stages: in stage 1, the optimal learning rate is identified using the Learning Rate (LR) range test, and in stage 2 the model is retrained with this optimal learning rate. A special stride is used to down-sample the features, which reduces the model size. The emotions considered are happiness, sadness, anger, fear, disgust, boredom/surprise, and neutral. The proposed algorithm is evaluated on three popular public speech corpora: EMODB (German), EMOVO (Italian), and SAVEE (British English). The reported recognition accuracy is better than that of existing studies across different languages and speakers.
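As a minimal sketch of the preprocessing step described above, the following Python code converts a .wav file into an RGB mel-spectrogram image resized to 224x224x3, assuming librosa, matplotlib, and Pillow are available. The function name, the mel parameters (e.g. n_mels=128), and the file names are illustrative assumptions, not details taken from the paper.

    # Sketch: .wav -> RGB spectrogram image (224x224x3) for DCNN fine-tuning.
    # Parameter values below are assumptions for illustration only.
    import librosa
    import librosa.display
    import numpy as np
    import matplotlib.pyplot as plt
    from PIL import Image

    def wav_to_rgb_spectrogram(wav_path, png_path, size=(224, 224)):
        """Convert a .wav file to an RGB spectrogram image of the given size."""
        y, sr = librosa.load(wav_path, sr=None)            # keep native sample rate
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
        mel_db = librosa.power_to_db(mel, ref=np.max)      # log-scale the power

        fig = plt.figure(frameon=False)
        ax = plt.Axes(fig, [0.0, 0.0, 1.0, 1.0])           # fill the canvas, no margins
        ax.set_axis_off()                                  # drop axes/ticks; image only
        fig.add_axes(ax)
        librosa.display.specshow(mel_db, sr=sr, ax=ax)     # colormap gives 3 channels
        fig.savefig(png_path, dpi=100)
        plt.close(fig)

        # Normalize to the DCNN input resolution (224x224x3).
        Image.open(png_path).convert("RGB").resize(size).save(png_path)

    # wav_to_rgb_spectrogram("anger_01.wav", "anger_01.png")  # hypothetical file names

The resulting images can then be fed to a standard 224x224x3 DCNN input layer for fine-tuning, as the abstract describes.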
