Abstract

Speech emotion recognition (SER) has received much attention in recent years. The effectiveness of an SER system depends largely on how much useful information is contained in the extracted emotional features. Many studies have achieved state-of-the-art results by feeding different extracted speech features to convolutional neural networks (CNNs), but such models cannot capture the relatively salient emotional features in a speech signal. In this paper, we present a novel complementary feature extraction method for extracting salient emotional features. We compute the Mel-spectrogram and Mel-frequency cepstral coefficients (MFCCs) to capture time-frequency domain information, converting raw speech into emotionally informative features. We then exploit the complementary property of these features and construct a 1D CNN model that selects emotional features effectively, evaluating its performance on the IEMOCAP, RAVDESS, and Emo-DB speech corpora. With the complementary features as input, our method outperforms the baselines and achieves competitive results.
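A minimal sketch of the complementary-feature idea described above, assuming librosa for feature extraction and PyTorch for the 1D CNN. The function and class names, the simple frame-wise stacking of the two feature sets, and all layer sizes are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: stack Mel-spectrogram and MFCC frames as complementary features,
# then classify with a small 1D CNN. Hyperparameters are placeholders.
import librosa
import numpy as np
import torch
import torch.nn as nn

def complementary_features(wav_path, sr=16000, n_mels=64, n_mfcc=40):
    """Concatenate log-Mel-spectrogram and MFCC frames along the
    coefficient axis (assumed combination strategy)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                       # (n_mels, T)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    # Same hop length for both, so the frame counts match.
    return np.concatenate([log_mel, mfcc], axis=0)           # (n_mels+n_mfcc, T)

class SER1DCNN(nn.Module):
    """1D CNN over the time axis; input channels are the stacked
    feature coefficients (hypothetical layer configuration)."""
    def __init__(self, in_channels, n_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(128, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> utterance vector
        )
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):             # x: (batch, channels, time)
        return self.fc(self.net(x).squeeze(-1))

# Example usage on one utterance (4 emotion classes assumed):
# feats = complementary_features("speech.wav")               # (104, T)
# model = SER1DCNN(in_channels=feats.shape[0], n_classes=4)
# logits = model(torch.from_numpy(feats).float().unsqueeze(0))
```

Stacking the two feature sets along the coefficient axis lets a single 1D convolution see both representations of each frame at once, which is one straightforward way to realize the complementary property the abstract describes.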
