Abstract
Speech emotion recognition (SER) is difficult because emotions are complex, dynamic processes that involve multiple dimensions and sub-dimensions. Feature extraction, in which relevant features are extracted from speech so that emotional states can be identified accurately, is a particularly challenging step, and overcoming it is essential to the effectiveness and robustness of an SER system. 3D convolutional neural networks (3D-CNNs) have been used successfully for feature extraction in SER. Speech signals can be converted into spectrogram-like representations in which one axis represents time, another represents frequency, and the third can carry additional context or features. To model the dynamic nature of speech, a squeeze-and-excitation-based 3D-CNN is employed to capture temporal and spatial features. Attention Gated Recurrent Units (AGRU) are then applied to the extracted features to learn long-range temporal dependencies and select the more informative features. Extracting and selecting spatio-temporal features in this way yields a hierarchical representation of the input speech. The fusion of the squeeze-and-excitation 3D-CNN and AGRU, named SE3D-CARN, is evaluated on two datasets, EMO-BD and IEMOCAP, for identifying various emotional states in speech. The proposed SER model reached accuracies of 94.2% and 81.1% on the EMO-BD and IEMOCAP datasets, respectively.
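As a rough illustration of the two mechanisms the abstract names, the following NumPy sketch shows channel-wise squeeze-and-excitation over a 3D feature map and a simple attention pooling over a sequence of recurrent states. It is not the SE3D-CARN implementation: the function names, tensor shapes, reduction ratio, and the dot-product scoring vector are all assumptions made for illustration, and the 3D convolutions and the GRU itself are omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def squeeze_excite_3d(features, w1, w2):
    """Recalibrate the channels of a 3D feature map (assumed layout: C, D, H, W).

    w1 has shape (C // r, C) and w2 has shape (C, C // r), where r is a
    hypothetical reduction ratio.
    """
    # Squeeze: global average over the three non-channel axes -> (C,)
    z = features.mean(axis=(1, 2, 3))
    # Excitation: bottleneck with ReLU, then a sigmoid gate per channel -> (C,)
    gates = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))
    # Scale: multiply every channel by its gate
    return features * gates[:, None, None, None], gates


def attention_pool(hidden, v):
    """Weight recurrent states (T, H) by a softmax over scores hidden @ v."""
    scores = hidden @ v
    scores = scores - scores.max()          # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ hidden, weights        # context (H,), weights (T,)


# Usage with random stand-in data (illustrative shapes only).
rng = np.random.default_rng(0)
C, r = 8, 2
feats = rng.normal(size=(C, 4, 6, 6))
w1 = rng.normal(size=(C // r, C)) * 0.1
w2 = rng.normal(size=(C, C // r)) * 0.1
recalibrated, gates = squeeze_excite_3d(feats, w1, w2)

states = rng.normal(size=(10, 16))          # stand-in for GRU outputs
context, weights = attention_pool(states, rng.normal(size=16))
```

In this sketch the gates let the network emphasize informative channels of the 3D feature map, while the attention weights play the analogous role along the time axis of the recurrent states.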