Abstract

Speech emotion recognition (SER) is difficult because emotions are complex, dynamic processes with multiple dimensions and sub-dimensions. Feature extraction is a challenging step in SER: relevant features must be extracted from the speech signal to identify emotional states accurately. Overcoming these challenges is essential to the effectiveness and robustness of an SER system. 3D convolutional neural networks (3D-CNNs) have been used successfully for feature extraction in SER: speech signals can be converted into spectrogram-like representations in which one axis represents time, another frequency, and the third additional context or features. To model the dynamic nature of speech, a squeeze-and-excitation-based 3D-CNN is employed to capture the temporal and spatial features of speech. Attention gated recurrent units (AGRUs) are then applied to the extracted features to learn long-range temporal dependencies and to select the more informative features. This extraction and selection of spatio-temporal features yields a hierarchical representation of the input speech. The fusion of the squeeze-and-excitation 3D-CNN and the AGRU (named SE3D-CARN) is evaluated on two datasets, EMO-DB and IEMOCAP, to identify various emotional states in speech. The proposed SER model reaches accuracies of 94.2% and 81.1% on the EMO-DB and IEMOCAP datasets, respectively.
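The squeeze-and-excitation recalibration mentioned above can be illustrated with a minimal NumPy sketch: global-average-pool the 3D feature map per channel ("squeeze"), pass the result through a small bottleneck MLP with a sigmoid gate ("excite"), and rescale each channel. This is only an illustration of the generic SE mechanism, not the authors' implementation; all shapes, weights, and names here are made up for the demo.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite_3d(feature_map, w1, b1, w2, b2):
    """Channel-wise recalibration of a 3D-CNN feature map.

    feature_map: array of shape (C, D, H, W), i.e. channels x time-depth
    x height x width. w1/b1 and w2/b2 are the weights of the two small
    dense layers (channel reduction, then restoration). All parameter
    names are illustrative, not taken from the paper.
    """
    # Squeeze: global average pool over the spatio-temporal axes -> (C,)
    z = feature_map.mean(axis=(1, 2, 3))
    # Excite: bottleneck MLP (ReLU, then sigmoid gate) -> per-channel weights in (0, 1)
    s = sigmoid(np.maximum(z @ w1 + b1, 0.0) @ w2 + b2)
    # Scale: reweight each channel of the original feature map
    return feature_map * s[:, None, None, None]

# Tiny demo with random weights: 8 channels, reduction ratio 4 (hypothetical values)
rng = np.random.default_rng(0)
C, r = 8, 4
x = rng.standard_normal((C, 4, 6, 6))
w1, b1 = 0.1 * rng.standard_normal((C, C // r)), np.zeros(C // r)
w2, b2 = 0.1 * rng.standard_normal((C // r, C)), np.zeros(C)
y = squeeze_excite_3d(x, w1, b1, w2, b2)
print(y.shape)  # (8, 4, 6, 6)
```

Because the sigmoid gate lies strictly between 0 and 1, the output preserves the feature map's shape while attenuating less informative channels, which is the recalibration effect the model relies on before the AGRU stage.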
