Abstract

In recent years, convolutional neural networks have been widely applied to speech emotion recognition. However, most existing methods treat speech spectrogram features as images when learning emotion features, which makes it difficult to capture rich emotion information at different scales. In this paper, a speech emotion recognition method based on multi-dimensional feature extraction and multi-scale feature fusion is proposed. Multi-dimensional features are extracted from Mel-frequency cepstral coefficients using parallel convolutional layers, emotion feature representations are learned at different scales using multi-scale residual convolutional layers, and a global feature fusion module finally integrates the local multi-scale emotion features to obtain key global emotion features. Experiments are conducted on the IEMOCAP, CREMA-D, and RAVDESS datasets, and the results show that the proposed network achieves a 3.5%-4% improvement over state-of-the-art methods on four commonly used evaluation metrics, including unweighted accuracy and weighted accuracy.
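To make the described pipeline concrete, the following is a minimal PyTorch sketch of the three stages named in the abstract: parallel convolutional extraction over the MFCC map, multi-scale residual blocks, and a global fusion step. The paper does not specify internals here, so all kernel sizes, channel counts, branch layouts, and the use of global average pooling as a stand-in for the global feature fusion module are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ParallelConvExtractor(nn.Module):
    """Multi-dimensional feature extraction (sketch): parallel conv branches
    with different kernel shapes over the MFCC map (frames x coefficients).
    Kernel shapes are assumptions, not the paper's configuration."""
    def __init__(self, out_channels=32):
        super().__init__()
        self.time_conv = nn.Conv2d(1, out_channels, kernel_size=(9, 1), padding=(4, 0))
        self.freq_conv = nn.Conv2d(1, out_channels, kernel_size=(1, 9), padding=(0, 4))
        self.joint_conv = nn.Conv2d(1, out_channels, kernel_size=3, padding=1)

    def forward(self, x):  # x: (batch, 1, frames, n_mfcc)
        feats = [self.time_conv(x), self.freq_conv(x), self.joint_conv(x)]
        return torch.cat(feats, dim=1)  # (batch, 3 * out_channels, frames, n_mfcc)

class MultiScaleResidualBlock(nn.Module):
    """Multi-scale residual convolution (sketch): parallel 3x3/5x5/7x7 convs,
    fused by a 1x1 conv and added back to the input as a residual."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)
        )
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):
        multi = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(x + self.fuse(multi))

class SERNet(nn.Module):
    """End-to-end sketch: extractor -> multi-scale residual blocks ->
    global pooling (stand-in for the global feature fusion module) -> classifier."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.extract = ParallelConvExtractor(out_channels=32)
        self.blocks = nn.Sequential(
            MultiScaleResidualBlock(96),
            MultiScaleResidualBlock(96),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(96, n_classes)

    def forward(self, x):
        z = self.blocks(self.extract(x))
        return self.head(self.pool(z).flatten(1))

# Example: a batch of 8 utterances, 300 frames, 40 MFCCs, 4 emotion classes.
model = SERNet(n_classes=4)
logits = model(torch.randn(8, 1, 300, 40))
print(logits.shape)  # torch.Size([8, 4])
```

A published model would likely replace the average-pooling stand-in with an attention-based fusion over the multi-scale feature maps, but the overall extract-then-fuse structure follows the abstract's description.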
