Abstract
In recent years, convolutional neural networks have been widely applied to speech emotion recognition. However, most existing methods treat speech spectrogram features as images when learning emotion features, which makes it difficult to capture rich emotion information at different scales. In this paper, a speech emotion recognition method based on multi-dimensional feature extraction and multi-scale feature fusion is proposed. Multi-dimensional features are extracted from Mel-frequency cepstral coefficients using parallel convolutional layers, emotion feature representations are learned at different scales using multi-scale residual convolutional layers, and a global feature fusion module finally integrates the local multi-scale emotion features into key global emotion features. Experiments are conducted on the IEMOCAP, CREMA-D, and RAVDESS datasets, and the results show that the proposed network achieves a 3.5%-4% improvement over state-of-the-art methods on four commonly used evaluation metrics, including unweighted accuracy and weighted accuracy.
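The pipeline summarized above (multi-scale feature extraction from MFCCs followed by global fusion) can be sketched conceptually. The snippet below is a minimal illustration, not the paper's network: it uses fixed averaging filters as stand-ins for the learned convolutional kernels, and concatenation with global average pooling as a stand-in for the fusion module; all function names and kernel sizes are assumptions for illustration.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Per-channel 1-D valid convolution along the time axis."""
    return np.stack([np.convolve(row, kernel, mode="valid") for row in x])

def multi_scale_features(mfcc, kernel_sizes=(3, 5, 7)):
    """Extract features at several temporal scales and fuse them.

    mfcc: (n_coeffs, n_frames) array of MFCC features.
    Returns a fused global feature vector (n_coeffs * len(kernel_sizes),).
    """
    fused = []
    for k in kernel_sizes:
        # Hypothetical averaging filter as a stand-in for learned weights.
        kernel = np.ones(k) / k
        feat = conv1d_valid(mfcc, kernel)   # (n_coeffs, n_frames - k + 1)
        fused.append(feat.mean(axis=1))     # global average pooling over time
    # Concatenation as a simple placeholder for the global fusion module.
    return np.concatenate(fused)

mfcc = np.random.default_rng(0).standard_normal((13, 100))
vec = multi_scale_features(mfcc)
print(vec.shape)  # (39,)
```

In the actual method, each scale's kernel would be learned end-to-end and the fusion module would weight the scales rather than simply concatenating them.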