Abstract
In recent years, multimodal emotion recognition models have increasingly relied on pre-trained networks and attention mechanisms in pursuit of higher accuracy, which raises the training burden and slows both training and inference. To strike a balance between speed and accuracy, this paper proposes a speed-optimized multimodal architecture for speech and text emotion recognition. In the feature extraction stage, a lightweight residual graph convolutional network (ResGCN) serves as the speech feature extractor, and an efficient RoBERTa pre-trained network serves as the text feature extractor. A sparse cross-modal encoder (SCME) with reduced algorithmic complexity is then proposed to fuse these two types of features. Finally, a new gated fusion module (GF) weights the resulting representations and feeds them into a fully connected (FC) layer for classification. The proposed method is evaluated on the IEMOCAP and MELD datasets, achieving weighted accuracies (WA) of 82.4% and 65.0%, respectively, which is higher than the compared methods while maintaining acceptable training and inference speed.
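The overall pipeline described above (speech encoder, text encoder, cross-modal fusion, gated weighting, FC classifier) can be illustrated with a minimal sketch. This is not the paper's implementation: the ResGCN and RoBERTa encoders are replaced by placeholder projections, the SCME is stood in for by a single cross-attention block, and all names, dimensions, and the number of classes are assumptions for illustration only.

```python
# Minimal sketch of the described pipeline; module internals are placeholders,
# not the paper's actual ResGCN, RoBERTa, or SCME implementations.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Weights two representations with a learned sigmoid gate (stand-in for GF)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, a, b):
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1 - g) * b


class SpeechTextEmotionClassifier(nn.Module):
    """Speech and text features -> cross-modal fusion -> gated fusion -> FC."""
    def __init__(self, dim=256, num_classes=4):  # num_classes is a placeholder
        super().__init__()
        # Placeholders for the ResGCN speech encoder and RoBERTa text encoder;
        # here we assume pre-extracted features of size `dim` per time step.
        self.speech_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)
        # Stand-in for the sparse cross-modal encoder (SCME): a single
        # multi-head attention block with speech queries attending to text.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.fusion = GatedFusion(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, speech_feats, text_feats):
        s = self.speech_proj(speech_feats)    # (B, T_speech, dim)
        t = self.text_proj(text_feats)        # (B, T_text, dim)
        cross, _ = self.cross_attn(s, t, t)   # cross-modal interaction
        fused = self.fusion(cross.mean(dim=1), s.mean(dim=1))
        return self.fc(fused)


# Usage with dummy feature tensors
model = SpeechTextEmotionClassifier()
logits = model(torch.randn(2, 50, 256), torch.randn(2, 30, 256))
print(logits.shape)  # torch.Size([2, 4])
```

The sketch only shows how the fused cross-modal representation and the unimodal speech representation could be combined by a gate before classification; the paper's actual SCME sparsity pattern and GF formulation are described in the full text.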