Abstract
Traditional emotion analysis considers text alone; multimodal emotion analysis extends it to text, image, audio and other modalities. To exploit the contextual interaction information expressed in each modality, this paper proposes a multimodal interactive emotion classification model based on video context. An ALBERT-BiGRU network is built for text feature learning, and independent BiGRU models extract contextual features from the text, audio and video modalities. The three modal features are then fused with an attention mechanism to complete the emotion analysis task. Compared with existing models on the MOSI and IEMOCAP datasets, the classification accuracy and F1 score reach 81.71% and 81.44% on MOSI and 66.97% and 67.20% on IEMOCAP, which are 1.41% and 2.16% higher than the best baseline respectively, effectively improving the accuracy of multimodal emotion prediction.
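As a rough illustration of the architecture the abstract describes (not the authors' code), the following PyTorch sketch encodes each modality with its own BiGRU, fuses the three representations with a simple attention mechanism, and classifies each utterance. The feature dimensions, the class count, and the use of pre-extracted ALBERT text features are assumptions made only for the example.

```python
# Minimal sketch, assuming PyTorch and pre-extracted per-utterance features
# (text features taken from an ALBERT encoder upstream).
import torch
import torch.nn as nn


class ContextBiGRU(nn.Module):
    """Per-modality bidirectional GRU that encodes utterance-level context."""

    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (batch, seq_len, in_dim)
        out, _ = self.bigru(x)                 # (batch, seq_len, 2 * hid_dim)
        return out


class AttentionFusion(nn.Module):
    """Attention weights over the three modality representations per time step."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, text, audio, video):     # each: (batch, seq_len, dim)
        stacked = torch.stack([text, audio, video], dim=2)   # (B, T, 3, dim)
        weights = torch.softmax(self.score(stacked), dim=2)  # (B, T, 3, 1)
        return (weights * stacked).sum(dim=2)                # (B, T, dim)


class MultimodalClassifier(nn.Module):
    def __init__(self, text_dim, audio_dim, video_dim, hid_dim, num_classes):
        super().__init__()
        self.text_enc = ContextBiGRU(text_dim, hid_dim)
        self.audio_enc = ContextBiGRU(audio_dim, hid_dim)
        self.video_enc = ContextBiGRU(video_dim, hid_dim)
        self.fusion = AttentionFusion(2 * hid_dim)
        self.classifier = nn.Linear(2 * hid_dim, num_classes)

    def forward(self, text, audio, video):
        fused = self.fusion(self.text_enc(text),
                            self.audio_enc(audio),
                            self.video_enc(video))
        return self.classifier(fused)          # logits per utterance


# Hypothetical usage: a batch of 2 videos with 10 utterances each.
model = MultimodalClassifier(text_dim=768, audio_dim=74, video_dim=35,
                             hid_dim=128, num_classes=2)
logits = model(torch.randn(2, 10, 768), torch.randn(2, 10, 74),
               torch.randn(2, 10, 35))
print(logits.shape)  # torch.Size([2, 10, 2])
```

The attention step here is only one plausible fusion scheme; the paper's exact fusion and hyperparameters are not specified in the abstract.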