Abstract
Multimodal Sentiment Analysis (MSA) aims to recognize emotion categories from textual, visual, and acoustic cues. In real-life scenarios, however, one or two modalities may be missing for various reasons, and when the text modality is missing, performance deteriorates markedly, since text carries much more semantic information than the vision and audio modalities. To this end, we propose the Multimodal Reconstruct and Align Net (MRAN) to tackle the missing-modality problem and, in particular, to relieve the decline caused by the absence of the text modality. We first propose a Multimodal Embedding and a Missing Index Embedding to guide the reconstruction of the missing modalities' features. Then, visual and acoustic features are projected into the textual feature space, and the features of all three modalities are trained to lie close to the word embedding of their corresponding emotion category, aligning visual and acoustic features with textual features. In this text-centered way, the vision and audio modalities benefit from the more informative text modality, which improves the robustness of the network under different modality-missing conditions, especially when the text modality is missing. Experimental results on two multimodal benchmarks, IEMOCAP and CMU-MOSEI, show that our method outperforms baseline methods, achieving superior results under various modality-missing conditions.
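The text-centered alignment described above can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: the feature dimensions, the random linear projections, and the `alignment_loss` helper are all hypothetical, and the loss shown here is a simple average cosine distance of each modality's feature to the emotion label's word embedding.

```python
import math
import random

random.seed(0)

# Hypothetical feature dimensions for the text, vision, and audio modalities.
D_TEXT, D_VIS, D_AUD = 8, 5, 6

def rand_matrix(rows, cols):
    """Random Gaussian matrix standing in for a learned linear projection."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def project(vec, mat):
    """Row vector times matrix: maps a modality feature into the text space."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec))
            for j in range(len(mat[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical projections mapping vision/audio features into the text space.
W_vis = rand_matrix(D_VIS, D_TEXT)
W_aud = rand_matrix(D_AUD, D_TEXT)

def alignment_loss(text_f, vis_f, aud_f, label_emb):
    """Average (1 - cosine) distance of all three modalities' features to
    the word embedding of the target emotion category; lower = better aligned."""
    feats = [text_f, project(vis_f, W_vis), project(aud_f, W_aud)]
    return sum(1.0 - cosine(f, label_emb) for f in feats) / len(feats)

# Toy example: random features and a random emotion-label word embedding.
text_f = [random.gauss(0, 1) for _ in range(D_TEXT)]
vis_f = [random.gauss(0, 1) for _ in range(D_VIS)]
aud_f = [random.gauss(0, 1) for _ in range(D_AUD)]
label_emb = [random.gauss(0, 1) for _ in range(D_TEXT)]

loss = alignment_loss(text_f, vis_f, aud_f, label_emb)
```

In training, minimizing such a loss would pull the projected visual and acoustic features toward the same region of the textual space as the label embedding, which is how the text modality can transfer information to the other two.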