Abstract
Multimodal Sentiment Analysis (MSA) aims to recognize emotion categories from textual, visual, and acoustic cues. In real-life scenarios, however, one or two modalities may be missing for various reasons, and when the text modality is missing, performance deteriorates markedly, since text carries far more semantic information than the visual and acoustic modalities. To this end, we propose the Multimodal Reconstruct and Align Net (MRAN) to tackle the missing modality problem, particularly to mitigate the decline caused by the absence of the text modality. We first introduce a Multimodal Embedding and a Missing Index Embedding to guide the reconstruction of missing modality features. Visual and acoustic features are then projected into the textual feature space, and the features of all three modalities are learned to be close to the word embedding of their corresponding emotion category, aligning visual and acoustic features with textual features. In this text-centered way, the visual and acoustic modalities benefit from the more informative text modality, which improves the robustness of the network under different modality-missing conditions, especially when the text modality is absent. Experimental results on two multimodal benchmarks, IEMOCAP and CMU-MOSEI, show that our method outperforms baseline methods, achieving superior results under various modality-missing conditions.
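The following is a minimal, hypothetical sketch of the text-centered alignment idea described above; it is not the authors' implementation. The module names, feature dimensions, and the use of a cosine-similarity alignment loss are illustrative assumptions only.

```python
# Illustrative sketch: project vision/audio features into the textual feature
# space and pull all three modalities toward the word embedding of the
# ground-truth emotion category. All names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextCenteredAlignment(nn.Module):
    def __init__(self, dim_t=768, dim_v=35, dim_a=74, num_classes=6):
        super().__init__()
        self.proj_v = nn.Linear(dim_v, dim_t)   # vision -> text space
        self.proj_a = nn.Linear(dim_a, dim_t)   # audio  -> text space
        # Stand-in for word embeddings of the emotion-category names.
        self.label_emb = nn.Embedding(num_classes, dim_t)

    def forward(self, feat_t, feat_v, feat_a, labels):
        # Map visual and acoustic features into the textual feature space.
        v_in_t = self.proj_v(feat_v)
        a_in_t = self.proj_a(feat_a)
        anchor = self.label_emb(labels)         # (batch, dim_t)

        # Alignment loss: each modality's feature should be close (in cosine
        # similarity) to the embedding of its emotion category.
        def align(x):
            return (1.0 - F.cosine_similarity(x, anchor, dim=-1)).mean()

        loss = align(feat_t) + align(v_in_t) + align(a_in_t)
        return loss, (feat_t, v_in_t, a_in_t)


if __name__ == "__main__":
    model = TextCenteredAlignment()
    t = torch.randn(8, 768)   # textual features (e.g., from a text encoder)
    v = torch.randn(8, 35)    # visual features
    a = torch.randn(8, 74)    # acoustic features
    y = torch.randint(0, 6, (8,))
    loss, _ = model(t, v, a, y)
    loss.backward()
    print(f"alignment loss: {loss.item():.4f}")
```

Because all modalities are drawn toward the same label word embedding, the shared textual space acts as the anchor that lets vision and audio inherit its semantic structure, which is what helps when the text modality itself is missing at test time.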