Abstract
Multimodal Sentiment Analysis (MSA) aims to recognize emotion categories from textual, visual, and acoustic cues. In real-life scenarios, however, one or two modalities may be missing for various reasons, and when the text modality is missing, performance deteriorates markedly, since text carries far more semantic information than the visual and acoustic modalities. To this end, we propose the Multimodal Reconstruct and Align Net (MRAN) to tackle the missing modality problem, particularly to mitigate the decline caused by the absence of the text modality. We first introduce a Multimodal Embedding and a Missing Index Embedding to guide the reconstruction of missing modality features. Visual and acoustic features are then projected into the textual feature space, and the features of all three modalities are learned to be close to the word embedding of their corresponding emotion category, aligning visual and acoustic features with textual features. In this text-centered way, the visual and acoustic modalities benefit from the more informative text modality, which improves the robustness of the network under different modality-missing conditions, especially when the text modality is absent. Experimental results on two multimodal benchmarks, IEMOCAP and CMU-MOSEI, show that our method outperforms baseline methods, achieving superior results under various modality-missing conditions.
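The following is a minimal, hypothetical sketch of the text-centered alignment idea described above; it is not the authors' implementation. The module names, feature dimensions, and the use of a cosine-similarity alignment loss are illustrative assumptions only.

```python
# Illustrative sketch: project vision/audio features into the textual feature
# space and pull all three modalities toward the word embedding of the
# ground-truth emotion category. All names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextCenteredAlignment(nn.Module):
    def __init__(self, dim_t=768, dim_v=35, dim_a=74, num_classes=6):
        super().__init__()
        self.proj_v = nn.Linear(dim_v, dim_t)   # vision -> text space
        self.proj_a = nn.Linear(dim_a, dim_t)   # audio  -> text space
        # Stand-in for word embeddings of the emotion-category names.
        self.label_emb = nn.Embedding(num_classes, dim_t)

    def forward(self, feat_t, feat_v, feat_a, labels):
        # Map visual and acoustic features into the textual feature space.
        v_in_t = self.proj_v(feat_v)
        a_in_t = self.proj_a(feat_a)
        anchor = self.label_emb(labels)         # (batch, dim_t)

        # Alignment loss: each modality's feature should be close (in cosine
        # similarity) to the embedding of its emotion category.
        def align(x):
            return (1.0 - F.cosine_similarity(x, anchor, dim=-1)).mean()

        loss = align(feat_t) + align(v_in_t) + align(a_in_t)
        return loss, (feat_t, v_in_t, a_in_t)


if __name__ == "__main__":
    model = TextCenteredAlignment()
    t = torch.randn(8, 768)   # textual features (e.g., from a text encoder)
    v = torch.randn(8, 35)    # visual features
    a = torch.randn(8, 74)    # acoustic features
    y = torch.randint(0, 6, (8,))
    loss, _ = model(t, v, a, y)
    loss.backward()
    print(f"alignment loss: {loss.item():.4f}")
```

Because all modalities are drawn toward the same label word embedding, the shared textual space acts as the anchor that lets vision and audio inherit its semantic structure, which is what helps when the text modality itself is missing at test time.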