Abstract
The primary challenge of the multimodal sentiment analysis (MSA) task is modality fusion, and modalities may be missing during fusion, which degrades prediction performance. Most previous research on multimodal fusion uses single-stage fusion, disregarding how various modality subsets interact, and rarely considers the relative positional relationships within modality sequences, which fragments contextual information. To address these issues, this study introduces an MSA method based on a multi-stage graph fusion network (MSGFN) to improve the robustness of the model under random missing-modality conditions. First, inter-modal and intra-modal multi-head attention are used to learn a robust representation of each modality sequence. Relative position encoding (RPE) is incorporated into the attention mechanism so that the model can perceive and learn the relative order of elements in a modality sequence when computing attention, thereby better capturing the contextual information of the sequence. Next, a Transformer encoder receives the learned modality features, and a pre-trained model supervises the reconstruction of the missing modality information. Finally, the feature representations of the different modalities are fused using the multi-stage graph fusion network, and the fused output is used for the final sentiment classification. Extensive experiments on two publicly available datasets, CMU-MOSI and IEMOCAP, indicate that the proposed method handles the challenges posed by modality fusion and missing modalities better than several baseline methods.
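As an illustration of the RPE idea mentioned above, the following is a minimal sketch (not the authors' MSGFN code) of multi-head self-attention with a learned relative position bias, so that attention scores depend on the signed distance between sequence positions. The class and parameter names (RelPosMultiHeadAttention, max_rel_dist) are hypothetical.

```python
# Minimal sketch of multi-head attention with a learned relative position bias.
# This is a generic illustration of relative position encoding, not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPosMultiHeadAttention(nn.Module):
    def __init__(self, d_model=128, n_heads=4, max_rel_dist=64):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # one learned bias per head for each clipped relative distance in [-max, +max]
        self.rel_bias = nn.Parameter(torch.zeros(n_heads, 2 * max_rel_dist + 1))
        self.max_rel_dist = max_rel_dist

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq_len, d_head)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5    # (b, h, t, t)
        # signed relative distances, clipped and shifted to index the bias table
        rel = torch.arange(t).unsqueeze(0) - torch.arange(t).unsqueeze(1)
        rel = rel.clamp(-self.max_rel_dist, self.max_rel_dist) + self.max_rel_dist
        scores = scores + self.rel_bias[:, rel]                  # broadcast over batch
        attn = F.softmax(scores, dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(ctx)
```

The bias table adds a position-dependent term to every attention score, which is how the model can weight nearby and distant elements of a modality sequence differently regardless of their absolute positions.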