Abstract

Existing visual–textual sentiment analysis methods often perform poorly because they make limited use of the correlation between modalities, i.e., they neglect the heterogeneity and homogeneity of visual and textual data. To overcome these limitations, we propose a Multimodal Fusion Network (MFN) with a multi-head self-attention mechanism. MFN uses neural networks and attention mechanisms to minimize noise interference between modalities and obtain independent visual and textual features. Furthermore, it exploits correlations between fine-grained local region feature representations from the two modalities, using different numbers of hidden neurons, to leverage the complementary information in heterogeneous visual and textual data. Extensive experiments show that MFN outperforms 11 state-of-the-art methods by at least 0.11%, 0.13%, and 0.38% on the Twitter, Flickr, and Getty Image datasets, respectively.
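To make the fusion idea concrete, the sketch below shows one plausible way to combine visual region features and textual token features with multi-head self-attention. It is a minimal illustration, not the authors' MFN: the module name (SimpleMultimodalFusion), feature dimensions, pooling, and classifier head are all assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class SimpleMultimodalFusion(nn.Module):
    """Illustrative attention-based fusion of visual and textual features.

    Dimensions and layer choices are assumptions for this sketch, not the
    exact MFN configuration reported in the paper.
    """

    def __init__(self, visual_dim=2048, text_dim=768, hidden_dim=512,
                 num_heads=8, num_classes=3):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Multi-head self-attention over the concatenated sequence lets
        # fine-grained visual regions attend to textual tokens and vice versa.
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                               batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, visual_regions, text_tokens):
        # visual_regions: (B, R, visual_dim); text_tokens: (B, T, text_dim)
        v = self.visual_proj(visual_regions)
        t = self.text_proj(text_tokens)
        fused = torch.cat([v, t], dim=1)          # (B, R + T, hidden_dim)
        attended, _ = self.self_attn(fused, fused, fused)
        pooled = attended.mean(dim=1)             # simple mean pooling
        return self.classifier(pooled)

# Usage with random tensors standing in for CNN / text-encoder outputs.
model = SimpleMultimodalFusion()
vis = torch.randn(4, 36, 2048)   # e.g., 36 region features per image
txt = torch.randn(4, 20, 768)    # e.g., 20 token embeddings per sentence
logits = model(vis, txt)
print(logits.shape)              # torch.Size([4, 3])
```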
