Abstract

Existing visual–textual sentiment analysis methods often perform poorly because they make limited use of the correlation between modalities, i.e., they neglect the heterogeneity and homogeneity of visual and textual data. To overcome these limitations, we propose a Multimodal Fusion Network (MFN) with a multi-head self-attention mechanism. MFN uses neural networks and attention mechanisms to minimize noise interference between modalities and obtain independent visual and textual features. Furthermore, it exploits correlations between fine-grained local region feature representations from the two modalities, using different numbers of hidden neurons, to leverage the complementary information in heterogeneous visual and textual data. Extensive experiments show that MFN outperforms 11 state-of-the-art methods by at least 0.11%, 0.13%, and 0.38% on the Twitter, Flickr, and Getty Image datasets, respectively.
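To make the fusion idea concrete, the sketch below shows one plausible way to combine visual region features and textual token features with multi-head self-attention. It is a minimal illustration, not the authors' MFN: the module name (SimpleMultimodalFusion), feature dimensions, pooling, and classifier head are all assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class SimpleMultimodalFusion(nn.Module):
    """Illustrative attention-based fusion of visual and textual features.

    Dimensions and layer choices are assumptions for this sketch, not the
    exact MFN configuration reported in the paper.
    """

    def __init__(self, visual_dim=2048, text_dim=768, hidden_dim=512,
                 num_heads=8, num_classes=3):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Multi-head self-attention over the concatenated sequence lets
        # fine-grained visual regions attend to textual tokens and vice versa.
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                               batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, visual_regions, text_tokens):
        # visual_regions: (B, R, visual_dim); text_tokens: (B, T, text_dim)
        v = self.visual_proj(visual_regions)
        t = self.text_proj(text_tokens)
        fused = torch.cat([v, t], dim=1)          # (B, R + T, hidden_dim)
        attended, _ = self.self_attn(fused, fused, fused)
        pooled = attended.mean(dim=1)             # simple mean pooling
        return self.classifier(pooled)

# Usage with random tensors standing in for CNN / text-encoder outputs.
model = SimpleMultimodalFusion()
vis = torch.randn(4, 36, 2048)   # e.g., 36 region features per image
txt = torch.randn(4, 20, 768)    # e.g., 20 token embeddings per sentence
logits = model(vis, txt)
print(logits.shape)              # torch.Size([4, 3])
```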
