Abstract

Human language is multimodal, comprising textual, visual, and acoustic information, and multimodal sentiment analysis aims to recognize sentiment from these combined signals. Among the three modalities, text carries richer information than the others, so with the development of pre-trained text representation models, most multimodal sentiment analysis methods treat text as the primary modality and the other modalities as supplementary. Existing methods suffer from two limitations: 1) the inherent heterogeneity of multimodal data, which makes fusion difficult because different modalities reside in different feature spaces; and 2) asynchronism caused by the inconsistent sampling rates of the modalities' time-series data. To alleviate this heterogeneity and asynchronism, we propose HMAI-BERT, a hierarchical multimodal alignment and interaction network-enhanced BERT. In HMAI-BERT, to improve the efficiency of multimodal interaction, we introduce a memory network that aligns the different multimodal representations before fusion. After alignment, we propose a modal update method to address asynchronism, in which each modality is reinforced by interacting with the other modalities. We further introduce a fusion module to integrate the three reinforced modalities and a sentiment-enhanced memory to strengthen the multimodal representation. Experiments on two public datasets show that HMAI-BERT outperforms state-of-the-art methods.
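To make the alignment-then-interaction pipeline sketched in the abstract concrete, the following is a minimal, illustrative Python (PyTorch) sketch of that general flow: acoustic and visual streams are aligned onto the text timeline, the text modality is then reinforced through interaction with the aligned streams, and a fusion layer produces a sentiment score. All class names, dimensions, and the choice of cross-attention as the alignment/interaction mechanism are assumptions for illustration, not the published HMAI-BERT architecture.

```python
# Illustrative sketch only: cross-attention stands in for the paper's memory
# network, modal update, and fusion modules, whose exact designs are not
# specified in the abstract.
import torch
import torch.nn as nn


class CrossModalAlign(nn.Module):
    """Aligns a source modality onto the text sequence via cross-attention."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # Querying with text re-samples the other modality onto the text
        # time steps, mitigating asynchronous sampling rates.
        aligned, _ = self.attn(query=text, key=other, value=other)
        return aligned


class HMAISketch(nn.Module):
    """Toy alignment + interaction + fusion stack (not the published model)."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.align_a = CrossModalAlign(dim)   # acoustic -> text timeline
        self.align_v = CrossModalAlign(dim)   # visual   -> text timeline
        self.update_t = CrossModalAlign(dim)  # text reinforced by other modalities
        self.fuse = nn.Linear(3 * dim, dim)   # integrate the three streams
        self.head = nn.Linear(dim, 1)         # sentiment regression score

    def forward(self, text, audio, vision):
        a = self.align_a(text, audio)
        v = self.align_v(text, vision)
        t = self.update_t(text, torch.cat([a, v], dim=1))
        fused = self.fuse(torch.cat([t, a, v], dim=-1))
        return self.head(fused.mean(dim=1))


if __name__ == "__main__":
    # Dummy batch: 2 samples with different sequence lengths per modality.
    model = HMAISketch()
    out = model(torch.randn(2, 20, 128),   # text token embeddings
                torch.randn(2, 50, 128),   # acoustic frames
                torch.randn(2, 35, 128))   # visual frames
    print(out.shape)  # torch.Size([2, 1])
```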
