Abstract

With the development of multimodal sentiment analysis, target-level (aspect-level) multimodal sentiment analysis has received increasing attention; it aims to judge the sentiment orientation of a target word using both visual and textual information. Most existing methods rely on combining the whole image with the text while ignoring the implicit affective regions in the image. We introduce a novel affective region recognition and fusion network (ARFN) for target-level multimodal sentiment classification that focuses on aligning and fusing the visual and textual modalities. First, to produce a visual representation that carries sentiment cues, ARFN employs the YOLOv5 algorithm to extract object regions from the image and selects the affective regions according to a selection strategy. Next, the method learns target-sensitive visual representations and textual semantic representations through a multi-head attention mechanism and the pre-trained BERT model, respectively. Finally, ARFN fuses the textual and visual representations through a multimodal interaction module to perform target-level multimodal sentiment classification. Our approach achieves state-of-the-art performance on two publicly available multimodal Twitter datasets, and the experimental results demonstrate its effectiveness.
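As a rough illustration of the pipeline sketched in the abstract, the following is a minimal fusion-stage sketch assuming that affective region features (e.g., pooled YOLOv5 detections) and BERT token embeddings are computed upstream; all module names, dimensions, and the pooling scheme are hypothetical and do not reflect the authors' implementation.

```python
import torch
import torch.nn as nn


class ARFNFusionSketch(nn.Module):
    """Illustrative sketch of target-sensitive visual/textual fusion.

    Assumes BERT token embeddings (sentence + target) and features of the
    selected affective regions are provided by upstream components.
    """

    def __init__(self, text_dim=768, region_dim=2048, hidden_dim=768,
                 num_heads=8, num_classes=3):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        # Target-sensitive visual representation: text/target tokens attend
        # to the affective region features via multi-head attention.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                                batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_tokens, region_feats):
        # text_tokens:  (B, L, text_dim)   BERT outputs for sentence + target
        # region_feats: (B, R, region_dim) features of selected affective regions
        text = self.text_proj(text_tokens)                      # (B, L, H)
        regions = self.region_proj(region_feats)                # (B, R, H)
        visual, _ = self.cross_attn(text, regions, regions)     # (B, L, H)
        # Simple [CLS]-style pooling and concatenation as the interaction step.
        fused = torch.cat([text[:, 0], visual[:, 0]], dim=-1)   # (B, 2H)
        return self.classifier(fused)                           # sentiment logits


if __name__ == "__main__":
    model = ARFNFusionSketch()
    logits = model(torch.randn(2, 32, 768), torch.randn(2, 5, 2048))
    print(logits.shape)  # torch.Size([2, 3])
```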
