Abstract

Multimodal sentiment analysis has recently demonstrated its significance in a variety of domains. For sentiment analysis, different aspects of distinct modalities that correspond to one target are processed and analyzed. In this work, we propose targeted aspect-based multimodal sentiment analysis (TABMSA) for the first time. Furthermore, we devise an attention capsule extraction and multi-head fusion network (EF-Net) for the task of TABMSA. A multi-head attention (MHA) based network and ResNet-152 are employed to process the text and the image, respectively. The integration of MHA and the capsule network aims to capture the interaction among the multimodal inputs. In addition to the targeted aspect, information from the context and the image is also incorporated into the delivered sentiment. We evaluate the proposed model on two manually annotated datasets. The experimental results demonstrate the effectiveness of the proposed model on this new task.
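
As a rough illustration of the architecture described above, the sketch below wires an MHA-based text encoder, a ResNet-152 image encoder, and an MHA fusion step into a single sentiment classifier. It is a minimal sketch only: the module names, dimensions, and the mean-pool-plus-linear head standing in for the capsule layer are assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet152

    class EFNetSketch(nn.Module):
        """Minimal sketch of an EF-Net-style TABMSA model (dimensions are assumptions)."""
        def __init__(self, vocab_size=30000, d_model=256, n_heads=8, n_classes=3):
            super().__init__()
            # Text branch: token embeddings refined by multi-head self-attention.
            self.embed = nn.Embedding(vocab_size, d_model)
            self.text_mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # Image branch: ResNet-152 backbone with its final fc replaced by a projection to d_model.
            backbone = resnet152(weights=None)
            backbone.fc = nn.Linear(backbone.fc.in_features, d_model)
            self.image_enc = backbone
            # Fusion: text tokens (queries) attend to the image feature (key/value).
            self.fusion_mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # Stand-in for the capsule layer: mean-pool + linear classifier (assumption).
            self.classifier = nn.Linear(d_model, n_classes)

        def forward(self, token_ids, image):
            t = self.embed(token_ids)                  # (B, L, d_model)
            t, _ = self.text_mha(t, t, t)              # self-attention over the text
            v = self.image_enc(image).unsqueeze(1)     # (B, 1, d_model)
            fused, _ = self.fusion_mha(t, v, v)        # text attends to the image
            return self.classifier(fused.mean(dim=1))  # sentiment logits

    model = EFNetSketch()
    logits = model(torch.randint(0, 30000, (2, 16)), torch.randn(2, 3, 224, 224))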

Highlights

  • Sentiment analysis, also referred to as sentiment classification, aims to extract opinions from large amounts of unstructured text and classify them into sentiment polarities: positive, neutral, or negative [1]

  • On current shopping and social platforms, text and image information mutually reinforce and complement each other, so models are dedicatedly devised to classify sentiment polarity by using both kinds of data and their latent relation [5]. Recent publications report achievements on the task of multimodal sentiment analysis

  • Unlike the previous approach of bilinear pooling, we use a multi-head attention network for multimodal feature fusion, because the multi-head attention mechanism can focus on the interaction between the textual and visual modalities in different facets; this helps the model capture more inter-modality correlation information (see the sketch after this list)
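
To make the "different facets" point concrete, the toy snippet below performs one cross-modal multi-head attention step, with text features as queries and image-region features as keys and values, and keeps the per-head attention maps; each head yields its own text-to-region weighting. Tensor shapes and names are illustrative assumptions rather than the paper's configuration.

    import torch
    import torch.nn as nn

    d_model, n_heads = 256, 8
    cross_mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    text_feats = torch.randn(1, 16, d_model)   # 16 text tokens (assumed shape)
    image_feats = torch.randn(1, 49, d_model)  # 49 image regions, e.g. a 7x7 grid (assumed)

    # Text queries attend to image keys/values; keep per-head weights to inspect each "facet".
    fused, attn = cross_mha(text_feats, image_feats, image_feats,
                            need_weights=True, average_attn_weights=False)
    print(fused.shape)  # torch.Size([1, 16, 256]) -- image-aware text features
    print(attn.shape)   # torch.Size([1, 8, 16, 49]) -- one text-to-region map per head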

Summary

INTRODUCTION

Sentiment analysis, also referred to as sentiment classification, aims to extract opinions from large amounts of unstructured text and classify them into sentiment polarities: positive, neutral, or negative [1]. On current shopping and social platforms, text and image information mutually reinforce and complement each other, so models are dedicatedly devised to classify sentiment polarity by using both kinds of data and their latent relation [5]. Recent publications report achievements on the task of multimodal sentiment analysis. Unlike the previous approach of bilinear pooling, we use a multi-head attention network for multimodal feature fusion, because the multi-head attention mechanism can focus on the interaction between the textual and visual modalities in different facets; this helps the model capture more inter-modality correlation information. Yu et al. proposed a Multimodal BERT architecture, which adapts BERT for cross-modal interaction to obtain target-sensitive textual/visual representations and utilizes stacked self-attention layers to achieve multimodal fusion [5]. In what follows, X indicates a general input of the MHA network (a standard formulation is sketched below).
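
The sentence above refers to X as a general input of the MHA network, but the accompanying equation is not reproduced on this page; the block below restates the standard multi-head attention formulation from the Transformer literature under that reading, with W_i^Q, W_i^K, W_i^V, and W^O as learned projections and d_k the per-head key dimension (the exact parameterization used in EF-Net is an assumption).

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
    \mathrm{head}_i = \mathrm{Attention}\!\left(X W_i^{Q},\; X W_i^{K},\; X W_i^{V}\right)
    \mathrm{MHA}(X) = \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^{O}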

FEATURE EXTRACTING LAYER
Method
CASE STUDY
Findings
CONCLUSION