Abstract

People are becoming accustomed to posting images with captions on social media platforms to express their opinions, and Visual-Caption Sentiment Recognition (VCSR) has therefore attracted growing attention in recent years. The correlation between the visual and caption modalities is crucial for VCSR. However, most recent VCSR strategies simply concatenate features from the visual and caption modalities extracted by pre-trained deep learning models with millions of trainable parameters, without a dedicated attention module, which ultimately leads to less desirable results. Motivated by this observation, we propose a novel model, VABDC-Net, that integrates an attention module with a convolutional neural network to focus on the most relevant information in the visual modality and an attentional tokenizer-based method to extract the most relevant contextual information from the caption modality. To meet this need, the main contributions of our work are: (1) an attentional tokenizer-based bi-directional caption branch to retrieve useful textual features from captions, (2) an attentional visual branch to retrieve appropriate visual features, and (3) a cross-domain feature fusion to merge the multi-modal features and predict sentiment. Thorough experiments on two benchmark datasets, Twitter-15 (83.80% accuracy) and Twitter-17 (72.42% accuracy), show that our technique outperforms existing VCSR methods.
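
The abstract does not include code; the following is a minimal PyTorch sketch of the two-branch-plus-fusion idea described above. All module names, layer sizes, the bi-directional LSTM encoder, and the three-class output are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class AttentionalVisualBranch(nn.Module):
    """Illustrative visual branch: small CNN, then self-attention over spatial tokens."""
    def __init__(self, dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, images):                      # images: (B, 3, H, W)
        feats = self.cnn(images)                    # (B, dim, H', W')
        tokens = feats.flatten(2).transpose(1, 2)   # (B, H'*W', dim) spatial tokens
        attended, _ = self.attn(tokens, tokens, tokens)
        return attended.mean(dim=1)                 # (B, dim) pooled visual feature

class AttentionalCaptionBranch(nn.Module):
    """Illustrative caption branch: embeddings, bi-directional LSTM, self-attention."""
    def __init__(self, vocab_size=30000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.bilstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, token_ids):                   # token_ids: (B, T)
        x = self.embed(token_ids)
        x, _ = self.bilstm(x)                       # (B, T, dim) contextual features
        attended, _ = self.attn(x, x, x)
        return attended.mean(dim=1)                 # (B, dim) pooled caption feature

class VCSRNet(nn.Module):
    """Concatenates the two branch features and predicts sentiment (3-way here)."""
    def __init__(self, dim=128, num_classes=3):
        super().__init__()
        self.visual = AttentionalVisualBranch(dim)
        self.caption = AttentionalCaptionBranch(dim=dim)
        self.fusion = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, images, token_ids):
        fused = torch.cat([self.visual(images), self.caption(token_ids)], dim=-1)
        return self.fusion(fused)                   # (B, num_classes) sentiment logits

# Quick shape check with random inputs.
model = VCSRNet()
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 30000, (2, 20)))
print(logits.shape)  # torch.Size([2, 3])
```

The fusion step here is a plain concatenation followed by a linear head; the paper's cross-domain feature fusion may differ, so treat this only as a structural outline of the pipeline.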
