Abstract
With the massive amount of social media data available online, conventional single-modality emotion classification has evolved into more complex multimodal sentiment analysis models. Most existing works extract image features only at a coarse level, missing fine-grained visual details. Moreover, social media posts often contain multiple images, whereas existing works consider only the single-image case and use one image to represent visual features. In fact, extending the single-image case to the multiple-image case is nontrivial because of the complex relations among the images. To address these issues, in this paper we propose a Gated Fusion Semantic Relation (GFSR) network to explore semantic relations for social media sentiment analysis. In addition to inter-relations between the visual and textual modalities, we also exploit intra-relations among multiple images, which can further improve sentiment analysis performance. Specifically, we design a gated fusion network to fuse global image embeddings with the corresponding local Adjective Noun Pair (ANP) embeddings. Then, apart from textual relations and cross-modal relations, we employ a multi-head cross-attention mechanism between images and ANPs to capture similar semantic content. Finally, the updated textual and visual representations are concatenated for sentiment prediction. Extensive experiments on the real-world Yelp and Flickr30k datasets show that GFSR improves accuracy by about 0.10% to 3.66% on the Yelp dataset with multiple images, and achieves the best two-class accuracy and the best three-class macro F1 on the Flickr30k dataset with a single image.
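To make the fusion and attention steps more concrete, below is a minimal PyTorch sketch of gated fusion between global image embeddings and ANP embeddings, followed by multi-head cross attention from images to ANPs. All module names, dimensions, and details are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the gated fusion and cross-attention steps described
# in the abstract; names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Fuse a global image embedding with its local ANP embedding via a learned gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, img: torch.Tensor, anp: torch.Tensor) -> torch.Tensor:
        # img, anp: (batch, num_images, dim)
        g = torch.sigmoid(self.gate(torch.cat([img, anp], dim=-1)))
        # The gate decides, per dimension, how much of each source to keep.
        return g * img + (1.0 - g) * anp


class ImageANPCrossAttention(nn.Module):
    """Multi-head cross attention: image embeddings attend to ANP embeddings."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_seq: torch.Tensor, anp_seq: torch.Tensor) -> torch.Tensor:
        # img_seq: (batch, num_images, dim); anp_seq: (batch, num_anps, dim)
        out, _ = self.attn(query=img_seq, key=anp_seq, value=anp_seq)
        return out


if __name__ == "__main__":
    batch, num_images, dim = 2, 3, 256
    img = torch.randn(batch, num_images, dim)  # global image embeddings
    anp = torch.randn(batch, num_images, dim)  # corresponding local ANP embeddings
    fused = GatedFusion(dim)(img, anp)
    attended = ImageANPCrossAttention(dim)(fused, anp)
    print(fused.shape, attended.shape)  # both (2, 3, 256)
```

In this sketch the gate weighs global image content against ANP content element-wise, and the cross-attention layer lets images with similar semantic content borrow from the shared ANP representations; how these outputs are combined with the textual branch is not covered here.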