Abstract

Different parts of an image guide emotion to different degrees, so the key to image emotion recognition is fully exploiting the regions associated with emotion. This paper therefore proposes a two-branch visual sentiment classification model based on the vision transformer, termed Enhanced Effective Region and Context-Aware Vision Transformers (EERCA-ViT). The model comprises a primary branch and an auxiliary branch. The primary branch models interdependencies between patches through patch squeeze-and-excitation (P-SE), thereby highlighting effective-region features within the global representation. The auxiliary branch removes the feature patches tagged by the primary branch through a context-aware module (CAM), forcing the network to discover additional sentiment-discriminative regions. In addition, a two-part loss function is constructed to improve the robustness of the model. Extensive experiments on six benchmark datasets show that the proposed method outperforms state-of-the-art image sentiment analysis methods, and further comparative experiments demonstrate the effectiveness of the framework's individual modules (P-SE and CAM).
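The abstract does not give the exact formulation of P-SE or CAM, but the two ideas it describes — reweighting patch features with a squeeze-and-excitation gate, then masking the most highly weighted patches so the auxiliary branch must find other discriminative regions — can be sketched as follows. This is a minimal numpy illustration; all shapes, weight matrices, and function names here are hypothetical, not the paper's implementation.

```python
import numpy as np

def patch_se(x, w1, w2):
    """Patch squeeze-and-excitation sketch.

    x  : (num_patches, dim) patch features from a ViT backbone
    w1 : (hidden, num_patches) first excitation layer (assumed)
    w2 : (num_patches, hidden) second excitation layer (assumed)
    Returns reweighted patches and the per-patch attention weights.
    """
    s = x.mean(axis=1)                   # squeeze: one descriptor per patch
    h = np.maximum(0.0, w1 @ s)          # excitation MLP with ReLU
    a = 1.0 / (1.0 + np.exp(-(w2 @ h)))  # sigmoid -> weights in (0, 1)
    return x * a[:, None], a             # rescale each patch by its weight

def mask_top_patches(x, a, k):
    """Context-aware masking sketch: zero out the k most-attended
    patches so the auxiliary branch sees only the remaining regions."""
    x = x.copy()
    x[np.argsort(a)[-k:]] = 0.0          # drop the k highest-weight patches
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))         # 16 patches, 8-dim features (toy sizes)
w1 = rng.standard_normal((4, 16)) * 0.1
w2 = rng.standard_normal((16, 4)) * 0.1
y, a = patch_se(x, w1, w2)               # primary branch: enhanced patches
x_aux = mask_top_patches(x, a, k=3)      # auxiliary branch input
```

In this reading, the primary branch classifies from `y` while the auxiliary branch classifies from `x_aux`, and the two-part loss would combine both branches' classification losses.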
