Multi-modal sarcasm detection (MSD) is a challenging task. Despite the progress made by existing models, two principal hurdles persist. First, prevailing methods address only superficial disparities between textual inputs and their associated images, neglecting finer-grained inter-modal combinations. Second, sarcastic instances frequently involve complex emotional expressions, making it essential to leverage emotional cues across modalities to discern sarcastic intent. Accordingly, this work proposes a deep graph convolutional network that integrates cross-modal mapping information to identify salient incongruent sentiment expressions across modalities for multi-modal sarcasm detection. Specifically, we first design a cross-modal mapping network that captures the interaction between the two modalities by mapping text feature vectors and image feature vectors pairwise, compensating for information lost during multi-modal fusion. We then employ external knowledge in the form of adjective-noun pairs (ANPs) as a bridge, constructing cross-correlation graphs from highly correlated sarcastic cues and their connection weights between the image and text modalities. Finally, a GCN architecture with a retrieval-based attention mechanism captures the sarcastic cues. Experiments on two publicly available datasets demonstrate that our method significantly outperforms numerous contemporary models.
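To make the pairwise cross-modal mapping concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes pre-extracted text token features and image region features, projects both into a shared space, and scores every text-image pair so each modality can absorb context from the other before fusion. All layer names and dimensions here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalMapping(nn.Module):
    """Illustrative sketch: map each modality into a shared space and
    score every text-image pair to produce interaction features."""
    def __init__(self, d_text: int, d_img: int, d_common: int):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_common)   # text -> shared space
        self.img_proj = nn.Linear(d_img, d_common)     # image -> shared space

    def forward(self, text_feats, img_feats):
        # text_feats: (batch, n, d_text); img_feats: (batch, m, d_img)
        t = self.text_proj(text_feats)                  # (batch, n, d_common)
        v = self.img_proj(img_feats)                    # (batch, m, d_common)
        # Pairwise interaction: every text token against every image region.
        inter = torch.einsum('bnd,bmd->bnm', t, v)      # (batch, n, m)
        attn_tv = inter.softmax(dim=-1)                 # tokens attend to regions
        attn_vt = inter.softmax(dim=1).transpose(1, 2)  # regions attend to tokens
        text_ctx = attn_tv @ v                          # image-aware text features
        img_ctx = attn_vt @ t                           # text-aware image features
        return text_ctx, img_ctx
```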
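The graph stage can likewise be sketched under stated assumptions: suppose each node is a token or image region embedded in a shared d-dimensional space, and ANP embeddings act as the bridge, so two cross-modal nodes are linked in proportion to how strongly they co-activate the same ANPs. The edge definition below and the learnable-query form of the retrieval-based attention are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def anp_bridged_adjacency(nodes, anps):
    # nodes: (N, d) text + image node features; anps: (K, d) ANP embeddings.
    sim = F.normalize(nodes, dim=-1) @ F.normalize(anps, dim=-1).T  # (N, K)
    # Assumed edge rule: weight (i, j) by shared ANP activation.
    adj = torch.relu(sim @ sim.T)                                   # (N, N)
    adj = adj + torch.eye(len(nodes))                               # self-loops
    deg = adj.sum(-1).clamp(min=1e-6).rsqrt()
    return deg[:, None] * adj * deg[None, :]                        # D^-1/2 A D^-1/2

class GCNWithQueryAttention(nn.Module):
    """Two GCN layers followed by an attention readout that 'retrieves'
    the nodes carrying sarcastic cues (query mechanism is hypothetical)."""
    def __init__(self, d: int):
        super().__init__()
        self.w1, self.w2 = nn.Linear(d, d), nn.Linear(d, d)
        self.query = nn.Parameter(torch.randn(d))  # learnable retrieval query
        self.cls = nn.Linear(d, 2)                 # sarcastic / not sarcastic

    def forward(self, nodes, adj):
        h = torch.relu(adj @ self.w1(nodes))       # GCN layer 1
        h = torch.relu(adj @ self.w2(h))           # GCN layer 2
        scores = (h @ self.query).softmax(dim=0)   # attend to cue-bearing nodes
        pooled = (scores[:, None] * h).sum(0)      # weighted graph readout
        return self.cls(pooled)
```

The symmetric normalization D^{-1/2} A D^{-1/2} is the standard GCN propagation rule; the attention readout replaces plain mean pooling so that a few strongly incongruent nodes can dominate the sarcasm decision.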