The rising popularity of diverse social media platforms, which individuals commonly use to express their emotions in everyday interactions, has spurred growing interest in multi-modal sarcasm detection (MSD). Nonetheless, owing to the unique nature of sarcasm, two main challenges remain for robust MSD. First, prevailing methods often overlook weak multi-modal correlation and thereby neglect the crucial sarcasm cues inherent in each uni-modal source. Second, they model cross-modal interactions over unaligned multi-modal data inefficiently. To tackle these challenges, we introduce a multi-granularity information fusion network for multi-modal sarcasm detection. Specifically, we design a multi-task CLIP framework that exploits multi-granularity cues from multiple tasks (i.e., text, image, and text-image interaction tasks) for multi-modal sarcasm detection. Furthermore, we devise a global-local cross-modal interaction learning method that uses the discourse-level representation of each modality as the global multi-modal context, which interacts with local uni-modal features so that the two mutually enhance and progressively refine each other through multi-layer superposition. Extensive experiments and thorough analysis show that our model achieves state-of-the-art performance on multi-modal sarcasm detection.
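To make the global-local cross-modal interaction idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: module names, dimensions, the use of multi-head attention, and the way the discourse-level (CLS-style) text and image representations are combined into the initial global context are all illustrative assumptions, and for brevity only the refinement of the global context (not the reciprocal update of local features) is shown.

```python
import torch
import torch.nn as nn


class GlobalLocalInteractionLayer(nn.Module):
    """One layer: the global multi-modal context attends over local
    text tokens and image patches, then is refined with the gathered cues."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.global_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, global_ctx, text_feats, image_feats):
        # global_ctx: (B, 1, D); text_feats: (B, Lt, D); image_feats: (B, Li, D)
        txt_ctx, _ = self.global_to_text(global_ctx, text_feats, text_feats)
        img_ctx, _ = self.global_to_image(global_ctx, image_feats, image_feats)
        # Residual update of the global context with cues from both modalities.
        return self.fuse(torch.cat([txt_ctx, img_ctx], dim=-1)) + global_ctx


class GlobalLocalInteraction(nn.Module):
    """Stack several layers so the global context is progressively
    refined through multi-layer superposition."""

    def __init__(self, dim: int, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            GlobalLocalInteractionLayer(dim) for _ in range(num_layers)
        )

    def forward(self, text_cls, image_cls, text_feats, image_feats):
        # Discourse-level representations form the initial global context
        # (a simple sum is an assumption made for this sketch).
        global_ctx = (text_cls + image_cls).unsqueeze(1)  # (B, 1, D)
        for layer in self.layers:
            global_ctx = layer(global_ctx, text_feats, image_feats)
        return global_ctx.squeeze(1)  # fused representation for the classifier
```

In this sketch the fused representation would feed a sarcasm classification head; in the multi-task setting described above, separate text, image, and text-image interaction heads could share the same backbone features.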