Multimodal sentiment prediction is a formidable challenge that requires a deep understanding of both visual and linguistic cues, as well as the intricate interactions between them. The achievements of modern systems in this domain can largely be attributed to the development of sophisticated cross-modal fusion techniques. Nevertheless, such solutions often treat each modality equally, neglecting the discordant predictions that arise from sentiment incongruity between unimodal sources, which can degrade performance in conventional extraction-fusion pipelines. In this work, we take a different route and introduce an extraction-estimation-fusion paradigm aimed at learning more reliable multimodal representations under the supervision of unimodal sentiment prediction. To this end, we propose a Cross-modal IncongruiTy pErception NETwork, named CiteNet, for multimodal sentiment detection. In CiteNet, we first develop a cross-modal alignment module that synchronizes modality-specific representations through contrastive learning. With a refined cross-modal integration module, CiteNet then obtains a synergistic and comprehensive multimodal representation. In addition, we design a cross-modal incongruity learning module from an information-theoretic perspective, which estimates inherent sentiment disparities by analyzing the modal distributions. The resulting incongruity score serves as a crucial factor in the adaptive fusion of unimodal and multimodal representations, leading to more accurate sentiment prediction. Experimental results on two datasets demonstrate that CiteNet outperforms prior methods by a significant margin of approximately 1%–11% in accuracy.
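The adaptive fusion step described above can be pictured as a gating operation in which the estimated incongruity score controls how much weight the unimodal and multimodal representations receive. The sketch below is a minimal, hypothetical PyTorch illustration of that idea only; the class name `IncongruityGatedFusion`, the simple averaging of the unimodal features, and the linear classifier head are assumptions made for illustration and are not CiteNet's actual formulation.

```python
import torch
import torch.nn as nn


class IncongruityGatedFusion(nn.Module):
    """Hypothetical sketch of incongruity-weighted adaptive fusion.

    A scalar incongruity score in [0, 1] blends unimodal and multimodal
    representations: higher incongruity shifts weight toward the unimodal
    branches, while lower incongruity trusts the fused multimodal features.
    """

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, h_text, h_image, h_multi, incongruity):
        # incongruity: (batch, 1) score assumed to be estimated elsewhere,
        # e.g. from a divergence between text and image feature distributions.
        unimodal = 0.5 * (h_text + h_image)  # simple average of unimodal features (assumption)
        fused = (1.0 - incongruity) * h_multi + incongruity * unimodal
        return self.classifier(fused)


if __name__ == "__main__":
    batch, dim, num_classes = 4, 256, 3
    model = IncongruityGatedFusion(dim, num_classes)
    h_text = torch.randn(batch, dim)
    h_image = torch.randn(batch, dim)
    h_multi = torch.randn(batch, dim)
    # Placeholder incongruity scores; the paper derives these from modal distributions.
    incongruity = torch.sigmoid(torch.randn(batch, 1))
    logits = model(h_text, h_image, h_multi, incongruity)
    print(logits.shape)  # torch.Size([4, 3])
```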