Recent multimodal fake news detection methods often rely on the consistency between textual and visual content to assess the veracity of a news item: higher textual-visual consistency typically leads to a greater likelihood of classifying the item as real. However, a critical observation is that the creators of most fake news intentionally select images that align with the textual content in order to enhance its credibility. Consequently, high consistency between text and image alone cannot guarantee authenticity. To address this problem, we introduce a Multimodal Consistency-based Suppression Factor that modulates the weight of textual-visual consistency in the final judgment: when textual-visual matching is high, the suppression factor reduces the influence of consistency during classification. Moreover, we use the Contrastive Language-Image Pre-training (CLIP) model to extract features and measure the degree of consistency between modalities, which guides multimodal fusion. In addition, we employ a Variational Autoencoder (VAE)-based method to compress and fuse modality information by reconstructing the CLIP features, thereby learning a shared representation across the CLIP modalities. Finally, extensive experiments on three public datasets, Weibo, Twitter, and Weibo21, confirm that our method outperforms state-of-the-art approaches, improving accuracy by 0.8%, 2.6%, and 4.1%, respectively.
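The abstract does not specify the exact form of the suppression factor, so the following is only a minimal sketch of the idea: CLIP text and image embeddings are compared by cosine similarity as a consistency score, a hypothetical exponential suppression term (with an assumed steepness parameter `alpha`) down-weights the consistency-driven cross-modal term when similarity is already high, and the result is concatenated with both modalities for downstream classification. The function name, the exponential form, and the concatenation-based fusion are all illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn.functional as F

def consistency_suppression_fusion(text_feat, image_feat, alpha=5.0):
    """Fuse CLIP-style text/image features, down-weighting the consistency
    signal when cross-modal similarity is already high.

    text_feat, image_feat: (batch, dim) embeddings (e.g., from CLIP encoders).
    alpha: hypothetical steepness of the suppression curve (assumption).
    """
    # Cosine similarity serves as the textual-visual consistency score.
    t = F.normalize(text_feat, dim=-1)
    v = F.normalize(image_feat, dim=-1)
    consistency = (t * v).sum(dim=-1, keepdim=True)      # (batch, 1), in [-1, 1]
    consistency = (consistency + 1.0) / 2.0               # map to [0, 1]

    # Suppression factor: near 1 for low consistency, decaying toward 0 as
    # consistency grows, so high similarity alone does not dominate the verdict.
    suppression = torch.exp(-alpha * consistency)

    # The consistency-driven cross-modal term is scaled by the suppression
    # factor before being concatenated with both unimodal representations.
    cross = suppression * (t * v)
    fused = torch.cat([t, v, cross], dim=-1)               # (batch, 3 * dim)
    return fused, consistency, suppression

# Usage example with random stand-ins for CLIP embeddings (dim = 512).
text_feat = torch.randn(4, 512)
image_feat = torch.randn(4, 512)
fused, consistency, suppression = consistency_suppression_fusion(text_feat, image_feat)
print(fused.shape, consistency.squeeze(), suppression.squeeze())
```

In this sketch the fused vector would feed a classifier head, and the VAE-based reconstruction of CLIP features described in the abstract is omitted.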