Abstract

Disinformation created by artificial neural networks has become widespread with the recent rapid progress in multimodal learning and the rise of vision-language foundation models. Such disinformation has a substantial negative impact on society. To address this pressing issue, numerous efforts have been made to detect either image deepfakes or text manipulation. These methods generally focus on a single modality while ignoring the complementary knowledge provided by the other modality. In this paper, we aim to detect multimodal disinformation and further identify manipulated image regions or text tokens. To this end, we design a novel framework termed Vision-language Knowledge Interaction (ViKI) to explore the semantic correlation of an object across modalities. Specifically, we propose a vision-language embedding regulator that builds a joint feature space in which embeddings of the same semantics are well aligned. In addition, we perform cross-modality knowledge interaction to aggregate the uni-modal embeddings by adaptively injecting cross-modal information. By exploring vision-language knowledge jointly, ViKI produces accurate predictions for detecting and grounding disinformation. We demonstrate the superiority of ViKI through ablation studies and comparisons with state-of-the-art methods on large-scale benchmarks. Notably, ViKI outperforms the state-of-the-art works by 3.71% in precision and 2.14% in CF1.
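
The abstract describes the two key components only at a high level. The following PyTorch sketch illustrates one plausible form such a scheme could take: projecting image and text tokens into a shared embedding space, then adaptively injecting cross-modal context into each uni-modal stream via gated cross-attention. This is not the authors' implementation; the module name, gating scheme, and all dimensions are assumptions made for illustration.

```python
# Minimal sketch (assumed design, not the released ViKI code) of joint embedding
# alignment plus adaptive cross-modal injection via gated cross-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalInteraction(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Project both modalities into a shared (joint) embedding space.
        self.img_proj = nn.Linear(dim, dim)
        self.txt_proj = nn.Linear(dim, dim)
        # Cross-attention: each modality queries the other.
        self.img_from_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_from_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned gates control how much cross-modal information is injected.
        self.img_gate = nn.Parameter(torch.zeros(1))
        self.txt_gate = nn.Parameter(torch.zeros(1))

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # img_tokens: (B, N_img, dim) patch embeddings; txt_tokens: (B, N_txt, dim)
        img = F.normalize(self.img_proj(img_tokens), dim=-1)
        txt = F.normalize(self.txt_proj(txt_tokens), dim=-1)
        # Adaptively inject text knowledge into image tokens and vice versa.
        img_ctx, _ = self.img_from_txt(img, txt, txt)
        txt_ctx, _ = self.txt_from_img(txt, img, img)
        img_out = img + torch.tanh(self.img_gate) * img_ctx
        txt_out = txt + torch.tanh(self.txt_gate) * txt_ctx
        return img_out, txt_out
```

In this sketch the gates start at zero, so training begins from purely uni-modal embeddings and learns how strongly to mix in the other modality; the enriched token sequences could then feed separate heads for manipulated-region and manipulated-token grounding.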
