Triple-modality interaction for deepfake detection on zero-shot identity

Junho Yoon,Angel Panizo-Lledot,David Camacho,Chang Choi

doi:10.1016/j.inffus.2024.102424

Abstract

Recent advancements in generative AI technology have created more realistic fake data that are utilized in various fields, such as data augmentation. However, the misuse of deepfake technology has led to increased damage. Consequently, ongoing research aims to analyze modality characteristics and detect deepfakes through AI-based methods. Existing AI-based deepfake-detection techniques have limitations in detecting deepfakes in modalities and identities that are not included in the training data. This study proposes a baseline approach based on zero-shot identity and one-shot deepfake detection for detecting deepfakes in environments with limited data. Additionally, we propose a triple-modality interaction based on a multimodal transformer (TMI-Former) to consider the triple-modality aspects of deepfakes. TMI-Former comprises four stages: vision feature extraction, representation, residual connection, and late-level fusion. It operates in a two-stage manner, extracting visual features and reconstructing them using auditory and linguistic features, thereby allowing for triple-modality interactions. In environments with limited data, such as zero-shot identity and one-shot deepfake scenarios, TMI-Former demonstrated effectiveness, with an accuracy ranging from 18.75% to 19.5% and an f1-score ranging from 0.2238 to 0.3561, compared to unimodal AI. Furthermore, TMI-Former shows superior performance compared to the existing multi-modal AI, with an accuracy ranging from 1.44% to 19.75% and an f1-score ranging from 0.0146 to 0.4169.

Full Text