The increasing availability of multimedia content on social media, together with lightweight deep learning (DL)-powered tools such as Generative Adversarial Networks (GANs), has enabled the generation of highly realistic deepfakes. Such fabricated content can be used to spread disinformation and revenge porn, perpetrate monetary scams, and cause other harmful and illegal societal consequences. Hence, accurate identification of deepfakes is essential to discriminate between real and manipulated content. In this work, we present a DL-based approach, namely A Unified network for FaceSwap (FS) and Face-Reenactment (FR) Deepfakes Detection (AUFF-Net). Specifically, both spatial and temporal information from video samples is used to detect two types of visual manipulation, i.e., FS and FR. To this end, a novel DL framework, namely the Inception-Swish-ResNet-v2 model, is introduced as a feature extractor to compute spatial information, while a Bi-LSTM model is employed to capture temporal information. Additionally, three dense layers are appended at the end of the network to produce a discriminative set of features. We performed extensive experiments on the challenging FaceForensics++ dataset and attained average accuracy values of 99.21% and 98.32% for FS and FR, respectively. Furthermore, we introduce an explainability module to demonstrate the reliable keypoint-selection capability of our technique. Moreover, we perform a cross-dataset evaluation to demonstrate the generalization power of our approach. Both qualitative and quantitative results confirm the effectiveness of the proposed approach for visual manipulation classification, even in the presence of various adversarial attacks.
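As a rough, non-authoritative sketch of the pipeline described above, the following PyTorch code illustrates the general structure: a per-frame CNN feature extractor feeding a Bi-LSTM, with three dense layers on top for the real/fake decision. The paper's Inception-Swish-ResNet-v2 backbone is not publicly specified, so a stock ResNet-18 stands in as a placeholder, and all layer widths and the sequence length are assumed values for illustration only.

```python
# Minimal sketch of a spatial (CNN) + temporal (Bi-LSTM) deepfake detector.
# Backbone, feature dimension, hidden size, and dense-layer widths are
# hypothetical placeholders, not the values used by AUFF-Net.
import torch
import torch.nn as nn
from torchvision import models


class SpatioTemporalDeepfakeNet(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=2):
        super().__init__()
        # Spatial stream: per-frame CNN features (placeholder backbone
        # in lieu of the paper's Inception-Swish-ResNet-v2).
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()  # expose the 512-d pooled features
        self.cnn = backbone
        # Temporal stream: Bi-LSTM over the sequence of frame features.
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        # Three dense layers, mirroring the abstract's description of a
        # discriminative feature head followed by the classification.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, clips):  # clips: (B, T, 3, H, W)
        b, t, c, h, w = clips.shape
        # Run the CNN on every frame, then regroup features per clip.
        feats = self.cnn(clips.reshape(b * t, c, h, w)).view(b, t, -1)
        seq, _ = self.bilstm(feats)      # (B, T, 2 * hidden)
        return self.head(seq[:, -1])     # classify from the last time step


# Usage: a batch of 2 clips with 8 frames each at 224x224 resolution.
logits = SpatioTemporalDeepfakeNet()(torch.randn(2, 8, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 2])
```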