Abstract

The availability of large-scale facial datasets, together with the rapid progress of deep learning techniques such as Generative Adversarial Networks, has enabled anyone to create realistic fake videos. These fake videos can become harmful when used for fake news, hoaxes, and identity fraud. We propose a deep learning bagging ensemble classifier to detect manipulated faces in videos. The proposed bagging classifier uses the convolution and self-attention network (CoAtNet) model as its base learner. CoAtNet vertically stacks depthwise convolution layers and self-attention layers in a way that improves generalization, capacity, and efficiency. The depthwise convolutions capture local features from faces extracted from the video and pass them to the self-attention layers, which extract global information and efficiently capture long-range dependencies among spatial details. Each base learner is trained on a different subset of the training data sampled randomly with replacement, and the models’ predictions are then combined to classify the video as either real or fake. We also apply CutMix data augmentation to the extracted faces to enhance the generalization and localization performance of the base learner. Our experimental results show that the proposed method outperforms state-of-the-art methods, achieving AUC values of 99.70%, 97.49%, 98.90%, and 87.62% on the four manipulation techniques of the FaceForensics++ dataset (DeepFakes (DF), Face2Face (F2F), FaceSwap (FS), and NeuralTextures (NT)), respectively, and 99.74% on the Celeb-DF dataset.
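
To make the described pipeline concrete, the sketch below illustrates the three ingredients named in the abstract: bootstrap sampling with replacement for the bagging ensemble, CoAtNet base learners, and CutMix applied to extracted face crops. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes a CoAtNet backbone is available through the timm library (the coatnet_0_rw_224 variant is used here as an example), and names such as make_base_learner, train_bagged_ensemble, bagged_predict, and N_LEARNERS are hypothetical.

```python
# Minimal sketch of a bagging ensemble of CoAtNet learners with CutMix.
# Assumes face crops have already been extracted from video frames into a
# PyTorch dataset of (image, label) pairs; all names here are illustrative.
import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset
import timm  # assumption: timm provides CoAtNet variants such as 'coatnet_0_rw_224'

N_LEARNERS = 5  # ensemble size, hypothetical

def make_base_learner():
    # CoAtNet backbone with a binary (real/fake) classification head.
    return timm.create_model("coatnet_0_rw_224", pretrained=False, num_classes=2)

def cutmix(images, labels, alpha=1.0):
    # CutMix: paste a random patch from a shuffled copy of the batch and
    # mix the labels in proportion to the patch area.
    lam = np.random.beta(alpha, alpha)
    idx = torch.randperm(images.size(0))
    h, w = images.shape[-2:]
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    images[:, :, y1:y2, x1:x2] = images[idx, :, y1:y2, x1:x2]
    lam = 1 - (y2 - y1) * (x2 - x1) / (h * w)  # recompute after clipping the box
    return images, labels, labels[idx], lam

def train_bagged_ensemble(dataset, epochs=1, device="cpu"):
    models, n = [], len(dataset)
    for _ in range(N_LEARNERS):
        # Bootstrap sample: draw n indices with replacement for this learner.
        boot = np.random.choice(n, size=n, replace=True)
        loader = DataLoader(Subset(dataset, boot.tolist()), batch_size=32, shuffle=True)
        model = make_base_learner().to(device)
        opt = torch.optim.Adam(model.parameters(), lr=1e-4)
        model.train()
        for _ in range(epochs):
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                x, y_a, y_b, lam = cutmix(x, y)
                logits = model(x)
                # CutMix loss: weighted sum of losses against both label sets.
                loss = lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)
                opt.zero_grad()
                loss.backward()
                opt.step()
        models.append(model.eval())
    return models

@torch.no_grad()
def bagged_predict(models, faces):
    # Combine the learners by averaging their softmax outputs (soft voting).
    probs = torch.stack([F.softmax(m(faces), dim=1) for m in models]).mean(0)
    return probs.argmax(1)  # label convention assumed: 0 = real, 1 = fake
```

In practice, per-frame face predictions would be aggregated (for example by averaging) to label the whole video as real or fake; the exact aggregation rule is not specified in the abstract.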
