Abstract

With the advent of image generative technologies, facial manipulation techniques have grown rapidly, allowing people to easily modify media such as videos and images by replacing the identity or facial expression of a target person with another person's face. Colloquially, these manipulated videos and images are termed "deepfakes". As a result, every piece of digital media content now raises the question of whether it is authentic. Hence, there is an unprecedented need for competent deepfake detection methods. Rapid changes in forging methods make detection very challenging, so the generalization ability of detection methods is of utmost importance. However, the generalization ability of prevailing deepfake detection methods is not satisfactory: these models perform well when trained and tested on the same dataset, but fail to perform satisfactorily when trained on one dataset and tested on another. Most modern deep learning-aided deepfake detection techniques look for a consistent pattern among the leftover artifacts in specific facial regions of the target face rather than across the entire face. To this end, we propose a Vision Transformer with Xception Network (ViXNet) to learn the consistency of these almost imperceptible artifacts left by deepfake generation methods over the entire facial region. ViXNet comprises two branches: one learns inconsistencies among local face regions by combining a patch-wise self-attention module with a vision transformer, while the other generates global spatial features using a deep convolutional neural network. To assess the performance of ViXNet, we evaluate it under two experimental setups, intra-dataset and inter-dataset, using two standard deepfake video datasets, namely FaceForensics++ and Celeb-DF (V2), and one deepfake image dataset called Deepfakes. We attain 98.57% (83.60%), 99.26% (74.78%), and 98.93% (75.13%) AUC scores in the intra-dataset (inter-dataset) setup on FaceForensics++, Celeb-DF (V2), and Deepfakes, respectively. Additionally, we evaluate ViXNet on the Deepfake Detection Challenge (DFDC) dataset, obtaining an 86.32% AUC score and a 79.06% F1-score. The performance of the proposed model is comparable to state-of-the-art methods, and the obtained results demonstrate its robustness and generalization ability.
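The two-branch design described in the abstract can be sketched in a few lines of PyTorch. The listing below is a minimal illustration, not the authors' implementation: the module names (PatchTransformerBranch, ConvBranch, TwoBranchDeepfakeNet), the layer sizes, the plain CNN standing in for the Xception backbone, and the concatenation-based fusion of the two branches are all assumptions made for clarity.

    # Minimal sketch of a two-branch deepfake detector: a patch-wise
    # transformer branch for local inconsistencies plus a CNN branch for
    # global spatial features. Hypothetical names and sizes; not ViXNet itself.
    import torch
    import torch.nn as nn

    class PatchTransformerBranch(nn.Module):
        """Splits the face image into patches and applies self-attention,
        approximating the patch-wise attention + vision transformer branch."""
        def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
            super().__init__()
            n_patches = (img_size // patch) ** 2
            # Patch embedding via a strided convolution
            self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
            self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, x):
            x = self.embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
            x = self.encoder(x + self.pos)
            return x.mean(dim=1)                          # (B, dim)

    class ConvBranch(nn.Module):
        """Small CNN standing in for the Xception backbone that extracts
        global spatial features."""
        def __init__(self, dim=256):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )

        def forward(self, x):
            return self.features(x).flatten(1)            # (B, dim)

    class TwoBranchDeepfakeNet(nn.Module):
        """Fuses local (transformer) and global (CNN) features and predicts
        a real/fake probability."""
        def __init__(self, dim=256):
            super().__init__()
            self.local_branch = PatchTransformerBranch(dim=dim)
            self.global_branch = ConvBranch(dim=dim)
            self.head = nn.Linear(2 * dim, 1)

        def forward(self, x):
            z = torch.cat([self.local_branch(x), self.global_branch(x)], dim=1)
            return torch.sigmoid(self.head(z))

    if __name__ == "__main__":
        model = TwoBranchDeepfakeNet()
        fake_prob = model(torch.randn(2, 3, 224, 224))
        print(fake_prob.shape)  # torch.Size([2, 1])

In this sketch the two feature vectors are simply concatenated before the classification head; the paper's actual fusion strategy, backbone weights, and training details are described in the full text.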
