In recent years, public security incidents caused by deepfake technology have occurred frequently worldwide, making efficient and accurate deepfake detection models essential. Existing state-of-the-art methods train complex neural networks on manipulation features in the image to perform binary classification of real and fake images. However, these models rely on a single type of manipulation feature, so their detection accuracy degrades sharply when the forgery technique or image quality differs between the training and validation datasets. Motivated by this limitation, we propose a two-stream collaborative learning framework that combines spatial texture differences with frequency-domain information. An average difference convolution (ADC) is designed to extract spatial texture-difference information from the image, and a gray-image frequency-aware decomposition (GFAD) extracts artifact information in the frequency domain. A Vision Transformer (ViT) combined with a cross-attention mechanism then fuses the two feature streams to comprehensively mine forgery traces in manipulated images. Experimental results show that the proposed model performs well on three benchmark datasets; in cross-dataset evaluation, it achieves an AUC of 82.86% on Celeb-DF, surpassing existing state-of-the-art methods.
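The abstract only names the ADC operation without defining it. Purely as an illustration of what a texture-difference convolution can look like, here is a minimal PyTorch sketch; the class name `AverageDifferenceConv`, the box-filter local average, and the `theta` blending weight are assumptions made for exposition, not the paper's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AverageDifferenceConv(nn.Module):
    """Hypothetical sketch of an average difference convolution (ADC):
    convolve the deviation of each pixel from its local neighborhood
    average, so the layer responds to fine texture differences rather
    than absolute intensity, blended with a vanilla convolution."""

    def __init__(self, in_ch: int, out_ch: int,
                 kernel_size: int = 3, theta: float = 0.7):
        super().__init__()
        self.k = kernel_size
        self.theta = theta  # blending weight (assumed, not from the paper)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Box-filter estimate of the local average over the kernel's
        # receptive field (assumption: ADC differences against this mean).
        local_avg = F.avg_pool2d(x, self.k, stride=1, padding=self.k // 2)
        # Texture-difference response plus a fraction of the plain response.
        return (self.theta * self.conv(x - local_avg)
                + (1.0 - self.theta) * self.conv(x))

# Usage: a 3x3 ADC layer over an RGB input.
adc = AverageDifferenceConv(3, 64)
features = adc(torch.randn(1, 3, 224, 224))  # -> (1, 64, 224, 224)
```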