Abstract

Significant progress has been made in generating counterfeit images and videos. Forged videos generated by deepfake techniques have spread widely and caused severe societal impacts, stirring up public concern about automatic deepfake detection technology. Recently, many deepfake detection methods based on forged features have been proposed. Among these, textural features are widely used. However, most current texture-based detection methods extract textures directly from RGB images, ignoring mature spectral analysis methods. Therefore, this research proposes MFF-Net, a deepfake detection network that fuses RGB features with textural information extracted by both neural networks and signal processing methods. Specifically, it consists of four key components: (1) a feature extraction module that extracts textural and frequency information using Gabor convolution and residual attention blocks; (2) a texture enhancement module that zooms into the subtle textural features of shallow layers; (3) an attention module that forces the classifier to focus on the forged regions; and (4) two feature fusion stages, the first fusing textural features from the shallow RGB branch and the feature extraction module, and the second fusing those textural features with semantic information. Moreover, we introduce a new diversity loss that forces the feature extraction module to learn features of different scales and directions. Experimental results show that MFF-Net generalizes well and achieves state-of-the-art performance on various deepfake datasets.
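As a rough illustration of the Gabor convolution underlying the feature extraction module, the sketch below builds a small bank of Gabor kernels at several orientations; convolving an image with such a bank yields oriented texture responses. The function name and all hyperparameter values here are illustrative assumptions, not taken from the paper.

```python
import math

def gabor_kernel(size, theta, sigma=2.0, lambd=4.0, gamma=0.5, psi=0.0):
    """Real part of a Gabor kernel: a Gaussian envelope modulating a
    cosine wave oriented at angle `theta` (radians).

    Illustrative defaults only -- sigma (envelope width), lambd
    (wavelength), gamma (aspect ratio), psi (phase) are assumptions.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates into the filter's orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr**2 + (gamma * yr) ** 2) / (2 * sigma**2))
            carrier = math.cos(2 * math.pi * xr / lambd + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

# A bank covering four orientations, as a texture branch might use.
bank = [gabor_kernel(7, k * math.pi / 4) for k in range(4)]
```

In a Gabor convolution layer, kernels like these (at multiple scales and orientations) replace or initialize ordinary convolution weights, which is one way a diversity loss across scales and directions becomes meaningful.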

Highlights

  • Artificial intelligence has transformed all aspects of life, including facial recognition, fast identity authentication, logging into mobile apps, and making payments

  • We argue that using the features obtained by a signal processing method as auxiliary input can be more efficient for collecting textural and frequency features for the deepfake detection task

  • We are the first to combine a signal processing method with a neural network to fuse the textural and frequency features extracted by Gabor convolution with the shallow textural information of RGB images


Introduction

Artificial intelligence has transformed all aspects of life, including facial recognition, fast identity authentication, logging into mobile apps, and making payments. At the same time, generated fake images and videos have become so realistic that most of them cannot be distinguished by the human eye. A survey by whichfaceisreal.com [5] shows that users cannot reliably distinguish real images from fake ones: although human recognition is generally believed to be better than random guessing, users in the survey achieved a maximum recognition accuracy of only 75%. There are many mature tools for generating fake images and videos, such as FaceApp [6] and FaceSwap [7]. The DCT has general orthogonal transform properties, and its basis vectors can describe the relevant characteristics of human speech and image signals.
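To make the DCT remark concrete, here is a minimal, naive 1-D DCT-II in pure Python (no normalization); the function name is ours and the example is illustrative, not code from the paper. High-frequency DCT coefficients are a common place to look for generation artifacts.

```python
import math

def dct2(signal):
    """Naive 1-D DCT-II: projects the signal onto cosine basis vectors.

    The k-th coefficient measures how strongly the signal varies at
    frequency k. Normalization is omitted for brevity.
    """
    n = len(signal)
    coeffs = []
    for k in range(n):
        # Basis vector: cos(pi * (2i + 1) * k / (2n))
        c = sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i, x in enumerate(signal))
        coeffs.append(c)
    return coeffs

# A constant signal concentrates all its energy in the DC (k = 0) term.
flat = dct2([1.0, 1.0, 1.0, 1.0])
```

This energy-compaction property is exactly why the DCT basis describes image signals so economically: smooth regions collapse into a few low-frequency coefficients.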
