Abstract

Synthesizing videos with forged faces has caused severe security issues in recent years, making face forgery detection a fundamental and safety-critical task. Although many existing face forgery detection methods achieve superior performance on such synthetic videos, they are severely limited by domain-specific training data and generally perform unsatisfactorily when transferred to cross-dataset scenarios because of domain gaps. Based on this observation, in this paper we propose a multi-level feature disentanglement network that is robust to the domain bias induced by the different types of fake artifacts found in different datasets. Specifically, we first detect the face image and transform it into both color-aware and frequency-aware inputs for multi-modal contextual representation learning. We then introduce a novel feature disentangling module that uses a pair of complementary attention maps to separate the synthetic features into realistic features and features of fake artifacts. Since the features of fake artifacts are obtained indirectly from the latent features rather than from a dataset-specific distribution, our forgery detection model is robust to dataset-specific domain gaps. By applying the disentangling module at multiple levels of the feature extraction network with multi-modal inputs, we obtain more robust feature representations. In addition, a realistic-aware adversarial loss and a domain-aware adversarial loss are adopted to drive the network toward better feature disentanglement and extraction. Extensive experiments on four datasets verify the generalization ability of our method and demonstrate state-of-the-art performance.
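
To make the disentangling idea concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation): the class name `DisentangleBlock`, the 1x1-convolution attention head, and the per-level usage are assumptions made for illustration. It shows how a pair of complementary attention maps, A and 1 - A, can split a latent feature map into realistic features and fake-artifact features at one level of a backbone.

```python
# Hypothetical sketch of complementary-attention feature disentanglement.
# Assumed names and layer choices; not the paper's released code.
import torch
import torch.nn as nn


class DisentangleBlock(nn.Module):
    """Splits a feature map F into realistic features A*F and artifact features (1-A)*F."""

    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 conv followed by a sigmoid produces an attention map in [0, 1].
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat: torch.Tensor):
        a = self.attn(feat)            # attention map A
        real_feat = a * feat           # features attributed to realistic content
        fake_feat = (1.0 - a) * feat   # complementary map captures forgery artifacts
        return real_feat, fake_feat


# Illustrative multi-level usage: one block per backbone stage.
if __name__ == "__main__":
    feats = [torch.randn(1, c, s, s) for c, s in [(64, 56), (128, 28), (256, 14)]]
    blocks = [DisentangleBlock(c) for c in (64, 128, 256)]
    for block, feat in zip(blocks, feats):
        real_feat, fake_feat = block(feat)
        print(real_feat.shape, fake_feat.shape)
```

In such a design, the artifact branch would feed the forgery classifier while the realistic branch and the adversarial losses described above could encourage the split to be independent of any single dataset's artifact distribution; the exact loss wiring is not specified by the abstract.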
