Abstract
Recent progress in deep learning, in particular in generative models, makes it easier to synthesize sophisticated forged faces in videos, posing severe threats to personal privacy and reputation on social media. It is therefore highly necessary to develop forensic approaches that distinguish forged videos from authentic ones. Existing works focus on exploring frame-level cues but fall short of leveraging the rich temporal information. Although some approaches identify forgeries from the perspective of motion inconsistency, a promising spatiotemporal feature fusion strategy is still lacking. Towards this end, we propose the Channel-Wise Spatiotemporal Aggregation (CWSA) module to fuse deep features of consecutive video frames without any recurrent units. Our approach starts by cropping the face region with some background retained, which transforms the learning objective from the manipulations themselves to the difference between pristine and manipulated pixels. A deep convolutional neural network (CNN) with skip connections, which are conducive to preserving detection-helpful low-level features, is then used to extract frame-level features. The CWSA module finally makes the real-or-fake decision by aggregating deep features over the frame sequence. Evaluation against a list of large facial video manipulation benchmarks illustrates its effectiveness. On all three datasets, FaceForensics++, Celeb-DF, and DeepFake Detection Challenge Preview, the proposed approach outperforms the state-of-the-art methods with significant advantages.
Highlights
Recent progress in deep learning, in particular in generative models, makes it easier to synthesize sophisticated forged faces in videos, posing severe threats to personal privacy and reputation on social media
On FaceForensics++, Celeb-DF, and DeepFake Detection Challenge Preview, the proposed approach outperforms the state-of-the-art methods with significant advantages
According to the clues used, the detection approaches for face video manipulation can be mainly divided into two categories: intraframe information based and interframe information based. The former focuses on spatial artifacts and realizes video manipulation detection by processing independent frames. The latter captures the dynamic flaws in videos through temporal models like the Recurrent Neural Network (RNN) [3] or optical flow [4]
Summary
With the help of a Nonnegative Matrix Factorization model and histograms of Discrete Cosine Transform coefficients, multiple JPEG compression can be successfully detected and, indirectly, the authenticity of images. Another kind of popular approach is to discover clues that are related to the camera itself. Most dynamic-artifact-based detection approaches first utilize a CNN backbone to extract features of every single frame. By modeling face and head movements as the unique speaking pattern of a specific individual, a high prediction error can be a strong hint of a fake. Biological signals such as eye blinking and pulse are discriminating cues to expose DeepFakes. The proposed CWSA module recombines the feature maps into a new feature sequence, which is compressed to a vector; a single neuron with sigmoid activation is connected to it and makes the real-or-fake classification. The pipeline of the proposed CWSA is summarized in Algorithm 1
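The aggregation-and-classification step described above can be illustrated with a minimal sketch. This is not the paper's actual CWSA implementation: the function names (`cwsa_aggregate`, `classify`), the uniform temporal weights, and the toy feature dimensions are all assumptions made for illustration. It only shows the general idea of combining each channel's values across the frame sequence into one video-level vector and feeding that vector to a single sigmoid neuron.

```python
import math
import random

def cwsa_aggregate(frame_feats, temporal_w):
    """Channel-wise aggregation (simplified, hypothetical): for each channel c,
    combine that channel's values across all T frames with a weighted sum,
    producing one C-dimensional video-level vector."""
    T = len(frame_feats)
    C = len(frame_feats[0])
    return [sum(temporal_w[t] * frame_feats[t][c] for t in range(T))
            for c in range(C)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classify(video_vec, w, b):
    """Single neuron with sigmoid activation over the aggregated vector,
    returning a fake probability in (0, 1)."""
    return sigmoid(sum(wi * vi for wi, vi in zip(w, video_vec)) + b)

# Toy example: 4 frames, each with a 3-channel frame-level feature.
random.seed(0)
feats = [[random.random() for _ in range(3)] for _ in range(4)]
temporal_w = [0.25] * 4  # uniform temporal weights (an assumption; a real
                         # module would learn these from data)
vec = cwsa_aggregate(feats, temporal_w)
prob = classify(vec, w=[0.5, -0.2, 0.1], b=0.0)
print(len(vec), 0.0 < prob < 1.0)
```

In practice the per-frame features would come from the CNN backbone and the temporal and classifier weights would be learned end-to-end; the sketch fixes them by hand only to keep the example self-contained.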