Abstract

While the rapid expansion of DeepFake generation techniques has had a serious impact on human society, detecting DeepFake videos remains challenging because the manipulated content in each frame is highly plausible and its artifacts are not visually apparent. To address this, this paper proposes a two-stream method that captures spatial and temporal inconsistency cues and then interactively fuses them to detect DeepFake videos. Since traces of spatial inconsistency in DeepFake video frames mainly appear in their structural information, which is reflected by the phase component in the frequency domain, the proposed frame-level stream learns spatial inconsistency from phase-based reconstructed frames, avoiding overfitting to content information. To ensure that the temporal inconsistency in DeepFake videos is not overlooked, a temporality-level stream is proposed to extract temporal correlation features from multiple consecutive frames using temporal difference networks and a stacked ConvGRU module. By exchanging information between the two streams through channel attention in their intermediate layers, and adaptively fusing the discriminative features of the two streams from a global-local perspective, the proposed method outperforms state-of-the-art detection methods.
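To make the phase-based reconstruction idea concrete, below is a minimal NumPy sketch (the function name and the final normalization are assumptions, not taken from the paper): each channel's 2-D FFT amplitude is replaced by a constant so that only the phase component, which carries the structural information, survives the inverse transform.

```python
import numpy as np

def phase_only_reconstruction(frame: np.ndarray) -> np.ndarray:
    """Reconstruct a frame from its phase spectrum alone.

    frame: H x W x C array. For each channel, take the 2-D FFT, discard
    the amplitude (set it to 1), and inverse-transform, so the result
    retains structural information while suppressing content/texture.
    """
    out = np.empty(frame.shape, dtype=np.float64)
    for c in range(frame.shape[-1]):
        spectrum = np.fft.fft2(frame[..., c].astype(np.float64))
        phase = np.angle(spectrum)                     # keep phase only
        recon = np.fft.ifft2(np.exp(1j * phase)).real  # unit-amplitude inverse FFT
        # rescale to [0, 1] for use as network input (an assumption)
        out[..., c] = (recon - recon.min()) / (recon.max() - recon.min() + 1e-8)
    return out
```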
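As a rough illustration of the temporality-level stream, the PyTorch sketch below (all names hypothetical) computes simple differences between consecutive frames and aggregates them with a single ConvGRU cell; the paper's temporal difference networks and stacked ConvGRU are more elaborate, so this only sketches the underlying idea.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """One ConvGRU cell: GRU gating computed with convolutions so the
    hidden state keeps its spatial layout (B, hidden_ch, H, W)."""

    def __init__(self, in_ch: int, hidden_ch: int, kernel: int = 3):
        super().__init__()
        pad = kernel // 2
        # update (z) and reset (r) gates computed jointly
        self.gates = nn.Conv2d(in_ch + hidden_ch, 2 * hidden_ch, kernel, padding=pad)
        # candidate hidden state
        self.cand = nn.Conv2d(in_ch + hidden_ch, hidden_ch, kernel, padding=pad)
        self.hidden_ch = hidden_ch

    def forward(self, x, h=None):
        if h is None:
            h = x.new_zeros(x.size(0), self.hidden_ch, x.size(2), x.size(3))
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_cand = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_cand  # standard GRU state update

# toy usage: 8 consecutive RGB frames, crude temporal differences as input
frames = torch.rand(8, 3, 64, 64)        # (T, C, H, W)
diffs = frames[1:] - frames[:-1]         # (T-1, C, H, W) frame-to-frame change
cell, h = ConvGRUCell(in_ch=3, hidden_ch=16), None
for t in range(diffs.size(0)):
    h = cell(diffs[t].unsqueeze(0), h)   # h accumulates temporal correlation
```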
