Abstract

Deepfake videos pose a significant challenge in the current media landscape. While existing deepfake detection methods achieve satisfactory performance, their ability to generalize to unseen scenarios, particularly those involving imperceptible cues, still leaves room for improvement. This paper introduces a novel multi-modal deepfake detection model, the SpectraVisionFusion Transformer (SVFT), which exploits spatial- and frequency-domain statistical artifacts to improve generalization. The SVFT framework uses two distinct backbone encoders to capture spatial- and frequency-domain cues in video sequences, together with a decoder for joint cross-attention and a classifier for the final prediction. The spatial-domain branch uses a convolutional transformer-based encoder to analyze facial visual features, whereas the frequency-domain branch employs a language transformer encoder. In addition, we introduce a weighted feature-embedding fusion mechanism that integrates spectral statistical feature embeddings with visual cues to produce a more comprehensive and balanced spatial-frequency representation. By jointly analyzing these modalities, our model exhibits improved detection and generalization capability in unseen scenarios. The proposed SVFT model achieves 92.57% and 80.63% accuracy in extensive cross-manipulation and cross-dataset evaluations, respectively, surpassing traditional and single-domain approaches.
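As a rough illustration of the weighted feature-embedding fusion idea described above (the abstract gives no implementation details, so all class names, tensor shapes, and the softmax weighting scheme below are assumptions, not the authors' code), a minimal sketch might learn one scalar weight per modality and combine the two branch embeddings before classification:

```python
import torch
import torch.nn as nn

class WeightedFeatureFusion(nn.Module):
    """Hypothetical sketch of weighted spatial-frequency embedding fusion.

    Learns one scalar weight per modality, normalizes the weights with a
    softmax, and returns the weighted combination of the two embeddings.
    Shapes and names are illustrative assumptions.
    """
    def __init__(self, dim: int):
        super().__init__()
        # Learnable fusion logits: one per modality (spatial, frequency).
        self.fusion_logits = nn.Parameter(torch.zeros(2))
        self.proj = nn.Linear(dim, dim)

    def forward(self, spatial_emb: torch.Tensor, freq_emb: torch.Tensor) -> torch.Tensor:
        # Normalize the two weights so they sum to 1.
        w = torch.softmax(self.fusion_logits, dim=0)
        fused = w[0] * spatial_emb + w[1] * freq_emb
        return self.proj(fused)

# Usage: fuse per-sequence embeddings from both branches, then classify.
spatial = torch.randn(8, 512)   # e.g. convolutional-transformer branch output
freq = torch.randn(8, 512)      # e.g. language-transformer branch output on spectra
fusion = WeightedFeatureFusion(dim=512)
classifier = nn.Linear(512, 2)  # real vs. fake
logits = classifier(fusion(spatial, freq))
```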
