To address the issues of target feature blurring and increased false detections caused by high compression rates in deepfake videos, as well as the high computational resource requirements of existing face extractors, we propose a lightweight face extractor to assist deepfake detection, GCS-YOLOv8. Firstly, we employ the HGStem module for initial downsampling to address the issue of false detections of small non-face objects in deepfake videos, thereby improving detection accuracy. Secondly, we introduce the C2f-GDConv module to mitigate the low-FLOPs pitfall while reducing the model's parameters, thereby lightening the network. Additionally, we add a new P6 large target detection layer to expand the receptive field and capture multi-scale features, solving the problem of detecting large-scale faces in low-compression deepfake videos. We also design a cross-scale feature fusion module called CCFG (CNN-based Cross-Scale Feature Fusion with GDConv), which integrates features from different scales to enhance the model's adaptability to scale variations while reducing network parameters, addressing the high computational resource requirements of traditional face extractors. Furthermore, we improve the detection head by utilizing group normalization and shared convolution, simplifying the process of face detection while maintaining detection performance. The training dataset was also refined by removing low-accuracy and low-resolution labels, which reduced the false detection rate. Experimental results demonstrate that, compared to YOLOv8, this face extractor achieves the AP of 0.942, 0.927, and 0.812 on the WiderFace dataset's Easy, Medium, and Hard subsets, representing improvements of 1.1%, 1.3%, and 3.7% respectively. The model's parameters and FLOPs are only 1.68 MB and 3.5 G, reflecting reductions of 44.2% and 56.8%, making it more effective and lightweight in extracting faces from deepfake videos.
Read full abstract