Automatic Speaker Verification (ASV) systems are crucial in many fields, enabling speaker identification for authentication, fraud detection, and forensic applications. While the simplicity and effectiveness of speech biometrics drive demand for ASV systems, their growing popularity raises concerns about vulnerability to voice spoofing attacks. To strengthen the security of these systems, this paper proposes a spectrogram-based solution that leverages the robustness of spectrograms for audio analysis and feature extraction. The proposed model consists of two main components: a frontend and a backend. The frontend introduces a novel time-frequency representation, the MelCochleaGram (MCG), obtained by sequentially fusing the Mel spectrogram and the cochleagram. In the backend, pre-trained deep learning models, including ResNet50, ResNet50V2, and InceptionV3, are implemented in the Keras framework; each is individually paired with the MCG to detect deepfake and replay attacks. To validate the effectiveness of the proposed system, thorough experiments are conducted on two datasets: the DEepfake CROss-lingual (DECRO) evaluation dataset and the Voice Spoofing Detection Corpus (VSDC). The combination of MCG with ResNet50 achieves Equal Error Rates (EERs) of 0.2% and 1.2% for deepfake detection on the English and Chinese subsets of DECRO, respectively, and an EER of 1.4% for replay attack detection on VSDC.
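The abstract does not give construction details for the MCG, so the following is only a minimal sketch of one plausible frontend: it assumes the cochleagram is computed with an ERB-spaced gammatone filterbank, and that the "sequential fusion" stacks the two log-scaled maps as image channels. All helper names (`erb_space`, `gammatone_ir`, `cochleagram`, `melcochleagram`) and parameter values are illustrative, not the authors'.

```python
import numpy as np
import librosa
from scipy.signal import fftconvolve

def erb_space(fmin, fmax, n_bands):
    """Centre frequencies spaced evenly on the ERB-rate scale (Glasberg & Moore)."""
    erb_rate = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    rates = np.linspace(erb_rate(fmin), erb_rate(fmax), n_bands)
    return (10.0 ** (rates / 21.4) - 1.0) / 0.00437

def gammatone_ir(fc, sr, duration=0.064, order=4):
    """Impulse response of a 4th-order gammatone filter centred at fc Hz."""
    t = np.arange(0.0, duration, 1.0 / sr)
    bw = 1.019 * (24.7 + fc / 9.265)  # ERB bandwidth of the band at fc
    return t ** (order - 1) * np.exp(-2 * np.pi * bw * t) * np.cos(2 * np.pi * fc * t)

def cochleagram(y, sr, n_bands=64, fmin=50.0, frame_len=512, hop=256):
    """Frame-level RMS energies of a gammatone filterbank output (cochleagram)."""
    fcs = erb_space(fmin, 0.9 * sr / 2, n_bands)
    n_frames = 1 + (len(y) - frame_len) // hop
    cg = np.empty((n_bands, n_frames))
    for i, fc in enumerate(fcs):
        band = fftconvolve(y, gammatone_ir(fc, sr), mode="same")
        for j in range(n_frames):
            seg = band[j * hop : j * hop + frame_len]
            cg[i, j] = np.sqrt(np.mean(seg ** 2))
    return cg

def melcochleagram(y, sr, n_bands=64, hop=256):
    """Fuse Mel spectrogram and cochleagram into a two-channel map (assumed fusion)."""
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bands, hop_length=hop),
        ref=np.max)
    cg = librosa.amplitude_to_db(cochleagram(y, sr, n_bands=n_bands, hop=hop), ref=np.max)
    n_frames = min(mel.shape[1], cg.shape[1])  # the two front ends frame slightly differently
    return np.stack([mel[:, :n_frames], cg[:, :n_frames]], axis=-1)
```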
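For the backend, the abstract names ResNet50, ResNet50V2, and InceptionV3 in Keras but not the classifier head. A minimal sketch, assuming an ImageNet-initialised ResNet50 truncated before its top layer and followed by a single sigmoid unit for the binary bona fide vs. spoof decision:

```python
import tensorflow as tf
from tensorflow.keras.applications import ResNet50

def build_detector(input_shape=(64, 256, 3)):
    """ResNet50 feature extractor with a binary bona fide / spoof head (assumed head)."""
    base = ResNet50(weights="imagenet", include_top=False, input_shape=input_shape)
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# ImageNet weights expect 3-channel input; one simple (assumed) way to feed the
# two-channel MCG is to append the mean of its two maps as a third channel:
# mcg3 = np.concatenate([mcg, mcg.mean(axis=-1, keepdims=True)], axis=-1)
```

Swapping in `ResNet50V2` or `InceptionV3` only changes the import, with the caveat that Keras's InceptionV3 requires inputs of at least 75×75.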
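The reported metric, Equal Error Rate, is the operating point at which the false-acceptance and false-rejection rates coincide. The abstract does not specify the scoring protocol; a common way to estimate EER from detection scores is via the ROC curve, as in this sketch:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: threshold where false-accept rate equals false-reject rate."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))  # closest point to FAR == FRR
    return 0.5 * (fpr[i] + fnr[i])

# Hypothetical usage with a trained Keras detector:
# eer = equal_error_rate(y_true, model.predict(x_eval).ravel())
```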